Purpose
NebulaGraph's official Nebula Dashboard already offers a rich set of monitoring metrics and filtering granularity. Even so, we built a thin monitoring layer of our own on top of it, for the following reasons:
- Key metrics lack definitions, which makes them unfriendly to anyone who is not a DBA. We want developers to be able to read the performance data too.
- There is no dashboard that compares all servers side by side in one centralized view.
- It lives on a different platform from our existing DBA monitoring, forcing us to switch between platforms.
- There is no performance scoring: for a given metric, there is no standard or hint telling you whether its value is acceptable.
- The official build already exposes an endpoint, http://%s/stats, that serves the performance data directly, so building monitoring on top of it is cheap. We estimated the work at 1 person-day; it actually took 1.5.
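As a reference point, the /stats endpoint mentioned above can be queried directly. A minimal sketch, assuming the payload is a JSON array of single-key objects (as the collector below iterates over it); the host:port is a placeholder graphd HTTP address, not taken from the post:

```python
# Minimal sketch of querying NebulaGraph's built-in stats endpoint.
# "127.0.0.1:19669" is a placeholder graphd HTTP address.
import json
import urllib.request

def build_stats_url(host_port):
    """Build the JSON stats URL for a graphd/storaged HTTP port."""
    return 'http://%s/stats?format=json' % (host_port,)

def fetch_stats(host_port, timeout=5):
    """Fetch and parse the stats payload (a list of {metric: value} dicts)."""
    with urllib.request.urlopen(build_stats_url(host_port), timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))
```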
1. Metric selection
The official stats endpoint exposes several hundred metrics. From the graph, storage, and RocksDB layers we selected about 60 key metrics worth watching, using the following criteria:
- Can the metric pinpoint a resource or performance problem (response time, etc.)?
- Does it help locate anomalies when troubleshooting (command counts, QPS, etc.)?
- Does a similar metric already exist?
In the end, we picked the metrics listed below.
2. Writing the collection code (Python)
Since the official endpoint exists, collection is just a matter of requesting it:
```python
# Iterate over all graphd nodes and collect their stats
for nodeid, nodehost in graphlist:
    dict_result = self.get_nebula_stats(nodehost)
    # persist dict_result to the database

def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)      # HTTP GET, returns parsed JSON
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:           # keep only whitelisted metrics
                keyname = dict_graph[k][0]
                dict_result[keyname] = v
    return dict_result
```
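The collector relies on a `get_urldata` helper and the class context, which are not shown in the post. A standalone, testable version of the same filtering step, with the HTTP fetch factored out and a two-entry `dict_graph` for illustration, might look like:

```python
# Standalone sketch of the metric-filtering step. dict_graph maps a raw
# metric name to (short_name, thresholds); only whitelisted keys are kept.
dict_graph = {
    "num_queries.rate.60": ("num_queries_rate", (-4, 100, 500, 10000, 10000000)),
    "num_slow_queries.sum.60": ("num_slow_queries", (-4, 50, 500, 5000, 100000000)),
}

def filter_stats(mlist, mapping=dict_graph):
    """Keep only whitelisted metrics, renamed to their short names.

    mlist is the parsed /stats payload: a list of {metric: value} dicts.
    """
    dict_result = {}
    for dictc in mlist:
        for k, v in dictc.items():
            if k in mapping:
                dict_result[mapping[k][0]] = v
    return dict_result
```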
3. Performance scoring
This is the crucial step. It reuses the database performance-model approach I built in 2005 (see: https://github.com/51ak/DatabaseRating/):
- For each metric, judge its value range and grade it excellent, good, fair, or poor. This grade is recorded as weight.
- For each metric, record its importance on a 0-5 scale, recorded as height.
- From weight and height, compute a health score for the instance.
- Since we run only a single Nebula cluster, we did not need that level of detail, so we applied only the weight grading.
The weight grading works by defining an interval list like the following:
```python
dict_graph={
    "num_active_queries.sum.60":("num_active_queries",(-4,5,10,50,1000000)),
    "num_active_sessions.sum.60":("num_active_sessions",(-4,5,10,50,1000000)),
    "num_opened_sessions.rate.60":("num_opened_sessions",(-4,100,500,10000,10000000)),
    "num_queries.rate.60":("num_queries_rate",(-4,100,500,10000,10000000)),
    "num_queries.sum.60":("num_queries_sum",(-4,5000,50000,900000,100000000)),
    "num_sentences.rate.60":("num_sentences_rate",(-4,100,500,10000,10000000)),
    "num_sentences.sum.60":("num_sentences_sum",(-4,5000,50000,900000,100000000)),
    "query_latency_us.p95.60":("query_latency_us_p95",(-4,30000,100000,500000,10000000000)),
    "query_latency_us.avg.60":("query_latency_us_avg",(-4,30000,100000,500000,10000000000)),
    "num_queries_hit_memory_watermark.sum.60":("num_queries_hit_memory_watermark",(-4,100,5000,90000,100000000)),
    "num_query_errors.rate.60":("num_query_errors",(-4,100,5000,90000,100000000)),
    "num_slow_queries.sum.60":("num_slow_queries",(-4,50,500,5000,100000000))
}
```
Each value is checked against its intervals: a value in the first interval is graded excellent, and so on. The collector therefore becomes:
```python
def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)      # HTTP GET, returns parsed JSON
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:
                keyname = dict_graph[k][0]
                keyvalueweightlist = dict_graph[k][1]
                if v < 0:                 # clamp negative readings to zero
                    v = 0
                value = float(v)
                # grade the value against its interval list
                valueweight = self.func_checkweight(value, keyvalueweightlist)
                dict_result[keyname] = v
                dict_result[f"{keyname}_weight"] = valueweight
    return dict_result
```
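`func_checkweight` itself is not shown in the post. A plausible implementation of the interval check described above, assuming grades are encoded as 1 (excellent) through 4 (poor):

```python
def func_checkweight(value, thresholds):
    """Grade a value against an interval list like (-4, 100, 500, 10000, 10000000).

    Returns 1 if value falls in the first interval (thresholds[0], thresholds[1]],
    2 for the second interval, and so on; values above the last bound get the
    worst grade. Grade encoding (1=excellent .. 4=poor) is an assumption here.
    """
    for i in range(1, len(thresholds)):
        if value <= thresholds[i]:
            return i
    return len(thresholds) - 1  # above the last bound: worst grade
```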
4. Configuring a report for centralized display
We added a report configuration to our dboop platform.
The configuration:
<page>
<!-- Custom table header; optional (defaults to the column names). Use HTML here if you need custom widths or grouped headers. -->
<tableheader><![CDATA[
<tr style="height:25px;">
<th colspan="1" style="border-bottom:0"> </th>
<th colspan="4" style="border-bottom:0"><h3>Host</h3></th>
<th colspan="4" style="border-bottom:0"><h3>Network</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Active</h3></th>
<th colspan="3" style="border-bottom:0"><h3>Per second</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Latency</h3></th>
<th colspan="4" style="border-bottom:0"><h3>1-minute metrics</h3></th>
</tr>
<tr style="height:25px;">
<th><h3>Graph service</h3></th>
<th><h3>CPU</h3></th>
<th><h3>Memory</h3></th>
<th><h3>IOWait</h3></th>
<th><h3>Disk space</h3></th>
<th><h3>Net in</h3></th>
<th><h3>Net out</h3></th>
<th><h3>TcpWait</h3></th>
<th><h3>TCP conns</h3></th>
<th><h3>Queries</h3></th>
<th><h3>Sessions</h3></th>
<th><h3>QPS</h3></th>
<th><h3>Statements/s</h3></th>
<th><h3>Opened sessions</h3></th>
<th><h3>P95 (ms)</h3></th>
<th><h3>Avg (ms)</h3></th>
<th><h3>Queries</h3></th>
<th><h3>Memory watermark</h3></th>
<th><h3>Errors</h3></th>
<th><h3>Slow queries</h3></th>
</tr>
]]>
</tableheader>
<viewtype>table</viewtype>
<connstr>link:cnpg3rzn2x********byenoa</connstr>
<sqlstr><![CDATA[
select
case when left(h.hosttype,1)='物' then
concat(n.ipstr,' <span class="f_1">物</span>')
else concat(n.ipstr,' <span class="f_2">虚</span>')
end as 'IP',
format_dash(`cpu`,`cpu_weight`,'d203qtapmhv1ywbojc579lkrgx','cpu','host',a.hostid,'percent') as 'cpu',
format_dash(`mempercent`,`mempercent_weight`,'d203qtapmhv1ywbojc579lkrgx','mempercent','host',a.hostid,'percent') as 'mempercent',
format_dash(`iowait`,`iowait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','iowait','host',a.hostid,'') as 'iowait',
case when `diskused_data` is null or `diskused_data`<0 then
format_dash(`diskused_root`,`diskused_root_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_root','host',a.hostid,'percent')
else
format_dash(`diskused_data`,`diskused_data_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_data','host',a.hostid,'percent')
end as 'diskused_data',
format_dash(`network_in`,`network_in_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_in','host',a.hostid,'bytes') as 'network_in',
format_dash(`network_out`,`network_out_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_out','host',a.hostid,'bytes') as 'network_out',
format_dash(`tcp_wait`,`tcp_wait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_wait','host',a.hostid,'floor') as 'tcp_wait',
format_dash(`tcp_count`,`tcp_count_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_count','host',a.hostid,'floor') as 'tcp_count',
format_dash(`num_active_queries`,`num_active_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_queries','graph',b.graphid,'floor') as 'num_active_queries',
format_dash(`num_active_sessions`,`num_active_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_sessions','graph',b.graphid,'floor') as 'num_active_sessions',
format_dash(`num_queries_rate`,`num_queries_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_rate','graph',b.graphid,'floor') as 'num_queries_rate',
format_dash(`num_sentences_rate`,`num_sentences_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_sentences_rate','graph',b.graphid,'floor') as 'num_sentences_rate',
format_dash(`num_opened_sessions`,`num_opened_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_opened_sessions','graph',b.graphid,'floor') as 'num_opened_sessions',
format_dash(`query_latency_us_p95`/1000,`query_latency_us_p95_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_p95','graph',b.graphid,'floor') as 'query_latency_us_p95',
format_dash(`query_latency_us_avg`/1000,`query_latency_us_avg_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_avg','graph',b.graphid,'floor') as 'query_latency_us_avg',
format_dash(`num_queries_sum`,`num_queries_sum_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_sum','graph',b.graphid,'floor') as 'num_queries_sum',
format_dash(`num_queries_hit_memory_watermark`,`num_queries_hit_memory_watermark_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_hit_memory_watermark','graph',b.graphid,'floor') as 'num_queries_hit_memory_watermark',
format_dash(`num_query_errors`,`num_query_errors_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_query_errors','graph',b.graphid,'floor') as 'num_query_errors',
format_dash(`num_slow_queries`,`num_slow_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_slow_queries','graph',b.graphid,'floor') as 'num_slow_queries'
from dboop.linknode n
join dboop.linkhost h on n.hostid=h.hostid
left join wc_host a on a.hostid=n.hostid and a._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
left join wc_nebula_graph b on n.instanceid=b.graphid and b._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
where h.usetype='nebula' and h.ontype=9 and n.clustertype like '%g%'
;
]]>
</sqlstr>
<title2>Nebula Graph Performance</title2>
<width>1250px</width>
</page>
The final result
We ended up with a dashboard that is:
- centralized in the DBA platform, with a unified look and unified management;
- able to display the key metrics of all Nebula servers in real time;
- able to compare against history (the green dashed line on a metric's detail page is a forecast computed from historical data);
- built on performance metrics that are already labeled and graded.
What it cost us:
- 1.25 person-days of work (mostly spent on metric selection and grading criteria);
- storage space for the performance data (kept in MySQL here).