Purpose
NebulaGraph's official Nebula Dashboard already offers a rich set of monitoring metrics and filtering granularity. Even so, we built a thin monitoring layer of our own on top of it, for the following reasons:
- Key metrics lack definitions, which makes them unfriendly to anyone who is not a DBA. We want developers to be able to read the performance data too.
- There is no dashboard that compares all servers side by side in one centralized view.
- It lives on a different platform from our existing DBA monitoring, forcing us to switch between platforms.
- There is no performance scoring: for a given metric, there is no standard or hint telling you whether its value is acceptable.
- The official build already exposes an endpoint, http://%s/stats, that serves the performance data directly, so building monitoring on top of it is cheap. We estimated the work at 1 person-day; it actually took 1.5.
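As a reference point, the /stats endpoint mentioned above can be queried directly. A minimal sketch, assuming the payload is a JSON array of single-key objects (as the collector below iterates over it); the host:port is a placeholder graphd HTTP address, not taken from the post:

```python
# Minimal sketch of querying NebulaGraph's built-in stats endpoint.
# "127.0.0.1:19669" is a placeholder graphd HTTP address.
import json
import urllib.request

def build_stats_url(host_port):
    """Build the JSON stats URL for a graphd/storaged HTTP port."""
    return 'http://%s/stats?format=json' % (host_port,)

def fetch_stats(host_port, timeout=5):
    """Fetch and parse the stats payload (a list of {metric: value} dicts)."""
    with urllib.request.urlopen(build_stats_url(host_port), timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))
```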
1. Metric selection
The official stats endpoint exposes several hundred metrics. From the graph, storage, and RocksDB layers we selected about 60 key metrics worth watching, using the following criteria:
- Can the metric pinpoint a resource or performance problem (response time, etc.)?
- Does it help locate anomalies when troubleshooting (command counts, QPS, etc.)?
- Does a similar metric already exist?
In the end, we picked the metrics listed below.
2. Writing the collection code (Python)
Since the official endpoint exists, collection is just a matter of requesting it:
```python
# Iterate over all graphd nodes and collect their stats
for nodeid, nodehost in graphlist:
    dict_result = self.get_nebula_stats(nodehost)
    # persist dict_result to the database

def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)      # HTTP GET, returns parsed JSON
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:           # keep only whitelisted metrics
                keyname = dict_graph[k][0]
                dict_result[keyname] = v
    return dict_result
```
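The collector relies on a `get_urldata` helper and the class context, which are not shown in the post. A standalone, testable version of the same filtering step, with the HTTP fetch factored out and a two-entry `dict_graph` for illustration, might look like:

```python
# Standalone sketch of the metric-filtering step. dict_graph maps a raw
# metric name to (short_name, thresholds); only whitelisted keys are kept.
dict_graph = {
    "num_queries.rate.60": ("num_queries_rate", (-4, 100, 500, 10000, 10000000)),
    "num_slow_queries.sum.60": ("num_slow_queries", (-4, 50, 500, 5000, 100000000)),
}

def filter_stats(mlist, mapping=dict_graph):
    """Keep only whitelisted metrics, renamed to their short names.

    mlist is the parsed /stats payload: a list of {metric: value} dicts.
    """
    dict_result = {}
    for dictc in mlist:
        for k, v in dictc.items():
            if k in mapping:
                dict_result[mapping[k][0]] = v
    return dict_result
```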
3. Performance scoring
This is the crucial step. It reuses the database performance-model approach I built in 2005 (see: https://github.com/51ak/DatabaseRating/):
- For each metric, judge its value range and grade it excellent, good, fair, or poor. This grade is recorded as weight.
- For each metric, record its importance on a 0-5 scale, recorded as height.
- From weight and height, compute a health score for the instance.
- Since we run only a single Nebula cluster, we did not need that level of detail, so we applied only the weight grading.
The weight grading works by defining an interval list like the following:
```python
dict_graph={
    "num_active_queries.sum.60":("num_active_queries",(-4,5,10,50,1000000)),
    "num_active_sessions.sum.60":("num_active_sessions",(-4,5,10,50,1000000)),
    "num_opened_sessions.rate.60":("num_opened_sessions",(-4,100,500,10000,10000000)),
    "num_queries.rate.60":("num_queries_rate",(-4,100,500,10000,10000000)),
    "num_queries.sum.60":("num_queries_sum",(-4,5000,50000,900000,100000000)),
    "num_sentences.rate.60":("num_sentences_rate",(-4,100,500,10000,10000000)),
    "num_sentences.sum.60":("num_sentences_sum",(-4,5000,50000,900000,100000000)),
    "query_latency_us.p95.60":("query_latency_us_p95",(-4,30000,100000,500000,10000000000)),
    "query_latency_us.avg.60":("query_latency_us_avg",(-4,30000,100000,500000,10000000000)),
    "num_queries_hit_memory_watermark.sum.60":("num_queries_hit_memory_watermark",(-4,100,5000,90000,100000000)),
    "num_query_errors.rate.60":("num_query_errors",(-4,100,5000,90000,100000000)),
    "num_slow_queries.sum.60":("num_slow_queries",(-4,50,500,5000,100000000))
}
```
Each value is checked against its intervals: a value in the first interval is graded excellent, and so on. The collector therefore becomes:
```python
def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)      # HTTP GET, returns parsed JSON
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:
                keyname = dict_graph[k][0]
                keyvalueweightlist = dict_graph[k][1]
                if v < 0:                 # clamp negative readings to zero
                    v = 0
                value = float(v)
                # grade the value against its interval list
                valueweight = self.func_checkweight(value, keyvalueweightlist)
                dict_result[keyname] = v
                dict_result[f"{keyname}_weight"] = valueweight
    return dict_result
```
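`func_checkweight` itself is not shown in the post. A plausible implementation of the interval check described above, assuming grades are encoded as 1 (excellent) through 4 (poor):

```python
def func_checkweight(value, thresholds):
    """Grade a value against an interval list like (-4, 100, 500, 10000, 10000000).

    Returns 1 if value falls in the first interval (thresholds[0], thresholds[1]],
    2 for the second interval, and so on; values above the last bound get the
    worst grade. Grade encoding (1=excellent .. 4=poor) is an assumption here.
    """
    for i in range(1, len(thresholds)):
        if value <= thresholds[i]:
            return i
    return len(thresholds) - 1  # above the last bound: worst grade
```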
4. Configuring a report for centralized display
We added a report configuration to our dboop platform.
The configuration:
<page>
<!-- Custom table header; optional (defaults to the column names). Use HTML here if you need custom widths or grouped headers. -->
<tableheader><![CDATA[
<tr style="height:25px;">
<th colspan="1" style="border-bottom:0"> </th>
<th colspan="4" style="border-bottom:0"><h3>Host</h3></th>
<th colspan="4" style="border-bottom:0"><h3>Network</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Active</h3></th>
<th colspan="3" style="border-bottom:0"><h3>Per second</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Latency</h3></th>
<th colspan="4" style="border-bottom:0"><h3>1-minute metrics</h3></th>
</tr>
<tr style="height:25px;">
<th><h3>Graph service</h3></th>
<th><h3>CPU</h3></th>
<th><h3>Memory</h3></th>
<th><h3>IOWait</h3></th>
<th><h3>Disk space</h3></th>
<th><h3>Net in</h3></th>
<th><h3>Net out</h3></th>
<th><h3>TcpWait</h3></th>
<th><h3>TCP conns</h3></th>
<th><h3>Queries</h3></th>
<th><h3>Sessions</h3></th>
<th><h3>QPS</h3></th>
<th><h3>Statements/s</h3></th>
<th><h3>Opened sessions</h3></th>
<th><h3>P95 (ms)</h3></th>
<th><h3>Avg (ms)</h3></th>
<th><h3>Queries</h3></th>
<th><h3>Memory watermark</h3></th>
<th><h3>Errors</h3></th>
<th><h3>Slow queries</h3></th>
</tr>
]]>
</tableheader>
<viewtype>table</viewtype>
<connstr>link:cnpg3rzn2x********byenoa</connstr>
<sqlstr><![CDATA[
select
case when left(h.hosttype,1)='物' then
concat(n.ipstr,' <span class="f_1">物</span>')
else concat(n.ipstr,' <span class="f_2">虚</span>')
end as 'IP',
format_dash(`cpu`,`cpu_weight`,'d203qtapmhv1ywbojc579lkrgx','cpu','host',a.hostid,'percent') as 'cpu',
format_dash(`mempercent`,`mempercent_weight`,'d203qtapmhv1ywbojc579lkrgx','mempercent','host',a.hostid,'percent') as 'mempercent',
format_dash(`iowait`,`iowait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','iowait','host',a.hostid,'') as 'iowait',
case when `diskused_data` is null or `diskused_data`<0 then
format_dash(`diskused_root`,`diskused_root_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_root','host',a.hostid,'percent')
else
format_dash(`diskused_data`,`diskused_data_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_data','host',a.hostid,'percent')
end as 'diskused_data',
format_dash(`network_in`,`network_in_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_in','host',a.hostid,'bytes') as 'network_in',
format_dash(`network_out`,`network_out_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_out','host',a.hostid,'bytes') as 'network_out',
format_dash(`tcp_wait`,`tcp_wait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_wait','host',a.hostid,'floor') as 'tcp_wait',
format_dash(`tcp_count`,`tcp_count_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_count','host',a.hostid,'floor') as 'tcp_count',
format_dash(`num_active_queries`,`num_active_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_queries','graph',b.graphid,'floor') as 'num_active_queries',
format_dash(`num_active_sessions`,`num_active_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_sessions','graph',b.graphid,'floor') as 'num_active_sessions',
format_dash(`num_queries_rate`,`num_queries_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_rate','graph',b.graphid,'floor') as 'num_queries_rate',
format_dash(`num_sentences_rate`,`num_sentences_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_sentences_rate','graph',b.graphid,'floor') as 'num_sentences_rate',
format_dash(`num_opened_sessions`,`num_opened_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_opened_sessions','graph',b.graphid,'floor') as 'num_opened_sessions',
format_dash(`query_latency_us_p95`/1000,`query_latency_us_p95_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_p95','graph',b.graphid,'floor') as 'query_latency_us_p95',
format_dash(`query_latency_us_avg`/1000,`query_latency_us_avg_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_avg','graph',b.graphid,'floor') as 'query_latency_us_avg',
format_dash(`num_queries_sum`,`num_queries_sum_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_sum','graph',b.graphid,'floor') as 'num_queries_sum',
format_dash(`num_queries_hit_memory_watermark`,`num_queries_hit_memory_watermark_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_hit_memory_watermark','graph',b.graphid,'floor') as 'num_queries_hit_memory_watermark',
format_dash(`num_query_errors`,`num_query_errors_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_query_errors','graph',b.graphid,'floor') as 'num_query_errors',
format_dash(`num_slow_queries`,`num_slow_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_slow_queries','graph',b.graphid,'floor') as 'num_slow_queries'
from dboop.linknode n
join dboop.linkhost h on n.hostid=h.hostid
left join wc_host a on a.hostid=n.hostid and a._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
left join wc_nebula_graph b on n.instanceid=b.graphid and b._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
where h.usetype='nebula' and h.ontype=9 and n.clustertype like '%g%'
;
]]>
</sqlstr>
<title2>Nebula Graph Performance</title2>
<width>1250px</width>
</page>
The final result
We ended up with a dashboard that is:
- centralized in the DBA platform, with a unified look and unified management;
- able to display the key metrics of all Nebula servers in real time;
- able to compare against history (the green dashed line on a metric's detail page is a forecast computed from historical data);
- built on performance metrics that are already labeled and graded.
What it cost us:
- 1.25 person-days of work (mostly spent on metric selection and grading criteria);
- storage space for the performance data (kept in MySQL here).