目的
因为我们的图数据库从neo4j社区版转到nebula graph方向。最近在项目压测的时候,一开始很平稳,运行一段时间后,NQL会越来越慢,发现性能巨差。nebula经常卡死,表现为:
- nebula-storaged.service和nebula-graphd.service服务经常挂掉。
- 业务侧反馈执行什么NQL都很慢
- nebula show query里发现有大量(300-500个)running的进程。平时很快的NQL也卡在那
系统资源
- 内存紧张 ,大量的虚拟内存被占用
- io,cpu也较高
在做了一些参数调优后,发现状态有所缓解,但是一段时间后,还是会慢慢卡死,从现象上推测是有一种或几种慢NQL多次执行后,把系统资源消耗完了,导致大面积的堵塞。所以面临的需求还是要有个慢查询排查工具。
参考上一次做oracle慢日志收集展示的方法, https://www.dboop.com/oracle/oracle%E6%80%8E%E6%A0%B7%E5%AE%9E%E6%97%B6%E6%94%B6%E9%9B%86%E5%B1%95%E7%A4%BA%E6%85%A2%E6%9F%A5%E8%AF%A2/
变化的是:
- 因为nebula的集群就一个,不需要做oracle慢日志的大表套小表,数担据量不大就建了一张monitor_nebula_slow表存放数据.
- nebula抓到的慢查询里的NQL是没有去参数化的,需要自己做去参数化,把相同类型的NQL,标识为同一个md5id
1.建一张表,每隔1分钟(时间可调,但我们的场景1分钟足够了)
CREATE TABLE `monitor_nebula_slow` (
`logid` int unsigned NOT NULL AUTO_INCREMENT,
`SessionID` varchar(50) NOT NULL DEFAULT '',
`ExecutionPlanID` varchar(50) NOT NULL DEFAULT '',
`User` varchar(50) NOT NULL DEFAULT '',
`Host` varchar(50) NOT NULL DEFAULT '',
`StartTime` datetime DEFAULT NULL,
`DurationInUSec` int unsigned NOT NULL DEFAULT '0',
`Status` varchar(50) NOT NULL DEFAULT '',
`Query` varchar(5000) NOT NULL DEFAULT '',
`_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`md5id` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`logid`),
KEY `ix_monitor_nebula_slow` (`_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=13231 DEFAULT CHARSET=utf8mb3
2.收集到的信息,做个dashboard页,实时展示出来
源码如下:
<para>
<name>btime</name>
<title>开始时间</title>
<type>datetime</type>
<defaultvalue>getdate()-4h</defaultvalue>
<format>%Y-%m-%d %H:%M:00</format>
</para>
<para>
<name>etime</name>
<title>结束时间</title>
<type>datetime</type>
<defaultvalue>getdate()-1m</defaultvalue>
<format>%Y-%m-%d 23:59:59</format>
</para>
<para>
<name>nql</name>
<title>NQL</title>
<type>string</type>
<defaultvalue></defaultvalue>
</para>
<para>
<title>nodeid</title>
<name>nodeid</name>
<type>select</type>
<item_text>全部,10.0.0.77,10.0.0.78,10.0.0.79</item_text>
<item_value>all,10.0.0.77:9669,10.0.0.78:9669,10.0.0.79:9669</item_value>
<defaultvalue>all</defaultvalue>
</para>
<page>
<viewtype>mutibar</viewtype>
<chart_title>title : {text: 'Nebula实时慢日志(节点)',x:'center',y:20},</chart_title>
<chart_legendstyle2>x: 'right',y: '200',borderWidth: 0,orient:'vertical'</chart_legendstyle2>
<chart_legendstyle>x: 'right',y:0,borderWidth: 0,orient:'vertical'</chart_legendstyle>
<chart_addin>grid:{width:880,y:10},</chart_addin>
<connstr>link:dboop_nebula_slow</connstr>
<sqlstr><![CDATA[
select
date_format(aa.ftime, '%%m-%%d %%H:%%i') as ftime
,ifnull(bb.DBOwner,'') as sqlstr
,ifnull(bb.counts,0) as counts
from dba_report_datetime1 aa
left join (
SELECT date_format(a._timestamp, '%%Y-%%m-%%d %%H:%%i') as ftime,
replace(Host,':9669','') as 'DBOwner',
count(*) as counts
from `monitor_nebula_slow` a
where a._timestamp between %s AND %s
and (a.Query like concat('%%',%s,'%%') or %s='')
and (a.Host=%s or %s='all')
group by ftime, DBOwner
)bb on aa.ftime=bb.ftime
where aa.ftime between %s AND %s and aa.ftime<curtime()
order by ftime,counts desc
]]>
</sqlstr>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<sqlpara>#3</sqlpara> <sqlpara>#3</sqlpara>
<sqlpara>#4</sqlpara> <sqlpara>#4</sqlpara>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<width>1080px</width>
<height>160px</height>
<tableformat>rowconvert 1 15</tableformat>
<chart_group>a</chart_group>
<chart_datazoom>60</chart_datazoom>
</page>
<page>
<viewtype>mutibar</viewtype>
<chart_title>title : {text: 'Nebula实时慢日志(NQL)',x:'center',y:20},</chart_title>
<chart_legendstyle2>x: 'right',y: '200',borderWidth: 0,orient:'vertical'</chart_legendstyle2>
<chart_legendstyle>x: 'right',y:0,borderWidth: 0,orient:'vertical'</chart_legendstyle>
<chart_addin>grid:{width:880,y:10},</chart_addin>
<connstr>link:dboop_nebula_slow</connstr>
<sqlstr><![CDATA[
select
date_format(aa.ftime, '%%m-%%d %%H:%%i') as ftime
,ifnull(bb.DBOwner,'') as sqlstr
,ifnull(bb.counts,0) as counts
from dba_report_datetime1 aa
left join (
SELECT date_format(a._timestamp, '%%Y-%%m-%%d %%H:%%i') as ftime,
left(md5id,15) as 'DBOwner',
count(*) as counts
from `monitor_nebula_slow` a
where a._timestamp between %s AND %s
and (a.Query like concat('%%',%s,'%%') or %s='')
and (a.Host=%s or %s='all')
group by ftime, DBOwner
)bb on aa.ftime=bb.ftime
where aa.ftime between %s AND %s and aa.ftime<curtime()
order by ftime,counts desc
]]>
</sqlstr>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<sqlpara>#3</sqlpara> <sqlpara>#3</sqlpara>
<sqlpara>#4</sqlpara> <sqlpara>#4</sqlpara>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<tableformat>rowconvert 1 15</tableformat>
<width>1080px</width>
<height>360px</height>
<chart_group>a</chart_group>
<chart_datazoom>60</chart_datazoom>
</page>
<page>
<viewtype>table</viewtype>
<connstr>link:dboop_nebula_slow</connstr>
<title2>时间段内最频繁慢NQLTop50</title2>
<sqlstr><![CDATA[
SELECT
concat('<a href="view.html?action=v&report_action=v&report_ptname=dvjtlec9pqgza1782rx4wb0hyk&sqlid=',a.md5id,'" target="_blank">',left(a.md5id,10),'</a>' )
as md5,
concat(max(left(a.Query,120)),'...') as nqlstr,
case when count(*)>10 then concat('<span class="f_red f_bold">',count(*),'</span>') else concat(count(*)) end as '累计次数',
case when sum(DurationInUSec)>1000000000 then concat('<span class="f_red ">',format(sum(DurationInUSec)/1000000,2),'秒</span>') else concat(format(sum(DurationInUSec)/1000000,2),'秒') end as '累计耗时',
case when avg(DurationInUSec)>100000000 then concat('<span class="f_red ">',format(avg(DurationInUSec)/1000000,2),'秒</span>') else concat(format(avg(DurationInUSec)/1000000,2),'秒') end as '平均耗时',
case when max(DurationInUSec)>100000000 then concat('<span class="f_red ">',format(max(DurationInUSec)/1000000,2),'秒</span>') else concat(format(max(DurationInUSec)/1000000,2),'秒') end as '最慢耗时'
from `monitor_nebula_slow` a
where a._timestamp between %s AND %s
and (a.Query like concat('%%',%s,'%%') or %s='')
and (a.Host=%s or %s='all')
group by a.md5id
order by sum(DurationInUSec) desc
limit 50
]]>
</sqlstr>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<sqlpara>#3</sqlpara> <sqlpara>#3</sqlpara>
<sqlpara>#4</sqlpara> <sqlpara>#4</sqlpara>
<width>1140px</width>
</page>
<debug>true</debug>
页面展示效果如下:
3.对于每个慢查询,做个子页面,看到详细的NQL和每次执行用时等详细信息。
子页面源代码:
<page>
<viewtype>table</viewtype>
<tableheader><![CDATA[
<tr style="height:25px;">
<th width="100"><h3>发生时间</h3></th>
<th width="50"><h3>用时</h3></th>
<th><h3>NQL</h3></th>
</tr>
]]>
</tableheader>
<connstr>link:dboop_nebula_slow</connstr>
<sqlstr><![CDATA[
select date_format(StartTime, '%%m-%%d %%H:%%i:%%s') as ftime
,case when max(DurationInUSec)/1000000>5 then
concat('<span class="f_red">',format(max(DurationInUSec)/1000000,2),'</span>秒')
else
concat('<span>',format(max(DurationInUSec)/1000000,2),'</span>秒')
end
as costtime
,
concat('',a.`Query`
,' <br>'
,' <b>SessionID: </b><span class="f_666">',a.SessionID,'</span>'
,' <b>状态:</b><span class="f_red">'
,case
when Status='RUNNING' then '<span class="f_666">RUNNING</span>'
else Status end,'</span>'
,' <b>User:</b><span class="f_666">',a.User,'@',a.Host,'</span>'
,''
) as 'NSQL' from
monitor_nebula_slow a
where _timestamp between %s and %s
and md5id=%s
group by StartTime,NSQL
order by StartTime desc
]]>
</sqlstr>
<sqlpara>#1</sqlpara>
<sqlpara>#2</sqlpara>
<sqlpara>#3</sqlpara>
<width>100%</width>
</page>
<para>
<name>开始时间</name>
<type>datetime</type>
<defaultvalue>getdate()-4h</defaultvalue>
<format>%Y-%m-%d %H:%M:00</format>
</para>
<para>
<name>结束时间</name>
<type>datetime</type>
<defaultvalue>getdate()-1m</defaultvalue>
<format>%Y-%m-%d %H:%M:00</format>
</para>
<para>
<name>sqlid</name>
<type>string</type>
<defaultvalue></defaultvalue>
</para>
子页面效果如下:
4.总结
通过以上的配置,我们快速的制作了一个辅助定位nebula Graph的慢查询问题的工具,可以实时的定位问题点。再和研发同学一起解决问题。
这个工具基于Python3.9+Django+MySQL完成,开发加调试共用时:3小时。
>> Home