Problem with the canal_instance_traffic_delay metric #4454

Open
2 tasks
yourse007 opened this issue Oct 14, 2022 · 3 comments

Comments

@yourse007

  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have checked the FAQ of this repository and believe that this is not a duplicate.

environment

  • canal version 1.1.4, 1.1.5
  • mysql version 5.7

Issue Description

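canal_instance_traffic_delay is computed as currentTimestamp - localExecTime.
localExecTime is updated only when a valid binlog event or a heartbeat is received.
A valid binlog event is one that remains after filtering by filter.regex.

In some scenarios this metric keeps rising, which gives the false impression that the data is delayed.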
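
The behaviour described above can be modelled in a few lines. This is only a sketch under stated assumptions, not canal's actual code: the class `TrafficDelaySketch` and its methods are hypothetical, timestamps are passed in explicitly so the result is deterministic, and a plain `java.util.regex.Pattern` stands in for filter.regex.

```java
import java.util.regex.Pattern;

// Hypothetical model of the metric as described in this issue; not canal's real classes.
class TrafficDelaySketch {
    private final Pattern filter;   // stands in for filter.regex
    private long localExecTime;     // last execute time accepted by the instance

    TrafficDelaySketch(String filterRegex, long startMillis) {
        this.filter = Pattern.compile(filterRegex);
        this.localExecTime = startMillis;
    }

    // Called for every binlog event; only events that pass the filter refresh localExecTime.
    void onBinlog(String schemaAndTable, long executeTimeMillis) {
        if (filter.matcher(schemaAndTable).matches()) {
            localExecTime = executeTimeMillis;
        }
    }

    // Called when a heartbeat from the master arrives.
    void onHeartbeat(long executeTimeMillis) {
        localExecTime = executeTimeMillis;
    }

    // canal_instance_traffic_delay = currentTimestamp - localExecTime
    long delayMillis(long currentTimestamp) {
        return currentTimestamp - localExecTime;
    }
}
```

With a filter of `A\\..*`, feeding only events such as `B.orders` leaves localExecTime unchanged, so `delayMillis` grows with the clock even though every event was received on time.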

Steps to reproduce

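Scenario:
The MySQL instance has schema A and schema B. filter.regex is configured for schema A only, but only schema B receives business traffic.
Symptom:
The MySQL master keeps sending binlog events for schema B, but the instance filters all of them out, so localExecTime is not updated.
Because the binlog stream is not idle, the MySQL master does not send heartbeat events either, so localExecTime is never updated.
As a result, canal_instance_traffic_delay keeps rising, even though there is actually no delay between the canalInstance and the MySQL master.

In addition, the heartBeat entry built in AbstractEventParser#buildHeartBeatTimeTask has no effect: it is dropped directly in the sink stage and is not used to update localExecTime either.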

Expected behaviour

In the scenario above there is no delay between the canalInstance and the master, so in principle canal_instance_traffic_delay should not keep rising.

Actual behaviour

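Proposed solution

Two approaches:

  1. Use binlog.executeTime from before filtering to update localExecTime.
  2. Periodically build a heartBeat entry with eventType=MHEARTBEAT in the MysqlDetectingTimeTask mechanism, to simulate the effect of a MySQL master heartbeat.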
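
The two approaches listed above could look roughly like the following. This is a minimal sketch, assuming a simplified event model: `ExecTimeFixSketch`, `onBinlog` and `onSyntheticHeartbeat` are hypothetical names introduced for illustration and do not correspond to canal's real parser or sink classes.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the two proposed fixes; all names are illustrative.
class ExecTimeFixSketch {
    private final Pattern filter;   // stands in for filter.regex
    private long localExecTime;
    private int deliveredEvents;    // events that passed the filter

    ExecTimeFixSketch(String filterRegex, long startMillis) {
        this.filter = Pattern.compile(filterRegex);
        this.localExecTime = startMillis;
    }

    // Approach 1: record the execute time before the filter decides whether to keep the event.
    void onBinlog(String schemaAndTable, long executeTimeMillis) {
        localExecTime = executeTimeMillis;
        if (filter.matcher(schemaAndTable).matches()) {
            deliveredEvents++;
        }
    }

    // Approach 2: a periodic detection task calls this with a synthetic heartbeat timestamp.
    // Math.max keeps the timestamp from moving backwards if a real event arrived later.
    void onSyntheticHeartbeat(long heartbeatTimeMillis) {
        localExecTime = Math.max(localExecTime, heartbeatTimeMillis);
    }

    long delayMillis(long currentTimestamp) {
        return currentTimestamp - localExecTime;
    }

    int deliveredEvents() {
        return deliveredEvents;
    }
}
```

In this sketch an event for schema B still refreshes the timestamp under approach 1 while being excluded from delivery, so the reported delay reflects the link to the master rather than the filter outcome.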

@jackila

jackila commented Nov 16, 2022

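There is another way to handle this: I optimized how EntryCollector collects the metric. When it reads latestExecTime, it also reads latestInterval.
If now - latestExecTime > MASTER_HEARTBEAT_PERIOD_SECONDS * 1000, use now - latestExecTime.
If now - latestExecTime < MASTER_HEARTBEAT_PERIOD_SECONDS * 1000, use latestInterval.

This takes two things into account: the problem you describe above, and the inaccuracy of the current approach. When there is no data, the reported delay can grow to as much as MASTER_HEARTBEAT_PERIOD_SECONDS.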
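
The selection rule in this comment can be sketched as a single function. Several things here are assumptions: the class `DelayEstimatorSketch` is hypothetical, the value 15 for MASTER_HEARTBEAT_PERIOD_SECONDS is used only for illustration, latestInterval is taken to mean the delay observed when the latest event was received, and the equality case (which the comment leaves open) is treated like the "less than" branch.

```java
// Hypothetical sketch of the rule from the comment above; not the actual EntryCollector code.
class DelayEstimatorSketch {
    // Assumed value, for illustration only.
    static final long MASTER_HEARTBEAT_PERIOD_SECONDS = 15;

    static long delayMillis(long now, long latestExecTime, long latestInterval) {
        long sinceLastEvent = now - latestExecTime;
        if (sinceLastEvent > MASTER_HEARTBEAT_PERIOD_SECONDS * 1000) {
            // Nothing arrived for longer than one heartbeat period: report the real gap.
            return sinceLastEvent;
        }
        // Within one heartbeat period: report the delay measured at the latest event.
        return latestInterval;
    }
}
```

The design choice is that a quiet but healthy link keeps reporting the last measured delay instead of a value that climbs toward the heartbeat period.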

@agapple
Member

agapple commented Nov 16, 2022

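First, please confirm whether this is a problem with an older version. As I recall, this was fixed before; see #2616.

The current mechanism: by default, after filtering, the transaction begin and commit events in the binlog are let through according to a policy, for example every 5 seconds or every 8192 empty events. These events trigger the advance of the cursor position and the update of the delay status.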
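
A pass-through policy of this kind might be sketched as follows. This is not canal's implementation: `EmptyTransactionPolicySketch` and `shouldPass` are hypothetical names, and the interval and threshold are constructor parameters so that the 5 seconds and 8192 events mentioned above are just one possible configuration.

```java
// Hypothetical sketch of a "let an empty transaction through now and then" policy.
class EmptyTransactionPolicySketch {
    private final long intervalMillis;   // e.g. 5000
    private final int eventThreshold;    // e.g. 8192
    private long lastPassMillis;
    private int skippedSinceLastPass;

    EmptyTransactionPolicySketch(long intervalMillis, int eventThreshold, long startMillis) {
        this.intervalMillis = intervalMillis;
        this.eventThreshold = eventThreshold;
        this.lastPassMillis = startMillis;
    }

    // Called for each empty transaction (begin/commit with every row filtered out).
    // Returns true when the event should be forwarded so that the cursor position
    // and the delay status can be refreshed.
    boolean shouldPass(long nowMillis) {
        skippedSinceLastPass++;
        boolean intervalElapsed = nowMillis - lastPassMillis >= intervalMillis;
        boolean thresholdReached = skippedSinceLastPass >= eventThreshold;
        if (intervalElapsed || thresholdReached) {
            lastPassMillis = nowMillis;
            skippedSinceLastPass = 0;
            return true;
        }
        return false;
    }
}
```

Note that such a policy only helps while the master is still producing transactions; if no events arrive at all, nothing calls `shouldPass` and the timestamp is not refreshed.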

@jackila

jackila commented Nov 17, 2022

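The mechanism for filtering empty transaction headers works for the general case. But if a database stays in a hung state for a long time (as in my local test), wouldn't the problem described in this issue still occur?

That said, I think this is just an edge case.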
