This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

feat(dup): add metrics for duplication #393

Merged: 14 commits, Feb 19, 2020

neverchanje commented Feb 12, 2020

This PR introduces several metrics for duplication:

perf_counter_wrapper _counter_dup_log_read_bytes_rate;
perf_counter_wrapper _counter_dup_log_read_mutations_rate;
perf_counter_wrapper _counter_dup_shipped_bytes_rate;
perf_counter_wrapper _counter_dup_confirmed_rate;
perf_counter_wrapper _counter_dup_pending_mutations_count;
perf_counter_wrapper _counter_dup_time_lag;

log_read_bytes_rate

name: replica*eon.replica_stub*dup.log_read_bytes_rate

The rate, in bytes, at which duplication reads from the private log.

The curve is usually identical to that of replica*eon.replica_stub*shared.log.recent.write.size, because when everything is normal, what is written is what is duplicated. Then:

log_read_bytes_rate = shared.log.recent.write.size = shipped_bytes_rate

Under some failure conditions, however, log_read_bytes_rate may be much larger, which makes it useful for detecting abnormal log reading during duplication.
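
The relationship above suggests a simple health check: compare the read rate against the write rate and flag a large gap. The following is only a minimal sketch, assuming the rates have already been sampled from the monitoring system. The struct, the function name and the tolerance factor of 2.0 are hypothetical and not part of this PR.

```cpp
#include <cassert>

// Hypothetical snapshot of the three values compared in the description,
// all sampled at the same instant and expressed in the same unit.
struct dup_rates
{
    double log_read_bytes_rate; // dup.log_read_bytes_rate
    double log_write_bytes;     // shared.log.recent.write.size
    double shipped_bytes_rate;  // dup.shipped_bytes_rate
};

// Returns true when log reading is far above what is being written.
// `tolerance` is an assumed tuning knob, not a value taken from this PR.
inline bool log_read_looks_abnormal(const dup_rates &r, double tolerance = 2.0)
{
    return r.log_read_bytes_rate > tolerance * r.log_write_bytes;
}
```

In the normal case all three values are roughly equal and the check stays quiet; it only fires when reading outpaces writing by more than the chosen factor.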

log_read_mutations_rate

name: eon.replica_stub dup.log_read_mutations_rate

The read rate measured in number of mutations. It is used the same way as "log_read_bytes_rate".

shipped_bytes_rate

name: eon.replica_stub dup.shipped_bytes_rate

The rate of outbound network bytes for successfully delivered duplication_request messages.

Under some failure conditions the curve may drop to 0, for example when the inter-cluster network is unavailable.

confirmed_rate

  • eon.replica_stub dup.confirmed_rate

The rate of confirmed writes, that is, the number of writes that have been duplicated and also confirmed by the meta server.

pending_mutations_count

  • eon.replica_stub dup.pending_mutations_count

The number of writes that have not yet been duplicated. This is one of the most important metrics for duplication: the more writes are pending, the weaker the consistency. In practice, it is recommended to set an alarm threshold for this metric. Beyond the threshold, the duplication should
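
An alarm on this metric could be sketched as below. This is an illustration only: the class name, the "N consecutive samples" rule and the example threshold are assumptions made here to avoid firing on a single spike, not something this PR implements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical alarm: fires once pending_mutations_count has stayed above
// `threshold` for `required_consecutive` samples in a row.
class pending_alarm
{
public:
    pending_alarm(int64_t threshold, int required_consecutive)
        : _threshold(threshold), _required(required_consecutive)
    {
    }

    // Feed one sample of dup.pending_mutations_count.
    // Returns true when the alarm should fire.
    bool observe(int64_t pending_mutations_count)
    {
        // Any sample at or below the threshold resets the streak.
        _streak = pending_mutations_count > _threshold ? _streak + 1 : 0;
        return _streak >= _required;
    }

private:
    int64_t _threshold;
    int _required;
    int _streak = 0;
};
```

For example, `pending_alarm alarm(1000, 2)` stays silent on the first sample above 1000 and fires on the second consecutive one.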

time_lag(ms)

  • eon.replica_stub dup.time_lag(ms)

The "latency" between (1) the time the client write arrives at the replica server and (2) the time the write has been duplicated and applied to the remote cluster.

t0 -> t1 -> t2
client -> replica server -> remote cluster
time_lag = t2-t1
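
The definition above amounts to a single subtraction. A minimal sketch follows, assuming both timestamps are available in milliseconds; the function name is hypothetical, and clamping to zero under clock skew is an assumption of this sketch rather than behavior described in the PR.

```cpp
#include <cassert>
#include <cstdint>

// t1_ms: time the write arrived at the replica server, in milliseconds.
// t2_ms: time the write was applied on the remote cluster, in milliseconds.
inline int64_t dup_time_lag_ms(int64_t t1_ms, int64_t t2_ms)
{
    // Assumption: if the two clocks are skewed so that t2 < t1, report 0
    // instead of a negative lag.
    return t2_ms > t1_ms ? t2_ms - t1_ms : 0;
}
```

Note that t0, the time the client issued the write, does not enter the computation.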

neverchanje changed the title from "feat(dup): add metrics for duplication" to "[WIP] feat(dup): add metrics for duplication" on Feb 14, 2020

src/dist/replication/common/replication_common.cpp (outdated, resolved)
@@ -273,9 +273,6 @@ void replication_options::initialize()

duplication_disabled = dsn_config_get_value_bool(
"replication", "duplication_disabled", duplication_disabled, "is duplication disabled");
if (allow_non_idempotent_write && !duplication_disabled) {
Member

Why was this constraint removed?

Contributor Author

We are dropping this constraint for now. For one thing, allow_non_idempotent_write is enabled by default in our production clusters. For another, requirements on tables with duplication (hot backup) enabled can be imposed on users at onboarding time, so they do not have to be hard-coded in the program.

Contributor

So can a table with duplication enabled currently perform non-idempotent writes at the same time?

Contributor Author

Not at the moment. If we do forbid it, we probably will not rely on configuration to forbid non-idempotent writes, since within one cluster some tables may have duplication enabled while others do not.

Contributor

OK. So what measures currently ensure that tables with duplication enabled do not perform non-idempotent operations? If this is left to users with no restriction in our code, it still feels somewhat unsafe.

Contributor Author

None yet. The preliminary idea is to change this on the pegasus side: on INCR or CHECK_AND_SET, write an empty write and then return an error. That is not implemented yet. HBase does support duplicating INCR by converting INCR to PUT during replication, but that flow is hard to implement on the pegasus side.

Contributor

OK, please note it as a TODO. It is best to enforce this in code; relying on users' self-discipline is too unsafe.

hycdong merged commit 7a46628 into XiaoMi:master Feb 19, 2020
neverchanje added the type/perf-counter label Mar 12, 2020
neverchanje deleted the dup-metrics branch March 19, 2020 06:41
neverchanje pushed a commit that referenced this pull request Mar 31, 2020
Labels: 1.12.3, component/duplication, type/perf-counter (PR that made modification on perf-counter, which should be noted in release note)