-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement new tablet repair and balance framework #336
Conversation
I think TabletScanner and TabletFactory are not suitable names. Scanner is used in query to scan tablet data. Also I think TabletFactory is a factory used to create tablet. |
If tablets to be repaired are at different priorities, how to handle it. For example, Tablet A loss two of three replicas, Tablet B loss one of three replicas; Tablet A should be at high priority when to repair. |
Factory class is not suitable for implementing procedure t�o RECOVER and REBLANCE, but only for creating. |
continue; | ||
} | ||
|
||
OlapTable olapTbl = (OlapTable) table; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MysqlTable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table.needScan() blocks any tables which type is not OLAP
I think TabletChecker is more precise to describe this class than TabletScanner. |
private static int BATCH_NUM = 10; // handle at most BATCH_NUM tablets in one loop | ||
|
||
private static int SLOT_NUM_PER_PATH = 1; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant should be final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
private static final Logger LOG = LogManager.getLogger(TabletFactory.class); | ||
|
||
private PriorityBlockingQueue<TabletInfo> q = new PriorityBlockingQueue<>();; | ||
private Set<Long> runningTabletIds = Sets.newHashSet();; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable name should't be single letter in most conditions, to describle the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
updateClusterLoadStatisticsIfNecessary(); | ||
|
||
handleQ(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabletQ
@@ -0,0 +1,655 @@ | |||
package org.apache.doris.clone; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no license header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
d6fbea8
to
b24cb26
Compare
|
||
// db id -> (tbl id -> partition id) | ||
// priority of replicas of partitions in this table will be set to VERY_HIGH if not healthy | ||
private com.google.common.collect.Table<Long, Long, Set<Long>> prios = HashBasedTable.create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the abbreviation 'prios' is not a good name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want make the name shorter..
8476d0c
to
c11be48
Compare
return; | ||
} | ||
|
||
scanTablets(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception need catching in runOneCycle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we only catch expected exception, any unexpected exception is caught in Deamon
checkTablets(schedulerHasTask); | ||
stat.counterTabletCheckRound.incrementAndGet(); | ||
|
||
LOG.info(stat.incrementalBrief()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception of checkTablets
need catching in runOneCycle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we only catch expected exception, any unexpected exception is caught in Deamon
43d6464
to
58ff8eb
Compare
ea61361
to
7be863d
Compare
15c63da
to
82b450e
Compare
82b450e
to
41d2c53
Compare
d3857fa
to
2dc81da
Compare
abbb4ed
to
0c96717
Compare
be/src/agent/task_worker_pool.cpp
Outdated
uint64_t total_file_size = 0; | ||
MonotonicStopWatch watch; | ||
watch.start(); | ||
for (auto file_name : file_name_list) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto& to avoid a string copy
Doris 副本修复和均衡策略
名词解释
TabletChecker
检查规则
TC 根据以下规则进行检查:
存活副本数小于期望副本数
存活副本数大于等于期望副本数,但其中健康副本数小于期望副本数
健康副本数大于等于期望副本数,但在对应 cluster 内的副本数小于期望副本数
健康副本都在对应 cluster 内,但数量大于期望副本数
健康分片,即条件[1-4]都不满足。
优先级
TC 在检查 tablet 状态的同时,也会对非 HEALTHY 状态的 tablet 分配一个初始优先级。该优先级决定了在 TS 中的处理优先级。
VERY_HIGH
HIGH
NORMAL
LOW
检查调度
默认每 20 秒进行一次全量检查。(TabletChecker.CHECK_INTERVAL_MS)
在每次检查前,如果发现 TS 中正在调度或排队的 Tablet 数量超过 5000(TabletChecker.MAX_SCHEDULING_TABLETS),则放弃本轮检查。
每轮检查会将已经修复完成的分区,从优先修复队列中移除。
TabletSchedCtx
TabletSchedCtx 包含了一个 Tablet 在 TS 处理过程中,所有的资源占用和中间结果。当 Tablet 从 TS 中移除时,会调用 releaseResource() 释放所有占用的资源(比如 Slot)。
同时,TSC 好包括一些副本选择和优先级调整的逻辑,我们会在 TabletScheduler 中介绍。
TabletScheduler
TabletScheduler 也是一个后台常驻线程,默认每五秒(SCHEDULE_INTERVAL_MS)运行一次。TS 有两个主要工作,一是处理由 TC 发来的 tablet,尝试修复他们。二是进行集群均衡。
副本修复
由 TC 发来的 Tablet,封装为 TabletSchedCtx,加入到一个优先级队列中(pendingTablets)。TS 每次会从队列中取出最多 10 个(MIN_BATCH_NUM),对每一个 tablet,进行如下操作:
PathSlot
为了保证在执行副本修复或均衡过程中,不会导致某些机器因分配任务太多而被打满,我们为 BE 上的每块盘指定了固定个数的 Slot。对于每块盘(对应一个 BE 上的 root path),用于副本修复的 slot 数量为 1 (Config.schedule_slot_num_per_path),而用于均衡的 slot 数量为 2(BALANCE_SLOT_NUM_FOR_PATH)。
副本恢复和均衡,都是对一个副本,从源端拷贝到目的端的操作。这个操作,会对源端和目的端各占用一个 slot。如果在调度时拿不到 slot,则会抛出 SCHEDULE_FAILED 异常。
各种恢复操作
REPLICA_MISSING:
handleReplicaMissing()
尽量选择一个负载低的 path 作为目的端,同时选择一个可用副本作为源端,启动一个 CloneTask。
VERSION_INCOMPLETE:
handleReplicaMissing()
选择一个版本缺失的副本作为目的端。优先选择:
last success version > last failed version
或更小的 last failed version
或last failed version 大于 0
,同时选择一个可用副本作为源端,启动一个 CloneTask。REDUNDANT:
handleRedundantReplica()
选择一个多余的副本,删除对应的元数据,按以下顺序,直到选择一个:
REPLICA_MISSING_IN_CLUSTER:
handleReplicaClusterMigration()
基本同 REPLICA_MISSING,只是在选择目的端时,需要选择对应的 cluster 的BE。
优先级调整
我们使用一个优先级队列 (pendingTablets) 来存放待调度的 Tablet。当调度失败时,Tablet 会被重新放回优先级队列。如果不调整优先级,则可能导致重复调度一个一直会失败的 tablet,而导致其他 tablet 得不到调度。
TS 每隔 1min,会检查一遍优先级队列中 tablet 的优先级,并按需调整(TabletSchedCtx.adjustPriority())。
超时和任务失败
对于在 pendingTablets 中等待被调度的 tablet,没有超时设置,如果调度不成功,则会一直尝试调度,或者因为 tablet 已经恢复而终止调度。
对于 runningTablets 中等待 Clone 任务完成的 tablet,设有一个任务超时时间。当任务超时后,会移除这个 tablet。任务超时时间根据副本的大小预估(TabletSchedCtx.getApproximateTimeoutMs()),在 3min(MIN_CLONE_TASK_TIMEOUT_MS) 到 120min(MAX_CLONE_TASK_TIMEOUT_MS) 之间,按 5MB/s(MIN_CLONE_SPEED_MB_PER_SECOND) 的速度,预估超时时间。
如果cloneTask失败次数超过3次(TabletSchedCtx.RUNNING_FAILED_COUNTER_THRESHOLD),则该任务会被移除。防止一个一直失败的任务占用资源。
集群均衡
TS 在每轮调度的同时,也会尝试进行集群均衡的操作。均衡操作使用独立的 slot,防止副本恢复操作因集群不均衡而无法进行,而均衡操作又因没有 slot 而无法触发。
ClusterLoadStatistics
ClusterLoadStatistics(CLS)用于表示一个 cluster 中各个 Backend 的负载均衡情况(BackendLoadStatistic)。TS 根据这个统计值,来触发集群均衡。我们当前通过 磁盘使用率 和 副本数量 两个指标,为每个BE计算一个 loadScore,作为 BE 的负载分数。分数越高,表示该 BE 的负载越重。
磁盘使用率和副本数量各有一个权重系数,分别为 capacityCoefficient 和 replicaNumCoefficient,其 和衡为1。其中 capacityCoefficient 会根据实际磁盘使用率动态调整。当一个 BE 的总体磁盘使用率在 50% 以下,则 capacityCoefficient 值为 0.5,如果磁盘使用率在 75%(Config.capacity_used_percent_high_water)以上,则值为 1。如果使用率介于 50% ~ 75% 之间,则改权重系数平滑增加,公式为:
capacityCoefficient= 2 * 磁盘使用率 - 0.5
该权重系数保证当磁盘使用率过高时,该 Backend 的负载分数会更高,以保证尽快均衡这个 BE。
可以通过
show proc "/cluster_balance/cluster_load_stat";
查看整个集群的负载分数。TS 会每隔 1 分钟(STAT_UPDATE_INTERVAL_MS)更新一次 CLS。
均衡策略
我们在 LoadBalancer 类中处理均衡任务。TS 每轮调度,会通过 LoadBalancer 选择一些合适的 Tablet,封装为 BALANCE 类型的 TabletSchedCtx,放入 pendingTablets。
将 BE 根据 cluster 的平均 loadScore,换分为 high,mid,low 三挡。如果高于平均 loadScore 10% (Config.balance_load_score_threshold),则标记为 high,如果低于平均 10%,则标记为 low,否则,标记位 mid。如果所有 BE 都为 mid,则我们认为集群是均衡的。
同样,我们会将每个 BE 上的 path,根据使用容量,也划分为 high,mid,low 三挡。
我们根据以下规则选择 tablet(注意这里只是选择 tablet,而不确定具体的源端或目的端副本,这些操作交由 TS 在调度时完成。)
对于选择出来的 tablet,我们随机打乱后,封装为 BALANCE 类型的 TabletSchedCtx,优先级为 LOW,放入 pendingTablets。
同时我们会检查,处于 low 档的BE 是否都不可用。如果都不可用,则我们也不会触发均衡,因为那会导致均衡任务无法成功调度。
均衡调度
当 TS 调度到 BALANCE 类型的 tablet 时,会进行以下操作:
isMoreBalanced()
检查这次均衡,是否会使集群更均衡(所有 BE 的 loadScore 与平均 loadScore 的差值的和变小)。附录
TabletSchedulerStat
一个简单的统计类,用于统计 scheduler 过程中的各种情况计数。可以通过
show proc "/cluster_balance/sched_stat";
查看,同时,Master FE 日志一会定期打印改统计值的周期变化情况。副本调度与 Alter 流程
Alter 流程会设置 table 的状态,从而阻止 TC 或 TS 处理非 NORMAL 状态 table 的 tablet。这个规则用于简化两个两种交叉导致的各种问题。在实际使用中,建议用户通过 手动优先修复 先恢复集群,再进行 Alter 作业。
持久化
所有和调度相关的信息都不会持久化,包括处于 CLONE 状态的副本。如果 MASTER FE 宕机或者切主,则当前的所有调度任务丢失。