Scheduling

Q: Why is the count of leaders/Regions not evenly distributed?

A: The rating mechanism of PD means that the leader count and Region count of different stores cannot fully reflect the load balancing status. Therefore, you need to confirm whether there is an actual load imbalance from the TiKV load or the storage usage.

Once you have confirmed that leaders/Regions are not evenly distributed, you need to check the scores of different stores.
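
For example, a minimal pd-ctl sketch for inspecting those scores (the PD address is a placeholder, and field names may differ slightly between versions):

    # List all stores; the JSON output includes leader_count/region_count,
    # leader_score/region_score, and the configured weights.
    pd-ctl -u http://127.0.0.1:2379 store

    # Inspect a single store by ID (store 1 is only an example).
    pd-ctl -u http://127.0.0.1:2379 store 1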

If the scores of different stores are close, it means PD mistakenly believes that leaders/Regions are evenly distributed. Possible reasons are:

  • There are hot Regions that cause load imbalance. In this case, you need to analyze further based on hot Regions scheduling.
  • There are a large number of empty Regions or small Regions, which leads to a great difference in the number of leaders across stores and high pressure on the Raft store. In this case, Region merge scheduling is needed.
  • Hardware and software environments vary among stores. You can adjust the values of leader-weight and region-weight accordingly to control the distribution of leaders/Regions.
  • Other unknown reasons. In this case, you can still adjust the values of leader-weight and region-weight to control the distribution of leaders/Regions.
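
As a rough sketch for the last two cases (the store ID and weights below are placeholders; pick values that reflect the actual hardware differences):

    # Give store 1 a higher leader weight and a lower Region weight, so PD
    # places relatively more leaders but fewer Region replicas on it.
    pd-ctl -u http://127.0.0.1:2379 store weight 1 2 0.5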

If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations:

  • When operators are generated normally but the scheduling process is slow, it is possible that:

    • The scheduling speed is limited by default for load balancing purposes. You can adjust leader-schedule-limit or region-schedule-limit to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified by max-pending-peer-count and max-snapshot-count (see the pd-ctl sketch after this list).
    • Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop the other tasks or limit their speeds. For example, if you take some nodes offline while balancing is in progress, both operations consume the quota of region-schedule-limit. In this case, you can limit the speed of the scheduler that removes nodes, or simply set disable-replace-offline-replica = true to temporarily disable it.
    • The scheduling process is too slow. You can check the Operator step duration metric to confirm the cause. Generally, steps that do not involve sending and receiving snapshots (such as TransferLeader, RemovePeer, and PromoteLearner) should be completed in milliseconds, while steps that involve snapshots (such as AddLearner and AddPeer) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or a network bottleneck, which requires case-by-case analysis.
  • PD fails to generate the corresponding balancing scheduler. Possible reasons include:

    • The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit is set to "0".
    • Other constraints. For example, an evict-leader-scheduler in the system prevents leaders from being migrated to the corresponding store, or the label property is set so that some stores reject leaders.
    • Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally.
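
A hedged pd-ctl sketch of the checks and adjustments mentioned above (all values and the PD address are illustrative; check the current values with config show before changing anything):

    # See which schedulers are running and what the current limits are.
    pd-ctl -u http://127.0.0.1:2379 scheduler show
    pd-ctl -u http://127.0.0.1:2379 config show

    # Loosen the scheduling limits if operators are generated but executed slowly.
    pd-ctl -u http://127.0.0.1:2379 config set leader-schedule-limit 8
    pd-ctl -u http://127.0.0.1:2379 config set region-schedule-limit 4096
    pd-ctl -u http://127.0.0.1:2379 config set max-pending-peer-count 64
    pd-ctl -u http://127.0.0.1:2379 config set max-snapshot-count 16

    # If taking nodes offline competes with balancing, this switch (whose
    # availability depends on the PD version) temporarily disables the
    # replace-offline-replica scheduling mentioned above.
    pd-ctl -u http://127.0.0.1:2379 config set disable-replace-offline-replica true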

Q: Why is taking nodes offline slow?

A: This scenario requires examining the generation and execution of operators through related metrics.

If operators are successfully generated but the scheduling process is slow, possible reasons are:

  • The scheduling speed is limited by default. You can adjust leader-schedule-limit or replica-schedule-limit to a larger value. Similarly, you can consider loosening the limits on max-pending-peer-count and max-snapshot-count.
  • Other scheduling tasks are running concurrently and competing for resources in the system. You can refer to the solution in Why is the count of leaders/Regions not evenly distributed? above.
  • When you take a single node offline, a number of Region leaders to be processed (around 1/3 under the configuration of 3 replicas) are located on the node being removed. Therefore, the speed is limited by how fast this single node generates snapshots. You can speed it up by manually adding an evict-leader-scheduler to migrate the leaders away, as sketched below.
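
A minimal sketch of this workaround (store 1 below stands for the node being taken offline; adjust the ID and values to your cluster):

    # Evict leaders from the node being removed so that snapshot generation
    # is no longer bottlenecked on that single node.
    pd-ctl -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 1

    # Optionally raise the replica scheduling limit while the node drains.
    pd-ctl -u http://127.0.0.1:2379 config set replica-schedule-limit 64

    # Remove the scheduler once the node is fully offline.
    pd-ctl -u http://127.0.0.1:2379 scheduler remove evict-leader-scheduler-1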

If the corresponding operator fails to generate, possible reasons are:

  • The operator is stopped, or replica-schedule-limit is set to "0".
  • There is no proper node for Region migration. For example, if the available capacity of the replacement node (with the same label) is less than 20%, PD stops scheduling to it to avoid running out of storage on that node. In such a case, you need to add more nodes or delete some data to free up space.
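
A short pd-ctl sketch for these checks (the address is a placeholder):

    # Confirm that replica scheduling has not been turned off.
    pd-ctl -u http://127.0.0.1:2379 config show | grep replica-schedule-limit

    # Check the remaining capacity of the candidate stores; PD stops scheduling
    # Regions to a store whose available space is too low.
    pd-ctl -u http://127.0.0.1:2379 store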

Q: Why is bringing nodes online slow?

A: Currently, bringing nodes online is scheduled through the balance Region mechanism. You can refer to Why is the count of leaders/Regions not evenly distributed? above for troubleshooting.

Q: Why are hot Regions not evenly distributed?

A: Hot Regions scheduling issues generally fall into the following categories:

  • Hot Regions can be observed via PD metrics, but the scheduling speed cannot keep up, so hot Regions are not redistributed in time.

    Solution: adjust hot-region-schedule-limit to a larger value, and reduce the limit quota of other schedulers to speed up hot Regions scheduling. You can also adjust hot-region-cache-hits-threshold to a smaller value to make PD more sensitive to traffic changes (see the sketch after this list).

  • Hotspot formed on a single Region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a split-region operator to split such a Region.

  • The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some Regions, it might not be obvious to detect in traffic, but the high QPS might lead to bottlenecks in key modules.

    Solutions: First, locate the table where hot Regions are formed based on the specific business. Then add a scatter-range scheduler to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to TiDB HTTP API for more details.
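
A hedged pd-ctl sketch covering the three cases above (Region IDs, key ranges, and addresses are placeholders):

    # Case 1: speed up hot-Region scheduling and make PD more sensitive to
    # traffic changes (values are illustrative).
    pd-ctl -u http://127.0.0.1:2379 config set hot-region-schedule-limit 8
    pd-ctl -u http://127.0.0.1:2379 config set hot-region-cache-hits-threshold 2

    # Case 2: manually split a single hot Region (2 is an example Region ID).
    pd-ctl -u http://127.0.0.1:2379 operator add split-region 2 --policy=approximate

    # Case 3: scatter all Regions of one table; the key range and name below
    # are placeholders for the table's actual range.
    pd-ctl -u http://127.0.0.1:2379 scheduler add scatter-range <start-key> <end-key> <range-name>

    # TiDB's HTTP API can add the same scheduler for a table in one call:
    # curl http://<tidb-address>:10080/tables/<db>/<table>/scatter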

Q: Why is Region merge slow?

A: Similar to slow scheduling, the speed of Region merge is most likely limited by the configurations of merge-schedule-limit and region-schedule-limit, or the Region merge scheduler is competing with other schedulers. Specifically, the solutions are:

  • If it is known from metrics that there are a large number of empty Regions in the system, you can adjust max-merge-region-size and max-merge-region-keys to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the Region to be merged, the faster the merge. If the merge operators are already generated rapidly, you can further speed up the process by setting patrol-region-interval to 10ms, which makes Region scanning faster at the cost of more CPU consumption (see the pd-ctl sketch at the end of this answer).

  • A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters:

    • TiKV: Set split-region-on-table to false. You cannot modify the parameter dynamically.

    • PD

      • Set key-type to "txn" or "raw". You can modify the parameter dynamically.
      • Keep key-type as table and set enable-cross-table-merge to true. You can modify the parameter dynamically.

      Note:

      After placement rules are enabled, properly switch the value of key-type between txn and raw to avoid decoding failures.

For v3.0.4 and v2.1.16 or earlier, the approximate_keys of Regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the key count break the constraint of max-merge-region-keys. To avoid this problem, you can adjust max-merge-region-keys to a larger value.
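
A hedged pd-ctl sketch of the merge-related adjustments above (values are illustrative; split-region-on-table belongs to the TiKV configuration file and cannot be changed through pd-ctl):

    # Merge thresholds and merge speed; as described above, smaller thresholds
    # make each merge cheaper because less data needs to be migrated.
    pd-ctl -u http://127.0.0.1:2379 config set max-merge-region-size 16
    pd-ctl -u http://127.0.0.1:2379 config set max-merge-region-keys 50000
    pd-ctl -u http://127.0.0.1:2379 config set merge-schedule-limit 16

    # Scan Regions more frequently so merge operators are generated sooner,
    # at the cost of more CPU usage.
    pd-ctl -u http://127.0.0.1:2379 config set patrol-region-interval 10ms

    # Allow empty Regions of different tables to be merged while keeping
    # key-type as "table" (availability depends on the PD version).
    pd-ctl -u http://127.0.0.1:2379 config set enable-cross-table-merge true

    # On the TiKV side, split-region-on-table is set in the TiKV configuration
    # file and cannot be modified dynamically:
    #   [coprocessor]
    #   split-region-on-table = false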