PDClient and RegionStoreClient slow down massively under heavy workload. #639

Closed
iosmanthus opened this issue Jul 29, 2022 · 0 comments · Fixed by #638
Labels
type/bug Something isn't working

iosmanthus commented Jul 29, 2022

Bug Report

1. Describe the bug

Before a request is sent to the TiKV servers, the client creates a RegionStoreClient for it by asking the RegionManager for the region info.

public synchronized RegionStoreClient build(
    ByteString key, TiStoreType storeType, BackOffer backOffer) throws GrpcException {
  Pair<TiRegion, TiStore> pair =
      regionManager.getRegionStorePairByKey(key, storeType, backOffer);
  return build(pair.first, pair.second, storeType);
}

However, this method is declared synchronized, so every request has to take the same lock before it can be sent, which might block the entire client. It's safe to remove the synchronized keyword since the underlying code of RegionManager is already synchronized.
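To illustrate why the outer lock is redundant, here is a minimal sketch, not the actual client code. SketchRegionCache and SketchClientBuilder are hypothetical stand-ins for RegionManager and the builder, and the sketch assumes the lookup is thread-safe on its own (here via ConcurrentHashMap). With that assumption, an unsynchronized build() still returns consistent results under 32 concurrent callers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class UnsynchronizedBuildDemo {
  // Runs `threads` concurrent build() calls and returns how many produced the expected result.
  static int runConcurrentBuilds(int threads, SketchRegionCache cache) {
    SketchClientBuilder builder = new SketchClientBuilder(cache);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger ok = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      final String key = "key-" + (i % 4); // four distinct keys shared by all threads
      pool.submit(() -> {
        try {
          if (builder.build(key).equals("client->store-for-" + key)) {
            ok.incrementAndGet();
          }
        } finally {
          done.countDown();
        }
      });
    }
    try {
      done.await(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    pool.shutdown();
    return ok.get();
  }

  public static void main(String[] args) {
    SketchRegionCache cache = new SketchRegionCache();
    System.out.println("successful builds: " + runConcurrentBuilds(32, cache));
    System.out.println("underlying lookups: " + cache.lookups.get());
  }
}

// Hypothetical stand-in for RegionManager: assumed to be thread-safe by itself.
class SketchRegionCache {
  private final ConcurrentHashMap<String, String> storeByKey = new ConcurrentHashMap<>();
  final AtomicInteger lookups = new AtomicInteger();

  String getStoreByKey(String key) {
    // computeIfAbsent is atomic per key, so callers need no outer lock.
    return storeByKey.computeIfAbsent(key, k -> {
      lookups.incrementAndGet();
      return "store-for-" + k;
    });
  }
}

// Hypothetical stand-in for the builder: build() is deliberately NOT synchronized.
class SketchClientBuilder {
  private final SketchRegionCache cache;

  SketchClientBuilder(SketchRegionCache cache) {
    this.cache = cache;
  }

  String build(String key) {
    return "client->" + cache.getStoreByKey(key);
  }
}
```

The design point is that the lock belongs where the shared state lives. Once the cache guards its own map, a second lock around build() only serializes callers.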

Another code path that suffers from synchronized is:

public synchronized void updateLeaderOrForwardFollower(BackOffer backOffer) {

This code is executed when there is a request error or response error while interacting with PD servers. Under heavy retry, this code path might be slowed down by the lock. It is also safe to remove the synchronized keyword here, since the client is not required to switch to the latest PD leader immediately. We might use AtomicReference to wrap and update the pdClientWrapper. Even if we get a stale leader in this code path, there is a thread that constantly updates the PD leader every 10 seconds.
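A rough sketch of the AtomicReference idea follows. It is an assumption about how the update could look, not the change made in the fix: LeaderSnapshot is a hypothetical immutable stand-in for pdClientWrapper, and trySwitchLeader is a made-up name. A thread that loses the compare-and-set race simply keeps the value the winner installed, which is acceptable because a stale leader gets corrected by the periodic update.

```java
import java.util.concurrent.atomic.AtomicReference;

class LeaderHolderDemo {
  // Hypothetical immutable snapshot standing in for pdClientWrapper.
  static final class LeaderSnapshot {
    final String leaderAddr;

    LeaderSnapshot(String leaderAddr) {
      this.leaderAddr = leaderAddr;
    }
  }

  private final AtomicReference<LeaderSnapshot> current;

  LeaderHolderDemo(String initialLeader) {
    this.current = new AtomicReference<>(new LeaderSnapshot(initialLeader));
  }

  // Readers never block: they just read the current snapshot.
  String leader() {
    return current.get().leaderAddr;
  }

  // Lock-free update. Returns true only if this call installed the new leader.
  boolean trySwitchLeader(String newAddr) {
    LeaderSnapshot seen = current.get();
    if (seen.leaderAddr.equals(newAddr)) {
      return false; // already pointing at this leader
    }
    // If another thread replaced the snapshot first, the CAS fails and we keep its value.
    return current.compareAndSet(seen, new LeaderSnapshot(newAddr));
  }

  public static void main(String[] args) {
    LeaderHolderDemo holder = new LeaderHolderDemo("pd-0:2379");
    System.out.println(holder.trySwitchLeader("pd-1:2379"));
    System.out.println(holder.leader());
  }
}
```

The addresses in main are placeholders. The snapshot is immutable so that readers always see a fully constructed value without taking a lock.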

2. Minimal reproduce step (Required)

  1. Launch a workload with 32 threads.
  2. Kill all the PD servers and TiKV servers.
  3. You will find two types of slow logs, from threads that are blocked in the two code paths above.
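The blocking can also be imitated without a cluster. The sketch below is only a simulation under stated assumptions: slowBuild is a hypothetical method that holds the object lock while sleeping, standing in for a lookup that backs off because the servers are down. It records how many threads are inside the method at once, which stays at one, so every other worker is queued on the lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class LockConvoyDemo {
  private final AtomicInteger inside = new AtomicInteger();
  private final AtomicInteger maxInside = new AtomicInteger();

  // Simulates a synchronized build() whose lookup backs off while holding the lock.
  synchronized void slowBuild(long backoffMs) {
    int now = inside.incrementAndGet();
    maxInside.accumulateAndGet(now, Math::max); // remember the peak concurrency
    try {
      Thread.sleep(backoffMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      inside.decrementAndGet();
    }
  }

  int maxConcurrent() {
    return maxInside.get();
  }

  // Runs `threads` workers once each and returns the elapsed wall-clock time in milliseconds.
  static long run(LockConvoyDemo demo, int threads, long backoffMs) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long start = System.nanoTime();
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> demo.slowBuild(backoffMs));
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    LockConvoyDemo demo = new LockConvoyDemo();
    long elapsedMs = run(demo, 32, 20);
    System.out.println("elapsed ms: " + elapsedMs);
    System.out.println("max threads inside slowBuild: " + demo.maxConcurrent());
  }
}
```

Because the calls run one after another, the total time grows with the number of threads times the backoff, which matches the kind of slow logs described in step 3.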

3. What are your Java Client and TiKV versions? (Required)

  • Client Java: master
  • TiKV: any version