
gRPC initial window is too large that may cause OOM #2673

Closed
overvenus opened this issue Aug 30, 2021 · 5 comments · Fixed by #2699
Labels: area/ticdc, component/kv-client, severity/major, type/bug

Comments

@overvenus (Member)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. What did you do? If possible, provide a recipe for reproducing the error.

Capture 10k tables with ~90 GB of incremental data using a single CDC node.

heap profile: profile.pb.gz

gRPC client flamegraph (images omitted)
2. What did you expect to see?

No OOM.

3. What did you see instead?

The gRPC client consumes too much memory and causes OOM.

4. Versions of the cluster

v5.1.0

overvenus added the type/bug, component/kv-client, and severity/major labels on Aug 30, 2021
@overvenus (Member, Author)

This issue relates to https://github.com/pingcap/ticdc/issues/2553

@amyangfei (Contributor) commented Aug 30, 2021

Data accumulated in the kv client; it may be caused by low throughput of gRPC message processing. Is there any CPU profile dump from the test?

@overvenus (Member, Author)

Data accumulated in the kv client; it may be caused by low throughput of gRPC message processing.

Yes, the data accumulation is caused by unbalanced producing and consuming speeds. In this case we have a slow sorter (I/O bottleneck); the kv client is much faster than the sorter.

Also, after changing the initial window size to 64 KB and the initial connection window size to 8 MB, the OOM disappeared.

-	grpcInitialWindowSize     = 1 << 26 // 64 MB The value for initial window size on a stream
-	grpcInitialConnWindowSize = 1 << 27 // 128 MB The value for initial window size on a connection
+	grpcInitialWindowSize     = 65535 // 64 KB The value for initial window size on a stream
+	grpcInitialConnWindowSize = 1 << 23 // 8 MB The value for initial window size on a connection
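For reference, a minimal sketch of how such constants are typically wired in through grpc-go dial options (dialTiKV and the package name are hypothetical; this illustrates the standard grpc-go API, not TiCDC's exact dialing code):

package kvclient // hypothetical package name, for illustration

import "google.golang.org/grpc"

const (
	grpcInitialWindowSize     = 65535   // 64 KB, per-stream flow-control window
	grpcInitialConnWindowSize = 1 << 23 // 8 MB, per-connection flow-control window
)

// dialTiKV opens a client connection with the reduced flow-control windows.
// The windows bound how much unread data a peer may have in flight, so
// shrinking them caps receive-side buffering at the cost of peak throughput.
func dialTiKV(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // plaintext, for illustration only
		grpc.WithInitialWindowSize(grpcInitialWindowSize),
		grpc.WithInitialConnWindowSize(grpcInitialConnWindowSize),
	)
}

In HTTP/2 flow control these windows are also an upper bound on unacknowledged data per stream and per connection, which is why lowering them directly caps kv client memory.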

@overvenus (Member, Author)

Is there any CPU profile dump from the test?

profile.pb.gz

@lonng (Contributor) commented Aug 30, 2021

Also, after changing the initial window size to 64 KB and the initial connection window size to 8 MB, the OOM disappeared.

Is there a performance issue if we lower those values?
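As a rough rule of thumb (using the bandwidth-delay-product view of HTTP/2 flow control and assuming an intra-cluster RTT of about 1 ms, both of which are assumptions, not measurements from this issue):

	per-stream throughput ≤ window / RTT = 64 KB / 1 ms ≈ 64 MB/s
	per-connection throughput ≤ 8 MB / 1 ms ≈ 8 GB/s

So on low-latency links the smaller windows should rarely be the bottleneck, whereas the old 64 MB / 128 MB windows let each stream and connection buffer that much unread data, which multiplied across many TiKV streams is where the memory headroom went.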
