High memory usage with long HTTP/2 connections (GRPC) #9891
Comments
Thanks for the really detailed report. Can you potentially reach out to myself and @lizan on Envoy Slack to discuss this further and debug a bit? (mklein and lizan on Slack). Thank you.
Not sure if this will help, but we had an issue with one of our larger, friskier deployments where long lived gRPC bi-di connections were oom'ing Envoy. It specifically happened in cases where the downstreams were sending request data to Envoy faster than upstreams could read it, and vice versa for responses. When we looked at Envoy's memory usage something like 80-90% was just buffers. We had a chat with a member of our gRPC team, @ejona86, who recommended we reduce the initial_stream_window_size and initial_connection_window_size http/2 options to 64KB each and that addressed the issue nicely. That had the effect of sending backpressure to the downstream when a slow upstream cannot keep up and vice versa. It's better to keep that data in the sender rather than try to buffer it up in Envoy.
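For reference, a minimal sketch of what that tuning looks like in Envoy configuration; the cluster name is illustrative and the values are the 64 KiB recommendation from this comment, not something taken from the reporter's setup:

```yaml
# Downstream side: inside the HttpConnectionManager config on the listener.
http2_protocol_options:
  initial_stream_window_size: 65536       # 64 KiB per stream (Envoy's default is much larger)
  initial_connection_window_size: 65536   # 64 KiB per connection

# Upstream side: the same options on each cluster carrying gRPC.
clusters:
  - name: grpc_backend                    # illustrative name
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    http2_protocol_options:
      initial_stream_window_size: 65536
      initial_connection_window_size: 65536
```

With smaller windows Envoy advertises less flow-control credit, so a fast sender gets throttled instead of filling Envoy's buffers.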
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
We have encountered similar problems when testing ratelimit. Would using the "envoy.overload_actions.shrink_heap" option help? Checking with the admin /memory interface shows that most of the memory usage is page_heap_free.
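For anyone who wants to try that, a hedged sketch of an overload manager section enabling the shrink_heap action (the heap limit and threshold below are placeholders, not values from this thread); the action periodically releases freed pages back to the OS, which is roughly the memory reported as page_heap_free:

```yaml
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.fixed_heap
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648   # 2 GiB, placeholder limit
  actions:
    - name: envoy.overload_actions.shrink_heap
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold:
            value: 0.9                    # release free heap back to the OS above 90% of the limit
```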
Title: Envoy consumes a lot of memory with long running GRPC connections.
Description:
In one of our setups Envoy forwards GRPC traffic to a cluster containing four endpoints. Downstream clients open long-lasting HTTP/2 connections which carry the GRPC traffic. After each restart Envoy, starting from reasonably low memory usage, consumed an ever increasing amount of RAM, ending up in the order of several GB. When a connection terminated, memory usage decreased.
As a test we shortened downstream connections from 3 hours to 15 minutes, which caused memory usage to drop significantly and fluctuate in the order of a few hundred MB. In this case memory usage grows until the first connections start to expire, and only then is memory reclaimed.
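As an aside, if such a cap on connection lifetime needs to be enforced by Envoy rather than by the downstream clients, a minimal sketch would be the connection manager's common_http_protocol_options (the 900s value simply mirrors the 15-minute test above; this is not from our actual config):

```yaml
# Inside the HttpConnectionManager config on the listener.
common_http_protocol_options:
  max_connection_duration: 900s   # drain and close downstream connections after 15 minutes
```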
In this scenario there are no more than 60 downstream_cx_active connections. It looks like Envoy buffers (part of?) the traffic for as long as the connection is open.
The GRPC traffic consists of quick request/response calls with no streaming; retries are disabled.
At the same time another instance of Envoy forwarding comparable traffic over HTTP/1 consumes around 80 MB with very small fluctuations, which is how I think the instance described above should behave.
Our Envoys work as load balancers for Kubernetes clusters, with a custom control plane (ADS) and a libopentracing plugin.
Memory usage on envoy with profiling enabled:
Repro steps:
I have not managed to reproduce this outside of our cluster. Using a simple "hello world" GRPC client/server did not expose this behaviour. I have collected memory profiles from our production instance. I can supply more information if needed (allocation stacks, etc.).
As a side note, an older Envoy (a custom build based on a 1.10 release) behaved the same way, which suggests that it has nothing to do with the buffer implementation I saw in the log.
Flame graph of one of the heap files:
Admin and Stats Output: (with apologies for spamming)
Server info:
Shortened /stats output:
Clusters:
Config:
Fragments from /config_dump:
Logs:
Starting logs: