Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak on Interceptor #1078

Open
Leonardo-Ferreira opened this issue Jun 28, 2024 · 17 comments
Open

Memory leak on Interceptor #1078

Leonardo-Ferreira opened this issue Jun 28, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@Leonardo-Ferreira
Copy link

Report

I think we have a memory leak.
Here is the memory usage over 2 days:
image
The memory drops is the pod getting OOM killed. At first we thought 64mb is too little, so we decided to experiment higher values in the last 24hrs, but the symptom persisted:
image

Expected Behavior

the pods should release memory and as traffic fluctuates down and not get OOM killed

Actual Behavior

Memory usage never decreases

Steps to Reproduce the Problem

  1. Install version 0.8
  2. Put a lot of load on apis controlled by this guy

Logs from KEDA HTTP operator

example

HTTP Add-on Version

0.8.0

Kubernetes Version

< 1.28

Platform

Microsoft Azure

Anything else?

AKS v 1.27.9

@Leonardo-Ferreira Leonardo-Ferreira added the bug Something isn't working label Jun 28, 2024
@Leonardo-Ferreira Leonardo-Ferreira changed the title Memory leak Memory leak on Interceptor Jun 29, 2024
@tomkerkhove
Copy link
Member

@JorTurFer / @wozniakjan do you have bandwidth to check this one?

@wozniakjan
Copy link
Member

wozniakjan commented Jul 1, 2024

thank you for reporting this, from the graphs this indeed looks like a memory leak, I can investigate later today.

@wozniakjan
Copy link
Member

after a quick investigation, this might actually be related to HTTP connections overhead, GC and request caching rather than an actual memory leak. I am able to reproduce quick memory growth with concurrent requests. When idle on a tiny cluster with a single HTTPScaledObject, the interceptor uses around 21Mi

$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME                                            CPU(cores)   MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r   4m           21Mi

with requests sent sequentially, the memory usage barely increases and then goes down to idle levels. With simple benchmark using wrk with 100 concurrent threads and 100 connections

$ wrk -t100 -c100 -d30s -H "Host: myhost2.com" http://172.19.0.2

the memory usage does increase, peaking at 54Mi:

$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME                                            CPU(cores)   MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r   494m         54Mi

but after wrk benchmark terminates, the CPU usage drops down quickly and while the memory usage decreases rather slowly, both eventually reach similar levels to the initial idle. With 100 threads and 500 connections, interceptor OOMKills pretty fast with a default limit of 64Mi, after bumping the memory limit to 1Gi, the memory again seems well flexible - growing as well as shrinking with benchmark executions.

Attached is also trace output from net/http/pprof during the execution of the wrk benchmark. The memory growth is apparent but so is also the memory decrease. There doesn't appear to be any place in code where the memory just grows. I will perform some longer running benchmarks with profiling traces to try to spot something more sneaky.

trace-wrk.txt

@wozniakjan
Copy link
Member

wozniakjan commented Jul 1, 2024

I may have noticed something, with higher error rate in the benchmark, the memory consumption doesn't go down. I just bumped the memory limit for interceptor to 1Gi and executed benchmark that may have been too much, 5000 connections in 200 threads for 10 minutes

$ wrk -t200 -c5000 -d10m -H "Host: myhost2.com" http://172.19.0.2
Running 10m test @ http://172.19.0.2
  200 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   688.19ms  506.07ms   2.00s    69.41%
    Req/Sec     6.19      5.99   101.00     83.53%
  351661 requests in 10.00m, 277.64MB read
  Socket errors: connect 0, read 0, write 0, timeout 323052
  Non-2xx or 3xx responses: 16769
Requests/sec:    586.00
Transfer/sec:    473.75KB

out of ~350000 requests, 91% timeouts and 4.7% errors. The memory usage is still at 531Mi which is close to its ~600Mi peak during the benchmark even though it's been idle for 10 minutes, and it doesn't seem to go down.

@Leonardo-Ferreira
Copy link
Author

@wozniakjan my current number is around 15k requests every minute. I currently have 3 interceptor pods, at 640 MB each... in about one hour and half that number is expected to peak at 25k... during the late night, between 2am-5am, the number drops to 5k...

@wozniakjan
Copy link
Member

@Leonardo-Ferreira I played around with benchmarks a bit more and afaict interceptor seems to be very CPU intensive, for wrk simulated load that would match the request rate of 25k/min I am able to saturate 12 CPU cores easily. With this amount of CPU, it's able to maintain a relatively low error rate 0.42%.

With a low error rate, the memory usage appears to go down to idle levels eventually, but when the error rate is high, even when later the interceptor is off traffic, memory usage doesn't seem to go down.

Go's pprof heap analysis doesn't point to any code that would accumulate memory and not release it, but also heap as counted by pprof appears to be just a fraction of memory used compared to cgroup accounting. The pprof recorded 44MB while cgroup show 318MB used.

@Leonardo-Ferreira
Copy link
Author

@wozniakjan after looking deeply at my data, I could not correlate the memory usage/increase-rate to the number of errors. Luckily (or not) we had an issue this morning where a dependency was flaky so there was a significant spike in errors related to timeouts. on the memory consumption graph there was no significant change during the 1 hour long event...

@Leonardo-Ferreira
Copy link
Author

Leonardo-Ferreira commented Jul 1, 2024

simulated load that would match the request rate of 25k/min I am able to saturate 12 CPU cores easily

well, that's weird... our pods have a limit of 0.7
image
this is the last 24hrs looking at the 95th percentile of cpu usage

@wozniakjan
Copy link
Member

wozniakjan commented Jul 1, 2024

well, that's weird... our pods have a limit of 0.7

good to hear that, perhaps my artificial benchmark setup just ends up being CPU-intensive.

my current hypothesis I am trying to either prove or disprove is that maybe when the interceptor itself fails to process the requests and it has a higher error rate, maybe then it leaves some resources dangling, e.g. open connections or entries in routing tables caches..

@Leonardo-Ferreira
Copy link
Author

would that output a specific log message that I could query for?

@wozniakjan
Copy link
Member

wozniakjan commented Jul 1, 2024

I can see plenty of these access logs in the interceptor

2024-07-01T14:26:38Z    INFO    LoggingMiddleware   10.42.0.8:35634 - - [01/Jul/2024:14:26:36 +0000] "GET / HTTP/1.1" 502 11 "" ""

despite the fact the scaled app returns always OK 200.

Also, I can observe these errors in the interceptor logs

2024-07-01T14:27:13Z    ERROR   LoggingMiddleware.RoutingMiddleware.CountingMiddleware.UpstreamHandler.StaticHandler    Bad Gateway {"routingKey": "//myhost2.com/", "namespacedName": {"name":"demo","namespace":"keda"}, "stream": "http://demo.keda:8080/", "error": "context canceled"}
github.com/kedacore/http-add-on/interceptor/handler.(*Static).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/handler/static.go:36
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP.func2
    github.com/kedacore/http-add-on/interceptor/handler/upstream.go:54
net/http/httputil.(*ReverseProxy).ServeHTTP
    net/http/httputil/reverseproxy.go:472
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/handler/upstream.go:57
main.runProxyServer.newForwardingHandler.func2
    github.com/kedacore/http-add-on/interceptor/proxy_handlers.go:87
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
github.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/counting.go:39
github.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/routing.go:93
github.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/logging.go:42
github.com/kedacore/http-add-on/interceptor/middleware.(*Metrics).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/metrics.go:24
net/http.serverHandler.ServeHTTP
    net/http/server.go:3137
net/http.(*conn).serve
    net/http/server.go:2039

@Leonardo-Ferreira
Copy link
Author

should we log some extra info and release a 0.8.1 version? unfortunately my security dpto does not allow me to use "unofficial" images in production

@wozniakjan
Copy link
Member

I'm still in the process of trying to figure out the root cause but as soon as I have some valid hypothesis to test, I can distribute a build for your testing

@Leonardo-Ferreira
Copy link
Author

any news @wozniakjan, can I help somehow?

@wozniakjan
Copy link
Member

I didn't get very far and now I had to put it on hold temporarily, but I will get back to this soon

@Leonardo-Ferreira
Copy link
Author

hey @wozniakjan, I'd like to contribute here. would you be willing to connect for like 30-45min so you can "boost" me? this way I could make significant contributions faster

@wozniakjan
Copy link
Member

sure, you can ping me on KEDA slack https://keda.sh/community/
https://kubernetes.slack.com/team/UF5R8HHB7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: To Triage
Development

No branches or pull requests

3 participants