Memory issue with ruby Timeout module #73
@davidhu2000 thanks for the detailed write-up! We do indeed have one usage of `Timeout`. The memory being used by libhoney should only be the events that are stored in memory waiting to be sent. Are you able to see some kind of memory leak here, or just an increase in memory usage? When you say "significantly less memory used", do you have numbers for this?
Looks like we have two separate stack traces (we added more traces to see). The first one looks to be when we send a trace to Honeycomb. The second one is just the polling?
In terms of the number of calls, the majority is #2, which is around 500 per minute per server. Looking at the code, this looks to be the block with the issue; I think just using the timeout itself might be causing the bloated memory issues?

```ruby
Thread.handle_interrupt(Timeout::Error => :on_blocking) do
  while (event = Timeout.timeout(@send_frequency) { @batch_queue.pop })
    key = [event.api_host, event.writekey, event.dataset]
    batched_events[key] << event
  end
end
```

We have also seen use of
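To illustrate the suspicion above: in the Ruby versions discussed in this thread (2.5.x), every `Timeout.timeout` call starts a short-lived watcher thread, so each trip around that polling loop pays a per-call allocation cost. A small standalone script (not from this thread, purely illustrative) makes that visible:

```ruby
# Standalone illustration (not from libhoney): count object allocations
# made by a batch of Timeout.timeout calls whose blocks return immediately.
# On Ruby 2.5.x each call spins up a watcher thread, so the count grows
# with the number of calls even though the blocks do no work.
require 'timeout'

GC.start
before = GC.stat(:total_allocated_objects)

1_000.times do
  Timeout.timeout(0.01) { nil } # returns immediately; no Timeout::Error raised
end

after = GC.stat(:total_allocated_objects)
puts "Objects allocated by 1,000 Timeout.timeout calls: #{after - before}"
```

At the roughly 500 calls per minute per server reported above, that per-call overhead adds up quickly.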
It's not strictly
I think the fact this library is using
The memory continues to increase throughout the lifetime of the application, so we definitely think it's a leak. If it weren't a leak, we'd expect the memory to plateau.
Unfortunately I don't. I can try to pull some numbers; it'll take a while to get an accurate measurement though. If I remember correctly, it's about half.
We did a few sample tests with Honeycomb on vs. off. Basically, it's a Sidekiq worker that runs some task that takes 1s to 15s. Here are the results. Our code is basically something like:

```ruby
ObjectSpace.trace_object_allocations_start
# ...perform some tasks (specifically open a file and analyze it)
ObjectSpace.trace_object_allocations_stop
dump = ObjectSpace.dump_all
logger.info "Heap dump at #{dump.path}"
```

Then we use
With honeycomb (in bytes)
Without honeycomb (in bytes)
The only change between the two tests is whether we have Honeycomb installed. As you can see, taking off Honeycomb reduces the memory usage of the timeout module.
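The analysis tool elided above isn't named in the thread, but the dump written by `ObjectSpace.dump_all` is newline-delimited JSON, one object per line, so it can be summarized by allocation site with a few lines of Ruby. The dump path below is a placeholder for whatever `logger.info` reported.

```ruby
# Rough sketch for summarizing a heap dump written by ObjectSpace.dump_all.
# Groups memory by the file that allocated each object; only objects
# allocated while trace_object_allocations was enabled carry a "file" key.
require 'json'

totals = Hash.new(0)
File.foreach('/tmp/heap.dump') do |line| # placeholder path from the log line above
  obj = JSON.parse(line)
  next unless obj['file']
  totals[obj['file']] += obj['memsize'].to_i
end

totals.sort_by { |_, bytes| -bytes }.first(10).each do |file, bytes|
  puts format('%12d bytes  %s', bytes, file)
end
```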
After spending some more quality time with a Beeline'd Rails app, I can confirm the overall assessment here: using Ruby core's `Timeout` is what is driving the memory growth. The results of running a heap analysis point to the same culprit: line 83 of Ruby 2.5.8's `timeout.rb`.
libhoney's use of `Timeout` is a workaround for Ruby's `Queue` and `SizedQueue` classes not providing a timeout for popping objects off the queue.

Possible Solution

Implement our own queue with a timeout-capable pop (a rough sketch of the idea follows this comment).

PROS:

CONS:
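For what it's worth, the usual Timeout-free way to get a pop-with-timeout in Ruby is a mutex plus condition variable. Below is a minimal sketch of that idea; the class and method names are hypothetical and not the API libhoney ultimately shipped.

```ruby
# Minimal sketch of a pop-with-timeout queue built on Mutex +
# ConditionVariable instead of Timeout. Names here are hypothetical.
class TimedQueue
  def initialize
    @items = []
    @lock  = Mutex.new
    @ready = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @items << item
      @ready.signal
    end
  end

  # Returns the next item, or nil if nothing arrives within `timeout` seconds.
  def pop(timeout)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @lock.synchronize do
      while @items.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        return nil if remaining <= 0
        @ready.wait(@lock, remaining) # releases the lock while waiting
      end
      @items.shift
    end
  end
end
```

The batching loop shown earlier could then call something like `pop(@send_frequency)`, with a nil return standing in for the timeout, and no watcher threads would be created.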
@robbkidd wondering if there's a timeline for a fix or if there is a workaround? Our Sidekiq workers continue to hit our Kubernetes memory limits and crash and restart -- which leads to a number of jobs occasionally crashing. Not sure if helpful, but we're also seeing SEGFAULTs reported in our logs when ^ happens.
@lewisf We've got several things in other integrations ahead of this in our backlog. We don't have an estimate at the moment for when this issue will be addressed.
@robbkidd I hate to ping again (I understand y'all are busy and appreciate all the work you're doing), but we recently had to remove Honeycomb from production because the memory leaks were causing our processes to die fairly consistently (getting OOM killed), leading to other instability issues.
@lewisf I just talked it over with some folks. As the current thinking stands, we'd have to change the libhoney guts around, which is (a) invasive enough that workarounds are hard to come by and (b) risky enough that patches are high effort and not likely to be finished soon. 😞

How reliant is your code on the Beeline's specific auto-instrumentation? I ask because I poked around the opentelemetry-ruby code base, and I don't actually spot any usage of the `Timeout` module there. We might use a similar approach in libhoney, but again it's going to take some effort to implement & test. Meanwhile, Honeycomb now accepts OTLP natively, so in theory you could avail yourself of the existing opentelemetry-ruby instrumentation.
A caveat to the opentelemetry-ruby route: OTel Ruby speaks only OTLP-via-HTTP and Honeycomb currently accepts only OTLP-via-GRPC. Using OTel Ruby to send to Honeycomb would require running an OpenTelemetry Collector to receive the OTLP/http and translate traces into OTLP/grpc. @ajvondrak and I stared contemplatively at this code yesterday. We're investigating an experimental queue replacement.
Thank you for the update! We do use the provided Beeline auto-instrumentation. Will read up on OTLP to see how it works and to get a sense of whether we can spin up the necessary effort in the near term. The experimental queue replacement sounds interesting ... definitely keep us posted.
The experiment (#87) is up!
The alternative transmission is now available: #87 has been merged, and we also now support OTLP/HTTP at the Honeycomb API, so the OpenTelemetry SDK can be used in place of the Beeline if you'd prefer that. Closing this as there are now two alternatives. Please reopen if you think it's still relevant.
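For readers weighing the second alternative, a minimal OpenTelemetry SDK setup pointed at Honeycomb's OTLP endpoint might look like the sketch below. The gem names and require paths follow the opentelemetry-ruby READMEs, and the endpoint and header values are assumptions to verify against Honeycomb's current OTLP docs rather than anything stated in this thread.

```ruby
# Hedged sketch of the "use the OpenTelemetry SDK instead of the Beeline"
# route mentioned above; names and values below are assumptions.
#
# Gemfile:
#   gem 'opentelemetry-sdk'
#   gem 'opentelemetry-exporter-otlp'
require 'opentelemetry/sdk'
require 'opentelemetry-exporter-otlp'

# The OTLP exporter honors the standard OTEL_* environment variables, e.g.:
#   OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
#   OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=<your API key>
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-sidekiq-worker' # hypothetical service name
end

# Spans created through the OTel API are batched and exported over
# OTLP/HTTP without any use of Ruby's Timeout module.
tracer = OpenTelemetry.tracer_provider.tracer('example')
tracer.in_span('demo') { sleep 0.01 }
```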
We were debugging a memory problem in one of our applications; the memory usage was going up by around 2GB in 6 hours. To help debug, we used ObjectSpace to see where the memory is used.
Here is the result of our analysis (for about 2 hours of data collection)
The identified issue is this line in the Ruby Timeout module.
After some research, this looks to be happening in other places as well (example). On more research, it seems Ruby's Timeout is considered somewhat dangerous (here and here).
I took a quick look around this repo and didn't see any specific uses of Timeout, but timeout is used for Net::HTTP.
We then monkey-patched the timeout module and logged the caller to figure out where the massive number of timeout calls is coming from.
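A caller-logging patch along these lines is enough to attribute the calls; this is a simplified, hypothetical sketch, not the exact patch used in this investigation.

```ruby
# Simplified, hypothetical sketch of a caller-logging patch for Timeout.
# Prepending a module onto Timeout's singleton class lets the original
# Timeout.timeout run unchanged while recording who called it.
require 'timeout'

module TimeoutCallerLogger
  def timeout(*args, &block)
    # Log a few caller frames so high-volume call sites stand out.
    warn "Timeout.timeout called from:\n  #{caller(1, 3).join("\n  ")}"
    super
  end
end

Timeout.singleton_class.prepend(TimeoutCallerLogger)
```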
What we found is that we are getting around 185k calls per minute across 4 servers, coming from Honeycomb. Here are the first few stack traces from our log.
We turned off Honeycomb and measured memory again, and significantly less memory was used in the timeout module.
I want to open this issue to see what the team thinks of this, and whether there are alternatives that could help.
Versions of stuff we use