[PROF-9136] Fix rare remote configuration worker thread leak #3519
Conversation
**What does this PR do?**

This PR fixes the remote configuration worker thread leak that is discussed in more detail in the internal "dd-trace-rb Remote::Worker thread leak investigation" doc.

In a nutshell, there's a potential race between threads that call `Datadog::Core::Remote::Tie.boot` and threads that call `Datadog.configure` that can, in really unlucky situations, mean that the `Remote::Worker` thread gets leaked during library reconfiguration.

This rare issue is O(number of times `Datadog.configure` gets called) while requests are being processed in a web app. I was only able to reproduce it with the weird reproducer below, and by adding `sleep` to some parts of our code, so while we have observed it in production, we expect it to be rare.

Being more specific, this issue is a race between the remote configuration thread being shut down and other parts of the code base trying to restart it. In practice, the remote configuration worker would previously always get started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`. During dd-trace-rb reconfiguration, it's possible for the remote configuration thread to be shut down and, before the new configuration kicks in, for any other thread to call `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the remote configuration thread that belonged to the old configuration.

This PR fixes the issue by introducing a `@stop_requested` variable that flags a "terminal state" for the remote configuration worker. Thus, the default behavior of `Remote::Worker#start` is changed from always starting the thread to not starting the thread if the same worker was stopped before.

**Motivation:**

Fix issue.

**Additional Notes:**

We are also working to remove cases where `Datadog.configure` gets called during request handling, e.g. #3515.

**How to test the change?**

I've added unit test coverage for the worker not restarting itself again.

To validate that this fixes the worker thread leak, the following reproducer can be used:

```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
           components = Components.new(settings)

           old.shutdown!(components)
+          puts "After shutdown, sleeping!"
+
+          sleep 1
+
+          puts "Woke up after shutdown!"
+
           components.startup!(settings)
           components
         end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+  15.times {
+    Datadog.configure { Thread.pass }
+    Thread.pass
+  }
+end
+
+trigger_rc_thread = Thread.new do
+  loop {
+    sleep 0.5
+    Datadog::Core::Remote.active_remote.barrier(:once)
+    Thread.pass
+  }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```

Before the fix, the list of threads printed would show a lot of leaked remote configuration worker threads. With the fix, the threads no longer leak, and if you enable debug logging you will see that the restart is being prevented and the debug log I added shows up:

```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace] (dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start') remote worker: refusing to restart after previous stop
```

This debug message is harmless and shows the fix is working fine.
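To illustrate the shape of the fix, here is a minimal, self-contained sketch of a worker whose `start` refuses to run again once `stop` has been called. It is a simplified stand-in with made-up names, not the actual `Datadog::Core::Remote::Worker` implementation:

```ruby
# Minimal sketch (not the real Datadog::Core::Remote::Worker): a worker that
# remembers it was stopped and refuses to be restarted afterwards.
class ToyWorker
  def initialize(&block)
    @mutex = Mutex.new
    @started = false
    @stopped = false # terminal flag, analogous to the PR's @stop_requested/@stopped
    @block = block
  end

  def start
    @mutex.synchronize do
      if @stopped
        warn 'toy worker: refusing to restart after previous stop'
        return
      end
      return if @started

      @started = true
      @thread = Thread.new { @block.call }
    end
  end

  def stop
    @mutex.synchronize do
      @stopped = true
      @started = false
      @thread&.kill
      @thread = nil
    end
  end
end

# Usage: after stop, a racing call to start no longer leaks a new thread.
w = ToyWorker.new { sleep }
w.start
w.stop
w.start # logs the refusal instead of spawning another thread
```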
Thanks for tackling that one!

I seem to remember we talked about protecting concurrent access with a lock inside `Datadog::Some::Thing.active_remote`. @ivoanjo do you feel this PR is instead enough to solve the issue?
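For reference, here is a minimal sketch of what that lock-based alternative could look like. The `RemoteAccess` module, its methods, and the `stop` call are hypothetical placeholders for illustration, not the actual dd-trace-rb API:

```ruby
# Hypothetical holder for the currently-active remote configuration client
# (illustrative names; not the actual dd-trace-rb internals). A single Mutex
# guards both lookup and replacement, so a caller can no longer grab and
# restart a worker that belongs to an already-replaced configuration.
module RemoteAccess
  LOCK = Mutex.new
  @active_remote = nil

  def self.active_remote
    LOCK.synchronize { @active_remote }
  end

  def self.replace_active_remote(new_remote)
    LOCK.synchronize do
      @active_remote&.stop # assumed shutdown entry point on the old instance
      @active_remote = new_remote
    end
  end
end
```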
lib/datadog/core/remote/worker.rb (outdated):

```diff
         @interval = interval
         raise ArgumentError, 'can not initialize a worker without a block' unless block

         @block = block
       end

-      def start
+      def start(allow_restart: false)
```
Is there a situation where we want to allow restart?
My memory is a bit fuzzy but it seemed to me that once a remote worker has started in practice its only subsequent states are "stopping" and "stopped", but maybe I missed something?
(IOW is state 5. disallowed and maybe should be guarded against?)
Removed in 760e084
lib/datadog/core/remote/worker.rb (outdated):

```diff
         @started = false
+        @stop_requested = false
```
Not fond of what the naming implies here ^^', I feel it introduces incidental complexity.
Here's my train of thought.
This sequence diagram shows how I view the state machine†:

```
{ ¬started, ¬starting, ¬stopping }   # 1.
        |
        | #start
        |
        V
{ ¬started, starting, ¬stopping }    # 2.
        |
        | success
        |
        V
{ started, ¬starting, ¬stopping }    # 3.
        |
        | #stop
        |
        V
{ started, ¬starting, stopping }     # 4.
        |
        | success
        |
        V
{ ¬started, ¬starting, ¬stopping }   # == 1.
```
In that context, `stop_requested` captures a transition (IOW `#stop` has been called) instead of representing a state. In effect this leads to a representational issue: `@stop_requested` will be forever `true` even when a restartable worker has been restarted and no `#stop` has been issued to that second `#start` state.

AIUI what `stop_requested` intends to introduce is:

- a distinction between the first and last state of the diagram, turning the last one into a 5., then making 5. equivalent to 1. by a conditional on `@allow_restart`
- a distinction between 3. and 4., which became identical due to the removal of `@stopping`.

As a consequence I feel introducing a `stopped` state would better represent all possible states:
```
{ ¬started, ¬starting, ¬stopping, ¬stopped }   # 1.
        |
        | #start
        |
        V
{ ¬started, starting, ¬stopping, ¬stopped }    # 2.
        |
        | success
        |
        V
{ started, ¬starting, ¬stopping, ¬stopped }    # 3.
        |
        | #stop
        |
        V
{ started, ¬starting, stopping, ¬stopped }     # 4.
        |
        | success
        |
        V
{ ¬started, ¬starting, ¬stopping, stopped }    # 5.
        |
        | #start
        |
        V
{ ¬started, starting, ¬stopping, stopped }     # 6.
        |
        | success
        |
        V
{ started, ¬starting, ¬stopping, ¬stopped }    # == 3.
```
It then follows that:

- `@allow_restart` is only effective when `stopped` (and not effective when `¬stopped`)
- the current implementation of `stop_requested` is equivalent to `stopping OR stopped`
  - ... except when the worker is `started`; in that case I feel a stop count (or maybe better, a start count) could be useful
- the ambiguity about `stop_requested` on a restartable worker is resolved and fully captured by `stopping`

I also feel it makes it easier to implement helper methods from these fundamental state primitives (e.g. synthesising `#stop_requested` if needed).
† `success` is actually a diamond with an `error` branch, elided so as not to overload the sequence diagram, and also because it happens via exceptional code flow.
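To make the suggestion concrete, here is a small sketch of the four primitive flags plus predicates synthesized from them. This is illustrative only; the class name, transition methods, and predicates are assumptions, not the actual `Remote::Worker` code:

```ruby
# Illustrative-only sketch: four primitive booleans plus helper predicates
# synthesized from them, following the state machine in the comment above.
class WorkerState
  def initialize
    @started  = false
    @starting = false
    @stopping = false
    @stopped  = false
  end

  # Transitions (error branches elided, as in the diagrams above)
  def on_start
    @starting = true
  end

  def on_start_success
    @starting = false
    @started  = true
    @stopped  = false
  end

  def on_stop
    @stopping = true
  end

  def on_stop_success
    @stopping = false
    @started  = false
    @stopped  = true
  end

  # Helpers synthesized from the primitives
  def stop_requested?
    @stopping || @stopped
  end

  # allow_restart only matters once the worker has reached the stopped state
  def may_start?(allow_restart)
    !@started && !@starting && (!@stopped || allow_restart)
  end
end
```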
Changed `@stop_requested` to `@stopped` in 22f3330.

I initially added this thinking it may make it easier for testing and/or experimentation, but after feedback from review, let's go without it for now -- we can always re-add later.
**Codecov Report**

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #3519      +/-   ##
==========================================
- Coverage   98.24%   98.23%   -0.01%
==========================================
  Files        1275     1275
  Lines       75273    75281       +8
  Branches     3554     3556       +2
==========================================
+ Hits        73949    73956       +7
- Misses       1324     1325       +1
```

☔ View full report in Codecov by Sentry.
**For Datadog employees:**

[...] credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.