-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Remote Configuration boot tags #3315
Conversation
JRuby fails with:
And of course I can't reproduce it locally. |
So once the queue is popped the main thread is able to rush to
So the whole attempt to make sure things are where we want is gloriously failing. It happens with JRuby because it can run in parallel but I see no reason it could not happen with CRuby as well whenever it decides to schedule threads in a certain way. Moved to use a dumb |
unless Datadog::Core::Remote.active_remote.nil? | ||
barrier = nil | ||
|
||
t = Datadog::Core::Utils::Time.measure do | ||
barrier = Datadog::Core::Remote.active_remote.barrier(:once) | ||
end | ||
|
||
if barrier != :pass && (span = active_span) && !span.has_tag?('_dd.rc.boot.time') | ||
span.set_tag('_dd.rc.boot.time', t) | ||
|
||
if barrier == :timeout | ||
span.set_tag('_dd.rc.boot.timeout', true) | ||
else | ||
# TODO: 'ready' should evolve into ensuring RC received a proper reply | ||
span.set_tag('_dd.rc.boot.ready', true) | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to encapsulate/review this logic, so it's compartmentalized and shard with other occurrence on lib/datadog/tracing/contrib/rack/middlewares.rb
?
My motivation for asking this is because it's read like an unrelated unless/end
block in the middle of this call
method. I wouldn't mind it if it was hidden behind an abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I initially set out to do that, especially given that the tracing side does exactly the same thing except that on the tracing side it's split in two because of the span creation.
I'll try something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I've folded that in a Datadog::Core::Remote::Tie
namespace.
Now getting failures like so:
|
This is now functionally complete, some refactoring is up next. |
4ca58d6
to
127bbdc
Compare
127bbdc
to
b612ba0
Compare
Rebased. Will proceed with the code cleanups. |
bcac361
to
cead686
Compare
RBS has updated with a fix
There's a chance that the value returned here is not consistent with the timeout that may have happened before.
b20f4ca
to
a031262
Compare
Rebased again |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3315 +/- ##
==========================================
+ Coverage 98.25% 98.26% +0.01%
==========================================
Files 1262 1264 +2
Lines 73606 74113 +507
Branches 3445 3466 +21
==========================================
+ Hits 72319 72828 +509
+ Misses 1287 1285 -2 ☔ View full report in Codecov by Sentry. |
2.0 Upgrade Guide notes
What does this PR do?
Motivation:
RC boot waiting being best-effort, it becomes important to be able to diagnose specific requests behaving in unexpected ways WRT remote configuration settings, e.g some select requests not being blocked by a denylist entry.
By adding tags recording what happened with remote configuration boot on a specific request we can diagnose why a specific request behaved a certain way instead of relying on guesswork.
This is fundamentally different from gathering statistical metrics pushed via telemetry (another planned thing).
Additional Notes:
Since these only concern specific information for boot, these tags are present only for requests that are subject to waiting during the boot process.
Recording the time as a metric for all boot requests also allows us to build knowledge of what a sane default value would be.
A PR to allow configuration of the barrier timeout is coming up next.Timeout is configurable.How to test the change?
This starts as a draft to gather feedback,
unit tests pendingUnit tests are implemented.Manual check:
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.Unsure? Have a question? Request a review!