[WIP][Profiling] Prevent panic from crossing FFI boundaries #815
base: main
Benchmarks
Comparison
Benchmark execution time: 2025-01-06 14:04:28
Comparing candidate commit d8e4924 in PR branch
Found 0 performance improvements and 3 performance regressions! Performance is the same for 48 metrics, 2 unstable metrics.
scenario:normalization/normalize_service/normalize_service/Data🐨dog🐶 繋がっ⛰てて
scenario:tags/replace_trace_tags
579730c to 6730452
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #815      +/-   ##
==========================================
- Coverage   71.03%   71.03%   -0.01%
==========================================
  Files         313      313
  Lines       45908    45912       +4
==========================================
+ Hits        32612    32614       +2
- Misses      13296    13298       +2
6730452 to d8e4924
This is a complex subject, but I think we've been avoiding it for a bit too long.
The ideal mode of operation would be to have no aborts, but in the real world one sometimes slips by... I think having a "last line of defense" on FFI functions that returns a clean error makes sense.
Of course, in some cases returning an error means the Rust state may be left inconsistent, but it seems reasonable to put the burden on the API client to handle errors properly by backing off and stopping, e.g., profiling, rather than tearing the whole app down. An abort is an extreme event, so I think stopping the Datadog stuff and logging an error (possibly via crashtracking...) is a reasonable outcome, and a much better alternative than killing off the host app.
We could look into enabling lints to make it easier for us to catch things that may cause an abort, but I suspect that would hinder us and make a lot of the code we want to write really awkward. The Linux kernel folks have faced similar challenges, so perhaps there are some lessons there (I found this link on a quick googling).
TL;DR: If we could write a nice macro to add this feature to most of our FFI, I think it's a reasonable and worthwhile improvement.
That's why this draft is here :) to talk about this subject.
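To make the "last line of defense" macro idea from the discussion above concrete, here is a minimal sketch, not the code in this PR. It assumes a status-code style of error reporting; `FfiStatus`, `ffi_guard!`, and `example_increment` are hypothetical names invented for illustration and are not part of libdatadog. The only real API used is `std::panic::catch_unwind`, which only helps when the crate is built with unwinding panics (it cannot catch anything under `panic = "abort"`).

```rust
/// Hypothetical status code handed back to the C caller
/// (an assumption for this sketch, not libdatadog's real error type).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FfiStatus {
    Ok = 0,
    Error = 1,
    Panicked = 2,
}

/// Hypothetical guard macro: runs the body inside `catch_unwind` and turns a
/// panic into `FfiStatus::Panicked` instead of letting it reach the caller.
macro_rules! ffi_guard {
    ($body:block) => {
        // `AssertUnwindSafe` is a promise made by us, not checked by the
        // compiler: state touched by the body may be inconsistent after a
        // panic, so the caller is expected to back off.
        match std::panic::catch_unwind(std::panic::AssertUnwindSafe(
            || -> FfiStatus { $body },
        )) {
            Ok(status) => status,
            Err(_) => FfiStatus::Panicked,
        }
    };
}

/// Hypothetical FFI entry point. A real export would also carry a
/// `no_mangle` attribute; it is omitted here to keep the sketch portable.
pub extern "C" fn example_increment(value: i64, out: *mut i64) -> FfiStatus {
    ffi_guard!({
        if out.is_null() {
            // Expected misuse is reported as a plain error, no panic involved.
            return FfiStatus::Error;
        }
        // `expect` stands in for a panic that slipped past review.
        let next = value.checked_add(1).expect("overflow in example_increment");
        // SAFETY: `out` was checked for null; the caller must pass a valid,
        // writable pointer.
        unsafe { *out = next };
        FfiStatus::Ok
    })
}
```

With this shape, a caller that passes `i64::MAX` gets `FfiStatus::Panicked` back rather than an aborted process, and can decide to stop profiling. Keeping `Panicked` distinct from `Error` is a design choice of the sketch: it lets the client tell "bad input" apart from "library state may be inconsistent".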
What does this PR do?
This PR is a communication channel to discuss how to prevent panics from crossing FFI boundaries.
The code change here is a starting point for that discussion, to help find the best way to achieve this.
Motivation
While playing around with the new crashtracker API, we got a crash (#756) because some information was missing. The example had been neither updated nor run, which left us with an unusable version of libdatadog (crashtracking).
This may be a misuse, in which case the client's code must be changed, but in no case should this panic. (At a minimum, there should be tests.)
Additional Notes
Anything else we should know when reviewing?
How to test the change?
Describe here in detail how the change can be validated.