[PROF-8967] Reduce memory footprint and allocations for profiling timeline data #293
Conversation
**What does this PR do?**

This PR is a proof-of-concept for reducing:

* the number of allocations needed to store profiling timeline data
* the memory space used to store profiling timeline data

It does so by taking advantage of the following observation: the timeline data is write-only until it's time to serialize. Thus, instead of keeping a vec of `TrimmedTimestampedObservation` (which internally keeps another vec), I've decided to just stream all the bytes to an lz4 compressor, making sure to size its output buffer so as to avoid growing in very small steps. At serialization time, we do the reverse, sample by sample, and then immediately write such samples as pprof, reducing the memory footprint needed even during serialization.

The objective of this PR is to start a discussion on this approach; the code itself is an awful hack and it's probably better that we reimplement this idea from scratch if we do like it.

Note that while I cheated on some of the libdatadog tests (because they used internal details that changed), this branch passes the full Ruby profiler test suite and produces -- as far as I could tell -- valid profiles.

**Motivation:**

One of our Ruby on Rails test apps was experiencing heavy memory bloat when timeline profiling was enabled. I'm still doing an (internal) writeup with more details, but TL;DR: it's a combination of many threads (100), glibc creating a memory arena for each thread, and libdatadog's use of malloc combined with Ruby's use of malloc causing every one of these arenas to grow a lot.

Here are some quick numbers (more details in the internal writeup) for a reproducer app I wrote that, from my experiments, seems to mimic the behavior of the full Ruby app closely enough:

| Configuration | RSS usage | Note |
|--------------------------:|----------:|------|
| Baseline (no profiling) | 215 552k | Baseline here includes the profiler being instanced, but never sampling, so it's higher than in real apps. |
| Libdatadog 5, no timeline | 214 396k | |
| Libdatadog 5, timeline | 298 640k | Does not stabilize after simulating 15 mins of profiling (i.e. RSS still growing) |
| Prototype, timeline | 223 468k | RSS oscillates between 219 and 223 MiB |

**Additional Notes:**

N/A

**How to test the change?**

I'll provide instructions to reproduce my experiments soon.

**For Reviewers**

@DataDog/security-design-and-guidance
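For readers, here's a minimal sketch of the write-then-replay idea described above. The PR only shows a `FrameEncoder<Vec<u8>>`, so the use of the `lz4_flex` frame API and the `byteorder` crate, as well as the exact record layout, are assumptions for illustration:

```rust
use byteorder::{NativeEndian, ReadBytesExt, WriteBytesExt};
use lz4_flex::frame::{FrameDecoder, FrameEncoder};

fn main() -> std::io::Result<()> {
    // Write path: stream each observation's raw bytes straight into the
    // compressor instead of keeping a Vec of per-sample structs.
    let mut encoder = FrameEncoder::new(Vec::new());
    let (stacktrace_id, labels_id, timestamp): (u32, u32, i64) = (1, 2, 1_700_000_000);
    encoder.write_u32::<NativeEndian>(stacktrace_id)?;
    encoder.write_u32::<NativeEndian>(labels_id)?;
    encoder.write_i64::<NativeEndian>(timestamp)?;
    let compressed: Vec<u8> = encoder.finish().expect("flush lz4 frame");

    // Serialization path: decompress and replay the samples one by one,
    // so the full decoded stream never lives in memory at once.
    let mut decoder = FrameDecoder::new(&compressed[..]);
    assert_eq!(decoder.read_u32::<NativeEndian>()?, stacktrace_id);
    assert_eq!(decoder.read_u32::<NativeEndian>()?, labels_id);
    assert_eq!(decoder.read_i64::<NativeEndian>()?, timestamp);
    Ok(())
}
```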
```rust
} else {
    // Create buffer, pre-touch all elements to avoid any laziness
    let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
```
This is 1 MiB, aka 1024 KiB. Your tests are showing RSS usage smaller than this. I don't see how this is possible. What am I missing?
The values above are in kilobytes; e.g. the prototype's 223 468k => ~223 MiB.

I actually did some experiments to try to account for all the memory, sharing here as well:

* Prototype, no timeline: ~218 MiB (since we pre-initialize the space anyway, and the Ruby profiler keeps two profiles active at all times)
* Prototype, timeline, no serialization: ~219 MiB (i.e. the extra space seen above is memory needed during serialization, since timeline profiles, even compressed, are bigger)
profiling/src/internal/profile.rs
```rust
let labels_id_raw = decompressed_input.read_u32::<NativeEndian>().unwrap();
let timestamp_raw = decompressed_input.read_i64::<NativeEndian>().unwrap();

let mut values: Vec<i64> = Vec::with_capacity(self.sample_types.len());
```
You could declare this outside the loop and do a `.clear()` at the top of the loop so you only ever allocate one vec.
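A minimal sketch of that suggestion (the function name and loop bounds are illustrative, not the PR's code):

```rust
fn decode_all_samples(sample_count: usize, values_per_sample: usize) {
    // Allocate once, outside the loop.
    let mut values: Vec<i64> = Vec::with_capacity(values_per_sample);
    for _ in 0..sample_count {
        // clear() drops the contents but keeps the capacity, so the
        // single allocation is reused on every iteration.
        values.clear();
        // ... decode this sample's values into `values` ...
    }
}
```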
Good point. I think that ideally this would all move inside the iterator, so we don't need to have two copies of the code for going through each sample.
I tried doing that but found it too hard with my current Rust-foo level >_>
I've moved this to be inside the `TimestampedObservationsIter`. A `Vec` still gets created every time, but with limited lifetime (just each iteration). Suggestions welcome on improvements :)
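Roughly the shape being described, for reference (the type, fields, and framing here are assumptions, not the PR's actual code):

```rust
use byteorder::{NativeEndian, ReadBytesExt};

// Decodes one sample's values per iteration from a (decompressed) byte
// stream; the short-lived Vec lives only for that iteration.
struct TimestampedObservationsIter<R: std::io::Read> {
    decoder: R,
    values_per_sample: usize,
}

impl<R: std::io::Read> Iterator for TimestampedObservationsIter<R> {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut values = Vec::with_capacity(self.values_per_sample);
        for _ in 0..self.values_per_sample {
            // Stop at end of stream; a real implementation would want to
            // distinguish clean EOF from a truncated record.
            values.push(self.decoder.read_i64::<NativeEndian>().ok()?);
        }
        Some(values)
    }
}
```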
```rust
inner.obs_len.assert_eq(values.len());

pub fn init_timeline(&mut self, values_size: usize) {
    if let Some(_inner) = &self.inner {
        panic!("Should never happen!");
```
Probably better to return an error rather than panic.
Of course! I'm guessing we should probably turn this into an `Observations::new(...)` or something like that? (This was just another ugly hack as I wanted to focus on the prototype and didn't want to take the detour.)
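For illustration, a minimal sketch of the error-instead-of-panic direction (the stand-in inner type and the plain `&'static str` error are assumptions; the real error type is up for grabs):

```rust
struct Observations {
    inner: Option<Vec<u8>>, // stand-in for NonEmptyObservations
}

impl Observations {
    // Returns an error instead of panicking on double initialization.
    fn init_timeline(&mut self, capacity: usize) -> Result<(), &'static str> {
        if self.inner.is_some() {
            return Err("init_timeline called on already-initialized Observations");
        }
        self.inner = Some(Vec::with_capacity(capacity));
        Ok(())
    }
}
```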
Cleaned all this up in the latest version!
```rust
struct NonEmptyObservations {
    aggregated_data: HashMap<Sample, TrimmedObservation>,
    timestamped_data: Vec<TrimmedTimestampedObservation>,
    compressed_timestamped_data: FrameEncoder<Vec<u8>>,
```
How much do we save by compressing vs just using an array of values, and indices into it?
We don't actually need to index into the array (which is why compressing it works).

On the "how much do we save", here are some back-of-the-napkin numbers:

- Size of each observation: 4 bytes stacktrace, 4 bytes labels, 8 bytes timestamp, 8 bytes * N profile types
- 100 threads * 100 samples per second * 60 seconds * (4 + 4 + 8 + 8 * 4 profile types enabled for Ruby by default) = 28 800 000 bytes
- For... reasons... the test app ends up recording data for 103 threads, so I get:
  - Uncompressed: 29 MiB
  - Compressed: 4 MiB

Which seems like a nice improvement as well. This data is highly compressible (lots of small numbers, zeros, numbers next to each other, ...) so we could make a smarter uncompressed representation, but I think the compressor takes care of that very well for us.
Have you tried compressing in a Struct of Arrays format, compressing each array of fields of the `TrimmedTimestampedObservation` struct independently? I'd think that this would give you a better compression ratio, since the data would be more homogeneous.
Also, you might get a better compression ratio using the linked block mode, since you're doing a lot of small writes and thus the block size will default to the minimum (64KB).
I think those are very valid options; having said that, I'm not convinced they are what we would want here:

> Have you tried compressing in a Struct of Arrays format, compressing each array of fields of the TrimmedTimestampedObservation struct independently? I'd think that this would give you a better compression ratio, since the data would be more homogeneous.

If I'm understanding your suggestion, doing this would mean having multiple `FrameEncoder`s, with multiple underlying buffers:

```rust
struct NonEmptyObservations {
    stack_ids: FrameEncoder<Vec<u8>>,
    label_ids: FrameEncoder<Vec<u8>>,
    timestamps: FrameEncoder<Vec<u8>>,
    values: FrameEncoder<Vec<u8>>,
    // ...
}
```

This would mean more allocations, especially when growing the backing vecs -- there's no longer one single vec that gets doubled, but multiple small ones. 🤔 I suspect (without having tried it, to be fair xD) this would cause more heap fragmentation.

> Also, you might get a better compression ratio using the linked block mode, since you're doing a lot of small writes and thus the block size will default to the minimum (64KB).

I'm assuming that if the encoder is referencing previous blocks, it's doing more work (rather than e.g. looking only at the current block). Depending on the win: if it's a few %, I'm not sure it's worth doing more work on the profiling sample write path vs the memory savings, since it's not a lot of data anyway.
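For reference, the linked-block suggestion might look something like this, assuming the `lz4_flex` frame API (which is itself an assumption; the PR only shows `FrameEncoder<Vec<u8>>`):

```rust
use lz4_flex::frame::{BlockMode, FrameEncoder, FrameInfo};

fn linked_encoder() -> FrameEncoder<Vec<u8>> {
    // Linked mode lets the compressor match against data in previous
    // blocks, which can help when each write is far smaller than a block.
    let frame_info = FrameInfo::new().block_mode(BlockMode::Linked);
    FrameEncoder::with_frame_info(frame_info, Vec::new())
}
```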
```rust
} else {
    // Create buffer, pre-touch all elements to avoid any laziness
    let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
```
Make this a configurable constant
Do you mind a build-time constant? Or something that can be configured dynamically by the library?
buildtime is fine
Added in 6157a90
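Presumably something along these lines (the constant name is a guess, not necessarily what commit 6157a90 uses):

```rust
// Initial capacity of the compressed timeline buffer. Pre-sizing it up
// front avoids the backing vec growing in many small steps later.
const TIMESTAMPED_DATA_BUFFER_SIZE: usize = 1_048_576; // 1 MiB

fn make_buffer() -> Vec<u8> {
    // vec![0; n] writes every element, pre-touching all the pages.
    vec![0; TIMESTAMPED_DATA_BUFFER_SIZE]
}
```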
```rust
// SAFETY: we just ensured it has an item above.
let observations = unsafe { self.inner.as_mut().unwrap_unchecked() };
let obs_len = observations.obs_len;

if let Some(ts) = timestamp {
    let trimmed = TrimmedObservation::new(values, obs_len);
    observations.timestamped_data.push((sample, ts, trimmed));
    observations.compressed_timestamped_data.write_all(&(Id::into_raw_id(sample.stacktrace) as u32).to_ne_bytes()).unwrap();
```
Ergonomics: you could put the data into a `repr(packed)` struct, convert that to bytes, and then read that in and out. My bias is that this would be a touch cleaner, but YMMV.
That sounds good -- Rust was just fighting me so much that I ended up with whatever was the simplest thing I could get going :)
I ended up not doing this -- I couldn't find a nice way of converting the struct to bytes that didn't involve pulling in a serializer or basically writing the same field-by-field code, just elsewhere in the code. I wasn't convinced it was worth the extra indirection; suggestions welcome.
https://docs.rs/bytemuck/1.14.1/bytemuck/fn.bytes_of.html does it. Your call if this is more or less elegant than doing it field by field:

`let bytes: &[u8] = bytemuck::bytes_of(&my_struct);`

and then you'd convert back with https://docs.rs/bytemuck/1.14.1/bytemuck/fn.try_from_bytes.html
I gave `bytemuck` a stab for 5-10 mins, but I wasn't quite understanding how to integrate it, and the documentation is not great (and I'm being nice...), so I'll leave it as-is for now.
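For future reference, a minimal sketch of what the `bytemuck` round-trip might look like. The struct and its fields are hypothetical, mirroring the 4 + 4 + 8-byte record layout discussed earlier; this needs the bytemuck crate with its `derive` feature:

```rust
use bytemuck::{Pod, Zeroable};

// #[repr(C)] with no padding is what makes deriving Pod legal here
// (u32 + u32 + i64 packs to exactly 16 bytes with no gaps).
#[derive(Clone, Copy, Pod, Zeroable)]
#[repr(C)]
struct TimestampedRecord {
    stacktrace_id: u32,
    labels_id: u32,
    timestamp: i64,
}

fn roundtrip() {
    let record = TimestampedRecord { stacktrace_id: 1, labels_id: 2, timestamp: 3 };
    // Struct -> bytes, no copy:
    let bytes: &[u8] = bytemuck::bytes_of(&record);
    // Bytes -> struct, with size and alignment checked at runtime:
    let back: &TimestampedRecord = bytemuck::try_from_bytes(bytes).expect("cast failed");
    assert_eq!(back.timestamp, record.timestamp);
}
```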
What is the CPU / latency impact of doing lz4 writes on every sample operation?
I've tested this with the `arenas-repro-v0.rb` reproducer and the results are the same.
I've created a Ruby benchmark for this. Results are in iterations per second; each iteration of the benchmark simulates 1 minute of sampling (6000 samples * number of active threads). At the end of each iteration, the profile gets reset (values don't include serialization). TL;DR: it seems to be close. (Bonus note: the benchmark looks like it's serializing, but I manually slipped in a `Profile_reset` call in the C code to empty out the profile, as we don't usually expose these APIs to Ruby.)
This struct abstracts away the underlying compressed storage, and exposes an iterator that can be used to provide the existing behavior used by the pprof serialization. Also tweak and restore the tests.
I've pushed a new revision that displays similar performance in my reproducer apps, but with a much nicer code structure. Let me know your thoughts! P.S.: I know there's a trivial merge conflict, but I've started this branch from v5.0.0 on purpose, to have a full apples-to-apples comparison. Once we're happy with the PR, I'll do a quick merge from master and resolve any conflicts.
The `malloc_stats()` API has been useful to debug issues with memory fragmentation (see <DataDog/libdatadog#293>) so let's keep a helper to make it easy to access.
**What does this PR do?**

This PR enables the timeline feature for the Continuous Profiler by default. It introduces a new setting, `timeline_enabled`, which can also be controlled by setting the `DD_PROFILING_TIMELINE_ENABLED` environment variable. It also deprecates the old `experimental_timeline_enabled` + its environment variable counterpart.

**Motivation:**

With the latest round of improvements to libdatadog, we're confident about enabling this feature by default.

**Additional Notes:**

I'm opening this PR as a draft as it still depends on DataDog/libdatadog#293 being released + a few more rounds of internal validation.

**How to test the change?**

Try to profile any application and validate that you get timeline data without further configuration.
Now these functions return actual errors.
LGTM, after a few ⛏️
I have yet to do a proper review, but a "well done" is due: DataDog/dd-trace-php#2513 (comment)
The tests were actually passing for the wrong reasons, since the tests had not been changed to use `Observations::new`. They are now passing for the right reasons (I hope).
Thanks everyone for the help! I'm going to go ahead and merge this one. The miri failure in CI is unrelated, and I'll prepare a follow-up PR to fix it.