
[PROF-8967] Reduce memory footprint and allocations for profiling timeline data #293

Merged
merged 18 commits into main from ivoanjo/timeline-compressed-data on Feb 12, 2024

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Jan 25, 2024

What does this PR do?

This PR is a proof-of-concept for reducing:

  • the number of allocations needed to store profiling timeline data
  • the memory space used to store profiling timeline data

It does so by taking advantage of the following observation: the timeline data is write-only until it's time to serialize.

Thus, instead of keeping a vec of TrimmedTimestampedObservation (which internally keeps another vec), I've decided to just stream all the bytes to an lz4 compressor, and to make sure to size its output buffer so as to avoid growing in very small steps.

At serialization time, we do the reverse, sample by sample, and then immediately write such samples as pprof, reducing the memory footprint needed even during serialization.
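To make the write/read round-trip concrete, here's a minimal standalone sketch of the idea (an illustration, not this PR's actual code), assuming the `lz4_flex` crate for the `FrameEncoder`/`FrameDecoder` types referenced in the diff and `byteorder` for the field-by-field reads and writes; the field names and value count are made up for the example:

```rust
use byteorder::{NativeEndian, ReadBytesExt, WriteBytesExt};
use lz4_flex::frame::{FrameDecoder, FrameEncoder};

fn main() -> std::io::Result<()> {
    // Pre-size the output buffer so the compressor doesn't grow it in tiny steps.
    let buffer: Vec<u8> = Vec::with_capacity(1_048_576);
    let mut encoder = FrameEncoder::new(buffer);

    // Write path: one observation = stacktrace id, labels id, timestamp, N values.
    let (stacktrace_id, labels_id, timestamp) = (1_u32, 2_u32, 1_706_000_000_i64);
    let values = [10_i64, 0, 0, 5];
    encoder.write_u32::<NativeEndian>(stacktrace_id)?;
    encoder.write_u32::<NativeEndian>(labels_id)?;
    encoder.write_i64::<NativeEndian>(timestamp)?;
    for v in values {
        encoder.write_i64::<NativeEndian>(v)?;
    }

    // Serialization path: decompress and replay the samples one at a time.
    let compressed: Vec<u8> = encoder.finish().expect("failed to finish lz4 frame");
    let mut decoder = FrameDecoder::new(&compressed[..]);
    let stack = decoder.read_u32::<NativeEndian>()?;
    let labels = decoder.read_u32::<NativeEndian>()?;
    let ts = decoder.read_i64::<NativeEndian>()?;
    let mut decoded_values = [0_i64; 4];
    for v in decoded_values.iter_mut() {
        *v = decoder.read_i64::<NativeEndian>()?;
    }
    println!("sample: stack={stack} labels={labels} ts={ts} values={decoded_values:?}");
    Ok(())
}
```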

The objective of this PR is to start a discussion on this approach; the code itself is an awful hack and it's probably better that we reimplement this idea from scratch if we do like it.

Note that while I cheated on some of the libdatadog tests (because they used internal details that changed), this branch passes the full Ruby profiler test suite and produces -- as far as I could tell -- valid profiles.

Motivation:

One of our Ruby on Rails test apps was experiencing heavy memory bloat when timeline profiling was enabled.

I'm still doing an (internal) writeup with more details, but the TL;DR is that it's a combination of many threads (100), glibc creating a memory arena for each thread, and libdatadog's use of malloc combined with Ruby's use of malloc causing every one of these arenas to grow a lot.

Here are some quick numbers (more details in the internal writeup) for a reproducer app I wrote that, from my experiments, seems to mimic the behavior of the full Ruby app closely enough:

|             Configuration | RSS usage | Note                                                                                                        |
|--------------------------:|----------:|-------------------------------------------------------------------------------------------------------------|
|   Baseline (no profiling) |  215 552k | Baseline here includes the profiler being instanced, but never sampling, so it's higher than in real apps.  |
| Libdatadog 5, no timeline |  214 396k |                                                                                                             |
|    Libdatadog 5, timeline |  298 640k | Does not stabilize after simulating 15 mins of profiling (i.e. RSS still growing)                           |
|       Prototype, timeline |  223 468k | RSS oscillates between 219 and 223 MiB                                                                      |

Additional Notes:

N/A

How to test the change?

I'll provide instructions to reproduce my experiments soon.

For Reviewers

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners January 25, 2024 11:13
@ivoanjo ivoanjo changed the title PoC: Reduce memory footprint and allocations for profiling timeline data [PROF-8967] PoC: Reduce memory footprint and allocations for profiling timeline data Jan 25, 2024
} else {
// // Create buffer, pre-touch all elements to avoid any laziness
let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
Contributor

This is 1 MiB, aka 1024 KiB. Your tests are showing RSS usage smaller than this. I don't see how this is possible. What am I missing?

Member Author

The values above are in kilobytes; e.g. for the prototype, 223 468k => ~223 MiB.

I actually did make some experiments to try to account for all the memory, sharing here as well:

Prototype, no timeline: ~218 MiB (i.e. since we pre-initialize the space anyway, and the Ruby profiler keeps two profiles active at all times)

Prototype, timeline, no serialization: ~219 MiB (i.e. the extra space seen above is memory needed during serialization, since timeline profiles, even compressed, are bigger)

let labels_id_raw = decompressed_input.read_u32::<NativeEndian>().unwrap();
let timestamp_raw = decompressed_input.read_i64::<NativeEndian>().unwrap();

let mut values: Vec<i64> = Vec::with_capacity(self.sample_types.len());
Contributor

You could declare this outside the loop and do a .clear() at the top of the loop so you only ever allocate one vec
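Something along these lines (a self-contained sketch with placeholder names like `sample_count`, not the PR's code):

```rust
use std::io::Read;

use byteorder::{NativeEndian, ReadBytesExt};

// Hoist the Vec out of the per-sample loop and clear() it on each iteration,
// so only a single allocation is ever made.
fn read_all_samples<R: Read>(
    mut decompressed_input: R,
    sample_count: usize,
    sample_types_len: usize,
) -> std::io::Result<()> {
    let mut values: Vec<i64> = Vec::with_capacity(sample_types_len);
    for _ in 0..sample_count {
        values.clear(); // reuses the existing backing storage
        for _ in 0..sample_types_len {
            values.push(decompressed_input.read_i64::<NativeEndian>()?);
        }
        // ... build the pprof sample from `values` here ...
    }
    Ok(())
}
```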

Member Author
@ivoanjo ivoanjo Jan 26, 2024

Good point. I think that ideally this would all move inside the iterator, so we don't need to have two copies of the code for going through each sample.

I tried doing that but found it too hard with my current Rust-fu level >_>

Member Author

I've moved this to be inside the TimestampedObservationsIter. A Vec still gets created every time, but with limited lifetime (just each iteration). Suggestions welcome on improvements :)

inner.obs_len.assert_eq(values.len());
pub fn init_timeline(&mut self, values_size: usize) {
if let Some(_inner) = &self.inner {
panic!("Should never happen!");
Contributor

Probably better to return an error rather than panic.

Member Author

Of course! I'm guessing we should probably turn this into an Observations::new(...) or something like that? (This was just another ugly hack as I wanted to focus on the prototype and didn't want to take the detour)
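Roughly something like this, I imagine (purely illustrative names and error handling, using anyhow just as an example error type):

```rust
use anyhow::{anyhow, Result};

// Hypothetical stand-in for the real storage type, just to show the shape.
struct TimelineStorage {
    buffer: Vec<u8>,
}

struct Observations {
    inner: Option<TimelineStorage>,
}

impl Observations {
    fn new() -> Self {
        Self { inner: None }
    }

    // Return an error instead of panicking when initialization happens twice.
    fn init_timeline(&mut self, values_size: usize) -> Result<()> {
        if self.inner.is_some() {
            return Err(anyhow!("timeline storage was already initialized"));
        }
        // Illustrative use of values_size only; the real code sizes things differently.
        self.inner = Some(TimelineStorage {
            buffer: Vec::with_capacity(values_size),
        });
        Ok(())
    }
}
```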

Member Author

Cleaned all this up in the latest version!


struct NonEmptyObservations {
aggregated_data: HashMap<Sample, TrimmedObservation>,
timestamped_data: Vec<TrimmedTimestampedObservation>,
compressed_timestamped_data: FrameEncoder<Vec<u8>>,
Contributor

How much do we save by compressing vs just using an array of values, and indices into there?

Member Author

We don't actually need to index into the array (which is why compressing it works).

On the "how much do we save", here's some back-of-the-napkin numbers:

  • Size of each observation: 4 bytes stacktrace, 4 bytes labels, 8 bytes timestamp, 8 bytes * N profile types

  • 100 threads * 100 samples per second * 60 seconds * (4 + 4 + 8 + 8 * 4 profile types enabled for Ruby by default) = 28 800 000 bytes

  • For... reasons... the test app ends up recording data for 103 threads so I get

    • Uncompressed 29 MiB
    • Compressed 4 MiB

Which seems like a nice improvement as well. This data is highly compressible (lots of small numbers, zeros, numbers next to each other, ...), so we could make a smarter uncompressed representation, but I think the compressor takes care of that for us very well.
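Same back-of-the-napkin arithmetic as a tiny snippet, for anyone who wants to check it (same assumptions as above: 100 threads, 100 samples/s, 60 s, 4 value types):

```rust
// Size of one observation and the number of observations recorded per minute.
const BYTES_PER_OBSERVATION: usize = 4 /* stacktrace id */ + 4 /* labels id */ + 8 /* timestamp */ + 8 * 4 /* values */;
const OBSERVATIONS_PER_MINUTE: usize = 100 /* threads */ * 100 /* samples/s */ * 60 /* seconds */;

fn main() {
    // 48 * 600_000 = 28_800_000 bytes (~27.5 MiB); with the 103 threads actually
    // observed, this grows to roughly the ~29 MiB uncompressed figure quoted above.
    println!("{} bytes per minute", BYTES_PER_OBSERVATION * OBSERVATIONS_PER_MINUTE);
}
```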

Contributor

Have you tried compressing in a Struct of Arrays format, and compress each array of fields of the TrimmedTimestampedObservation struct independently?
I'd think that this would give you a better compression ratio since the data would be more homogenous

Contributor

Also, you might get a better compression ratio using the Linked block mode since you're doing a lot of small writes and thus the block size will default to the minimum (64KB)

Member Author

I think those are very valid options; having said that, I'm not convinced they are what we would want here:

Have you tried compressing in a Struct of Arrays format, and compress each array of fields of the TrimmedTimestampedObservation struct independently?
I'd think that this would give you a better compression ratio since the data would be more homogenous

If I'm understanding your suggestion, doing this would mean having multiple FrameEncoders, with multiple underlying buffers:

struct NonEmptyObservations {
    stack_ids: FrameEncoder<Vec<u8>>,
    label_ids: FrameEncoder<Vec<u8>>,
    timestamps: FrameEncoder<Vec<u8>>,
    values: FrameEncoder<Vec<u8>>,
    // ...
}

This would mean more allocations, especially when growing the backing vecs -- there's no longer one single vec that gets doubled, but multiple small ones. 🤔

I suspect (without having tried it, to be fair xD) this would cause more heap fragmentation.

Also, you might get a better compression ratio using the Linked block mode since you're doing a lot of small writes and thus the block size will default to the minimum (64KB)

I'm assuming that if the encoder is referencing previous blocks, it means that it's doing more work (rather than e.g. looking only at the current block).

Depending on the win: if it's a few %, I'm not sure it's worth doing more work on the profiling sample write path vs the memory savings, since it's not a lot of data anyway.

} else {
// // Create buffer, pre-touch all elements to avoid any laziness
let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
Contributor

Make this a configurable constant

Member Author

Do you mind a build-time constant? Or something that can be configured dynamically by the library?

Contributor

buildtime is fine

Member Author

Added in 6157a90
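For reference, the shape of the idea is roughly this (name, doc comment, and placement are illustrative; the actual constant is whatever landed in 6157a90):

```rust
/// Initial size of the compressed timeline buffer (1 MiB). The buffer is pre-touched
/// (filled with zeros) so its pages are faulted in up front rather than lazily while
/// sampling.
const DEFAULT_TIMELINE_BUFFER_SIZE: usize = 1_048_576;

fn new_timeline_buffer() -> Vec<u8> {
    vec![0; DEFAULT_TIMELINE_BUFFER_SIZE]
}
```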


// SAFETY: we just ensured it has an item above.
let observations = unsafe { self.inner.as_mut().unwrap_unchecked() };
let obs_len = observations.obs_len;

if let Some(ts) = timestamp {
let trimmed = TrimmedObservation::new(values, obs_len);
observations.timestamped_data.push((sample, ts, trimmed));
observations.compressed_timestamped_data.write_all(&(Id::into_raw_id(sample.stacktrace) as u32).to_ne_bytes()).unwrap();
Contributor

ergonomics: you could put the data into a repr(packed) struct, convert that to bytes, and then read that in and out. My bias is that this would be a touch cleaner, but YMMV

Member Author

That sounds good -- Rust was fighting me so much that I ended up with whatever was the simplest thing I could get going :)

Member Author

I ended up not doing this -- I couldn't find a nice way of converting the struct to bytes that didn't involve pulling in a serializer or basically writing the same field-by-field code, just elsewhere in the code.

I wasn't convinced it was worth the extra indirection, suggestions welcome.

Contributor
@danielsn danielsn Feb 2, 2024

https://docs.rs/bytemuck/1.14.1/bytemuck/fn.bytes_of.html does it. Your call if this is more or less elegant than doing it field by field

let bytes: &[u8] = bytemuck::bytes_of(&my_struct)

and then you'd convert back with https://docs.rs/bytemuck/1.14.1/bytemuck/fn.try_from_bytes.html
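A rough sketch of how that could look (hypothetical record type; this PR ultimately kept the field-by-field writes). It needs bytemuck's `derive` feature, and the field order is chosen so the struct has no padding, which `Pod` requires:

```rust
use bytemuck::{Pod, Zeroable};

// Hypothetical record; putting the i64 first avoids padding in repr(C).
#[derive(Copy, Clone, Pod, Zeroable)]
#[repr(C)]
struct TimestampedRecord {
    timestamp: i64,
    stacktrace_id: u32,
    labels_id: u32,
}

fn main() {
    let record = TimestampedRecord { timestamp: 1_706_000_000, stacktrace_id: 1, labels_id: 2 };

    // Struct -> bytes, e.g. to hand to the lz4 FrameEncoder.
    let bytes: &[u8] = bytemuck::bytes_of(&record);

    // Bytes -> struct. Note that try_from_bytes also checks alignment, so reading at an
    // arbitrary offset of a decompressed buffer may need a copying read instead.
    let decoded: &TimestampedRecord = bytemuck::try_from_bytes(bytes).expect("size/alignment mismatch");
    assert_eq!(decoded.timestamp, record.timestamp);
}
```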

Member Author

I took a stab at bytemuck for 5-10 mins, but I wasn't quite understanding how to integrate it; the documentation is not great (and I'm being nice...), so I'll leave it as-is for now.

@danielsn
Contributor

What is the CPU / latency impact of doing lz4 writes on every sample operation?

I've tested this with the `arenas-repro-v0.rb` reproducer and the
results are the same.
@ivoanjo
Member Author

ivoanjo commented Jan 26, 2024

What is the CPU / latency impact of doing lz4 writes on every sample operation?

I've created a Ruby benchmark for this:

| Config                   | Result                                      |
|--------------------------|---------------------------------------------|
| Baseline (No timeline)   | 14.182 (± 0.0%) i/s - 142.000 in 10.014558s |
| Timeline + Libdatadog 5  | 14.925 (± 6.7%) i/s - 150.000 in 10.061879s |
| Timeline + Prototype     | 13.933 (± 0.0%) i/s - 140.000 in 10.055976s |

Results are in iterations per second. Each iteration of the benchmark simulates 1 minute of sampling (6000 samples * number of active threads). At the end of each iteration, the profile gets reset (values don't include serialization).

TL;DR It seems to be close.

(Bonus note: The benchmark looks like it's serializing, but I manually slipped in a Profile_reset call in the C code to empty out the profile, as we don't quite expose these APIs to Ruby usually)



This struct abstracts away the underlying compressed storage, and
exposes an iterator that can be used to provide the existing behavior
used by the pprof serialization.

Also tweak and restore the tests.
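Roughly the shape such an abstraction takes (illustrative only; field names and the exact item type of the real `TimestampedObservationsIter` in the PR may differ), again assuming `lz4_flex` and `byteorder`:

```rust
use byteorder::{NativeEndian, ReadBytesExt};
use lz4_flex::frame::FrameDecoder;

pub struct TimestampedObservationsIter {
    decoder: FrameDecoder<std::io::Cursor<Vec<u8>>>,
    values_per_sample: usize,
}

impl Iterator for TimestampedObservationsIter {
    // (stacktrace id, labels id, timestamp, values)
    type Item = (u32, u32, i64, Vec<i64>);

    fn next(&mut self) -> Option<Self::Item> {
        // A failed read simply means every compressed sample has been consumed.
        let stacktrace_id = self.decoder.read_u32::<NativeEndian>().ok()?;
        let labels_id = self.decoder.read_u32::<NativeEndian>().ok()?;
        let timestamp = self.decoder.read_i64::<NativeEndian>().ok()?;
        let mut values = Vec::with_capacity(self.values_per_sample);
        for _ in 0..self.values_per_sample {
            values.push(self.decoder.read_i64::<NativeEndian>().ok()?);
        }
        Some((stacktrace_id, labels_id, timestamp, values))
    }
}
```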
@ivoanjo
Member Author

ivoanjo commented Jan 31, 2024

I've pushed a new revision that displays similar performance in my reproducer apps, but with a much nicer code structure. Let me know your thoughts!

P.S.: I know there's a trivial merge conflict, but I've purposely started this branch from v5.0.0 to have a full apples-to-apples comparison. Once we're happy with the PR, I'll do a quick merge from master and resolve any conflicts.

ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Jan 31, 2024
The `malloc_stats()` API has been useful to debug issues with
memory fragmentation (see
<DataDog/libdatadog#293>) so let's keep a
helper to make it easy to access.
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 1, 2024
**What does this PR do?**

This PR enables the timeline feature for the Continuous Profiler by
default.

It introduces a new setting, `timeline_enabled`, which can also be
controlled by setting the `DD_PROFILING_TIMELINE_ENABLED` environment
variable.

It also deprecates the old `experimental_timeline_enabled` + its
environment variable counterpart.

**Motivation:**

With the latest round of improvements to libdatadog, we're confident
about enabling this feature by default.

**Additional Notes:**

I'm opening this PR as a draft as it still depends on
DataDog/libdatadog#293 being released +
a few more rounds of internal validation.

**How to test the change?**

Try to profile any application, validate that you get timeline
data without further configuration.
@github-actions github-actions bot added the profiling Relates to the profiling* modules. label Feb 9, 2024
@ivoanjo ivoanjo requested a review from danielsn February 9, 2024 09:42
@ivoanjo ivoanjo changed the title [PROF-8967] PoC: Reduce memory footprint and allocations for profiling timeline data [PROF-8967] educe memory footprint and allocations for profiling timeline data Feb 9, 2024
@ivoanjo ivoanjo changed the title [PROF-8967] educe memory footprint and allocations for profiling timeline data [PROF-8967] Reduce memory footprint and allocations for profiling timeline data Feb 9, 2024
Contributor
@danielsn danielsn left a comment

LGTM, after a few ⛏️

@realFlowControl
Member

realFlowControl commented Feb 12, 2024

I have yet to do a proper review, but a "well done" is due: DataDog/dd-trace-php#2513 (comment)

scenario:php-profiler-timeline-memory-with-profiler-and-timeline
🟩 max_rss_usage [-6.926MB; -6.379MB] or [-18.431%; -16.975%]

The tests were actually passing for the wrong reasons, since the tests
had not been changed to use `Observations::new`.

They are now passing for the right reasons (I hope).
@ivoanjo
Member Author

ivoanjo commented Feb 12, 2024

Thanks everyone for the help! I'm going to go ahead and merge this one. The miri failure in CI is unrelated, and I'll prepare a follow-up PR to fix it.

@ivoanjo ivoanjo merged commit ac30a9f into main Feb 12, 2024
18 of 19 checks passed
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 21, 2024
@bantonsson bantonsson deleted the ivoanjo/timeline-compressed-data branch March 7, 2024 07:15