
[PROF-8967] Reduce memory footprint and allocations for profiling timeline data #293

Merged
merged 18 commits into main from ivoanjo/timeline-compressed-data on Feb 12, 2024

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Jan 25, 2024

What does this PR do?

This PR is a proof-of-concept for reducing:

  • the number of allocations needed to store profiling timeline data
  • the memory space used to store profiling timeline data

It does so by taking advantage of the following observation: the timeline data is write-only until it's time to serialize.

Thus, instead of keeping a vec of TrimmedTimestampedObservation (which internally keeps another vec), I've decided to just stream all the bytes to an lz4 compressor, and to make sure to size its output buffer so as to avoid growing in very small steps.

At serialization time, we do the reverse, sample by sample, and then immediately write such samples as pprof, reducing the memory footprint needed even during serialization.
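To make the write/read round-trip concrete, here's a minimal standalone sketch of the idea (an illustration, not this PR's actual code), assuming the `lz4_flex` crate for the `FrameEncoder`/`FrameDecoder` types referenced in the diff and `byteorder` for the field-by-field reads and writes; the field names and value count are made up for the example:

```rust
use byteorder::{NativeEndian, ReadBytesExt, WriteBytesExt};
use lz4_flex::frame::{FrameDecoder, FrameEncoder};

fn main() -> std::io::Result<()> {
    // Pre-size the output buffer so the compressor doesn't grow it in tiny steps.
    let buffer: Vec<u8> = Vec::with_capacity(1_048_576);
    let mut encoder = FrameEncoder::new(buffer);

    // Write path: one observation = stacktrace id, labels id, timestamp, N values.
    let (stacktrace_id, labels_id, timestamp) = (1_u32, 2_u32, 1_706_000_000_i64);
    let values = [10_i64, 0, 0, 5];
    encoder.write_u32::<NativeEndian>(stacktrace_id)?;
    encoder.write_u32::<NativeEndian>(labels_id)?;
    encoder.write_i64::<NativeEndian>(timestamp)?;
    for v in values {
        encoder.write_i64::<NativeEndian>(v)?;
    }

    // Serialization path: decompress and replay the samples one at a time.
    let compressed: Vec<u8> = encoder.finish().expect("failed to finish lz4 frame");
    let mut decoder = FrameDecoder::new(&compressed[..]);
    let stack = decoder.read_u32::<NativeEndian>()?;
    let labels = decoder.read_u32::<NativeEndian>()?;
    let ts = decoder.read_i64::<NativeEndian>()?;
    let mut decoded_values = [0_i64; 4];
    for v in decoded_values.iter_mut() {
        *v = decoder.read_i64::<NativeEndian>()?;
    }
    println!("sample: stack={stack} labels={labels} ts={ts} values={decoded_values:?}");
    Ok(())
}
```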

The objective of this PR is to start a discussion on this approach; the code itself is an awful hack and it's probably better that we reimplement this idea from scratch if we do like it.

Note that while I cheated on some of the libdatadog tests (because they used internal details that changed), this branch passes the full Ruby profiler test suite and produces -- as far as I could tell -- valid profiles.

Motivation:

One of our Ruby on Rails test apps was experiencing heavy memory bloat when timeline profiling was enabled.

I'm still doing an (internal) writeup with more details, but the TL;DR is that it's a combination of many threads (100), glibc creating a memory arena for each thread, and libdatadog's use of malloc combined with Ruby's use of malloc causing every one of these arenas to grow a lot.

Here are some quick numbers (more details in the internal writeup) for a reproducer app I wrote that, from my experiments, seems to mimic the behavior of the full Ruby app closely enough:

|             Configuration | RSS usage | Note                                                                                                        |
|--------------------------:|----------:|-------------------------------------------------------------------------------------------------------------|
|   Baseline (no profiling) |  215 552k | Baseline here includes the profiler being instanced, but never sampling, so it's higher than in real apps.  |
| Libdatadog 5, no timeline |  214 396k |                                                                                                             |
|    Libdatadog 5, timeline |  298 640k | Does not stabilize after simulating 15 mins of profiling (i.e. RSS still growing)                           |
|       Prototype, timeline |  223 468k | RSS oscillates between 219 and 223 MiB                                                                      |

Additional Notes:

N/A

How to test the change?

I'll provide instructions to reproduce my experiments soon.

For Reviewers

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners January 25, 2024 11:13
@ivoanjo ivoanjo changed the title PoC: Reduce memory footprint and allocations for profiling timeline data [PROF-8967] PoC: Reduce memory footprint and allocations for profiling timeline data Jan 25, 2024
} else {
// // Create buffer, pre-touch all elements to avoid any laziness
let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
Contributor

This is 1 MiB, aka 1024 KiB. Your tests are showing RSS usage smaller than this. I don't see how this is possible. What am I missing?

Member Author

The values above are in kilobytes; e.g. for the prototype, 223 468k => ~223 MiB.

I actually did make some experiments to try to account for all the memory, sharing here as well:

Prototype, no timeline: ~218 MiB (i.e. since we pre-initialize the space anyway, and the Ruby profiler keeps two profiles active at all times)

Prototype, timeline, no serialization: ~219 MiB (i.e. the extra space seen above is memory needed during serialization, since timeline profiles, even compressed, are bigger)

let labels_id_raw = decompressed_input.read_u32::<NativeEndian>().unwrap();
let timestamp_raw = decompressed_input.read_i64::<NativeEndian>().unwrap();

let mut values: Vec<i64> = Vec::with_capacity(self.sample_types.len());
Contributor

You could declare this outside the loop and do a .clear() at the top of the loop so you only ever allocate one vec
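Something along these lines (a self-contained sketch with placeholder names like `sample_count`, not the PR's code):

```rust
use std::io::Read;

use byteorder::{NativeEndian, ReadBytesExt};

// Hoist the Vec out of the per-sample loop and clear() it on each iteration,
// so only a single allocation is ever made.
fn read_all_samples<R: Read>(
    mut decompressed_input: R,
    sample_count: usize,
    sample_types_len: usize,
) -> std::io::Result<()> {
    let mut values: Vec<i64> = Vec::with_capacity(sample_types_len);
    for _ in 0..sample_count {
        values.clear(); // reuses the existing backing storage
        for _ in 0..sample_types_len {
            values.push(decompressed_input.read_i64::<NativeEndian>()?);
        }
        // ... build the pprof sample from `values` here ...
    }
    Ok(())
}
```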

Member Author
@ivoanjo ivoanjo Jan 26, 2024

Good point. I think that ideally this would all move inside the iterator, so we don't need to have two copies of the code for going through each sample.

I tried doing that but found it too hard with my current Rust-fu level >_>

Member Author

I've moved this to be inside the TimestampedObservationsIter. A Vec still gets created every time, but with limited lifetime (just each iteration). Suggestions welcome on improvements :)

inner.obs_len.assert_eq(values.len());
pub fn init_timeline(&mut self, values_size: usize) {
if let Some(_inner) = &self.inner {
panic!("Should never happen!");
Contributor

Probably better to return an error rather than panic.

Member Author

Of course! I'm guessing we should probably turn this into an Observations::new(...) or something like that? (This was just another ugly hack as I wanted to focus on the prototype and didn't want to take the detour)
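Roughly something like this, I imagine (purely illustrative names and error handling, using anyhow just as an example error type):

```rust
use anyhow::{anyhow, Result};

// Hypothetical stand-in for the real storage type, just to show the shape.
struct TimelineStorage {
    buffer: Vec<u8>,
}

struct Observations {
    inner: Option<TimelineStorage>,
}

impl Observations {
    fn new() -> Self {
        Self { inner: None }
    }

    // Return an error instead of panicking when initialization happens twice.
    fn init_timeline(&mut self, values_size: usize) -> Result<()> {
        if self.inner.is_some() {
            return Err(anyhow!("timeline storage was already initialized"));
        }
        // Illustrative use of values_size only; the real code sizes things differently.
        self.inner = Some(TimelineStorage {
            buffer: Vec::with_capacity(values_size),
        });
        Ok(())
    }
}
```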

Member Author

Cleaned all this up in the latest version!


struct NonEmptyObservations {
aggregated_data: HashMap<Sample, TrimmedObservation>,
timestamped_data: Vec<TrimmedTimestampedObservation>,
compressed_timestamped_data: FrameEncoder<Vec<u8>>,
Contributor

How much do we save by compressing vs just using an array of values, and indices into there?

Member Author

We don't actually need to index into the array (which is why compressing it works).

On the "how much do we save", here's some back-of-the-napkin numbers:

  • Size of each observation: 4 bytes stacktrace, 4 bytes labels, 8 bytes timestamp, 8 bytes * N profile types

  • 100 threads * 100 samples per second * 60 seconds * (4 + 4 + 8 + 8 * 4 profile types enabled for Ruby by default) = 28 800 000 bytes

  • For... reasons... the test app ends up recording data for 103 threads so I get

    • Uncompressed 29 MiB
    • Compressed 4 MiB

Which seems like a nice improvement as well. This data is highly compressible (lots of small numbers, zeros, numbers next to each other, ...), so we could make a smarter uncompressed representation, but I think the compressor takes care of that for us very well.
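Same back-of-the-napkin arithmetic as a tiny snippet, for anyone who wants to check it (same assumptions as above: 100 threads, 100 samples/s, 60 s, 4 value types):

```rust
// Size of one observation and the number of observations recorded per minute.
const BYTES_PER_OBSERVATION: usize = 4 /* stacktrace id */ + 4 /* labels id */ + 8 /* timestamp */ + 8 * 4 /* values */;
const OBSERVATIONS_PER_MINUTE: usize = 100 /* threads */ * 100 /* samples/s */ * 60 /* seconds */;

fn main() {
    // 48 * 600_000 = 28_800_000 bytes (~27.5 MiB); with the 103 threads actually
    // observed, this grows to roughly the ~29 MiB uncompressed figure quoted above.
    println!("{} bytes per minute", BYTES_PER_OBSERVATION * OBSERVATIONS_PER_MINUTE);
}
```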

Contributor

Have you tried compressing in a Struct of Arrays format, and compress each array of fields of the TrimmedTimestampedObservation struct independently?
I'd think that this would give you a better compression ratio since the data would be more homogenous

Contributor

Also, you might get a better compression ratio using the Linked block mode since you're doing a lot of small writes and thus the block size will default to the minimum (64KB)

Member Author

I think those are very valid options; having said that, I'm not convinced they are what we would want here:

Have you tried compressing in a Struct of Arrays format, and compress each array of fields of the TrimmedTimestampedObservation struct independently?
I'd think that this would give you a better compression ratio since the data would be more homogenous

If I'm understanding your suggestion, doing this would mean having multiple FrameEncoders, with multiple underlying buffers:

struct NonEmptyObservations {
    stack_ids: FrameEncoder<Vec<u8>>,
    label_ids: FrameEncoder<Vec<u8>>,
    timestamps: FrameEncoder<Vec<u8>>,
    values: FrameEncoder<Vec<u8>>,
    // ...
}

This would mean more allocations, especially when growing the backing vecs -- there's no longer one single vec that gets doubled, but multiple small ones. 🤔

I suspect (without having tried it, to be fair xD) this would cause more heap fragmentation.

Also, you might get a better compression ratio using the Linked block mode since you're doing a lot of small writes and thus the block size will default to the minimum (64KB)

I'm assuming that if the encoder is referencing previous blocks, it means that it's doing more work (rather than e.g. looking only at the current block).

Depending on the win: if it's a few %, I'm not sure it's worth doing more work on the profiling sample write path vs the memory savings, since it's not a lot of data anyway.

} else {
// // Create buffer, pre-touch all elements to avoid any laziness
let mut timestamped_data_buffer: Vec<u8> = vec![0; 1_048_576];
Contributor

Make this a configurable constant

Member Author

Do you mind a build-time constant? Or something that can be configured dynamically by the library?

Contributor

buildtime is fine

Member Author

Added in 6157a90
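For reference, the shape of the idea is roughly this (name, doc comment, and placement are illustrative; the actual constant is whatever landed in 6157a90):

```rust
/// Initial size of the compressed timeline buffer (1 MiB). The buffer is pre-touched
/// (filled with zeros) so its pages are faulted in up front rather than lazily while
/// sampling.
const DEFAULT_TIMELINE_BUFFER_SIZE: usize = 1_048_576;

fn new_timeline_buffer() -> Vec<u8> {
    vec![0; DEFAULT_TIMELINE_BUFFER_SIZE]
}
```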


// SAFETY: we just ensured it has an item above.
let observations = unsafe { self.inner.as_mut().unwrap_unchecked() };
let obs_len = observations.obs_len;

if let Some(ts) = timestamp {
let trimmed = TrimmedObservation::new(values, obs_len);
observations.timestamped_data.push((sample, ts, trimmed));
observations.compressed_timestamped_data.write_all(&(Id::into_raw_id(sample.stacktrace) as u32).to_ne_bytes()).unwrap();
Contributor

ergonomics: you could put the data into a repr(packed) struct, convert that to bytes, and then read that in and out. My bias is that this would be a touch cleaner, but YMMV

Member Author

That sounds good -- Rust was fighting me so much that I ended up with whatever was the simplest thing I could get going :)

Member Author

I ended up not doing this -- I couldn't find a nice way of converting the struct to bytes that didn't involve pulling in a serializer or basically writing the same field-by-field code, just elsewhere in the code.

I wasn't convinced it was worth the extra indirection, suggestions welcome.

Contributor
@danielsn danielsn Feb 2, 2024

https://docs.rs/bytemuck/1.14.1/bytemuck/fn.bytes_of.html does it. Your call if this is more or less elegant than doing it field by field

let bytes: &[u8] = bytemuck::bytes_of(&my_struct)

and then you'd convert back with https://docs.rs/bytemuck/1.14.1/bytemuck/fn.try_from_bytes.html
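A rough sketch of how that could look (hypothetical record type; this PR ultimately kept the field-by-field writes). It needs bytemuck's `derive` feature, and the field order is chosen so the struct has no padding, which `Pod` requires:

```rust
use bytemuck::{Pod, Zeroable};

// Hypothetical record; putting the i64 first avoids padding in repr(C).
#[derive(Copy, Clone, Pod, Zeroable)]
#[repr(C)]
struct TimestampedRecord {
    timestamp: i64,
    stacktrace_id: u32,
    labels_id: u32,
}

fn main() {
    let record = TimestampedRecord { timestamp: 1_706_000_000, stacktrace_id: 1, labels_id: 2 };

    // Struct -> bytes, e.g. to hand to the lz4 FrameEncoder.
    let bytes: &[u8] = bytemuck::bytes_of(&record);

    // Bytes -> struct. Note that try_from_bytes also checks alignment, so reading at an
    // arbitrary offset of a decompressed buffer may need a copying read instead.
    let decoded: &TimestampedRecord = bytemuck::try_from_bytes(bytes).expect("size/alignment mismatch");
    assert_eq!(decoded.timestamp, record.timestamp);
}
```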

Member Author

I took a stab at bytemuck for 5-10 mins, but I wasn't quite understanding how to integrate it; the documentation is not great (and I'm being nice...), so I'll leave it as-is for now.

@danielsn
Contributor

What is the CPU / latency impact of doing lz4 writes on every sample operation?

I've tested this with the `arenas-repro-v0.rb` reproducer and the
results are the same.
@ivoanjo
Member Author

ivoanjo commented Jan 26, 2024

What is the CPU / latency impact of doing lz4 writes on every sample operation?

I've created a Ruby benchmark for this:

| Config                   | Result                                      |
|--------------------------|---------------------------------------------|
| Baseline (No timeline)   | 14.182 (± 0.0%) i/s - 142.000 in 10.014558s |
| Timeline + Libdatadog 5  | 14.925 (± 6.7%) i/s - 150.000 in 10.061879s |
| Timeline + Prototype     | 13.933 (± 0.0%) i/s - 140.000 in 10.055976s |

Results are in iterations per second. Each iteration of the benchmark simulates 1 minute of sampling (6000 samples * number of active threads). At the end of each iteration, the profile gets reset (values don't include serialization).

TL;DR It seems to be close.

(Bonus note: The benchmark looks like it's serializing, but I manually slipped in a Profile_reset call in the C code to empty out the profile, as we don't quite expose these APIs to Ruby usually)



This struct abstracts away the underlying compressed storage, and
exposes an iterator that can be used to provide the existing behavior
used by the pprof serialization.

Also tweak and restore the tests.
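Roughly the shape such an abstraction takes (illustrative only; field names and the exact item type of the real `TimestampedObservationsIter` in the PR may differ), again assuming `lz4_flex` and `byteorder`:

```rust
use byteorder::{NativeEndian, ReadBytesExt};
use lz4_flex::frame::FrameDecoder;

pub struct TimestampedObservationsIter {
    decoder: FrameDecoder<std::io::Cursor<Vec<u8>>>,
    values_per_sample: usize,
}

impl Iterator for TimestampedObservationsIter {
    // (stacktrace id, labels id, timestamp, values)
    type Item = (u32, u32, i64, Vec<i64>);

    fn next(&mut self) -> Option<Self::Item> {
        // A failed read simply means every compressed sample has been consumed.
        let stacktrace_id = self.decoder.read_u32::<NativeEndian>().ok()?;
        let labels_id = self.decoder.read_u32::<NativeEndian>().ok()?;
        let timestamp = self.decoder.read_i64::<NativeEndian>().ok()?;
        let mut values = Vec::with_capacity(self.values_per_sample);
        for _ in 0..self.values_per_sample {
            values.push(self.decoder.read_i64::<NativeEndian>().ok()?);
        }
        Some((stacktrace_id, labels_id, timestamp, values))
    }
}
```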
@ivoanjo
Member Author

ivoanjo commented Jan 31, 2024

I've pushed a new revision that displays similar performance in my reproducer apps, but with a much nicer code structure. Let me know your thoughts!

P.S.: I know there's a trivial merge conflict, but I've purposely started this branch from v5.0.0 to have a full apples-to-apples comparison. Once we're happy with the PR, I'll do a quick merge from master and resolve any conflicts.

ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Jan 31, 2024
The `malloc_stats()` API has been useful to debug issues with
memory fragmentation (see
<DataDog/libdatadog#293>) so let's keep a
helper to make it easy to access.
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 1, 2024
**What does this PR do?**

This PR enables the timeline feature for the Continuous Profiler by
default.

It introduces a new setting, `timeline_enabled`, which can also be
controlled by setting the `DD_PROFILING_TIMELINE_ENABLED` environment
variable.

It also deprecates the old `experimental_timeline_enabled` + its
environment variable counterpart.

**Motivation:**

With the latest round of improvements to libdatadog, we're confident
about enabling this feature by default.

**Additional Notes:**

I'm opening this PR as a draft as it still depends on
DataDog/libdatadog#293 being released +
a few more rounds of internal validation.

**How to test the change?**

Try to profile any application, validate that you get timeline
data without further configuration.
@github-actions github-actions bot added the profiling Relates to the profiling* modules. label Feb 9, 2024
@ivoanjo ivoanjo requested a review from danielsn February 9, 2024 09:42
@ivoanjo ivoanjo changed the title [PROF-8967] PoC: Reduce memory footprint and allocations for profiling timeline data [PROF-8967] educe memory footprint and allocations for profiling timeline data Feb 9, 2024
@ivoanjo ivoanjo changed the title [PROF-8967] educe memory footprint and allocations for profiling timeline data [PROF-8967] Reduce memory footprint and allocations for profiling timeline data Feb 9, 2024
Contributor
@danielsn danielsn left a comment

LGTM, after a few ⛏️

@realFlowControl
Member

realFlowControl commented Feb 12, 2024

I have yet to do a proper review, but a "well done" is due: DataDog/dd-trace-php#2513 (comment)

scenario:php-profiler-timeline-memory-with-profiler-and-timeline
🟩 max_rss_usage [-6.926MB; -6.379MB] or [-18.431%; -16.975%]

The tests were actually passing for the wrong reasons, since the tests
had not been changed to use `Observations::new`.

They are now passing for the right reasons (I hope).
@ivoanjo
Member Author

ivoanjo commented Feb 12, 2024

Thanks everyone for the help! I'm going to go ahead and merge this one. The miri failure in CI is unrelated, and I'll prepare a follow-up PR to fix it.

@ivoanjo ivoanjo merged commit ac30a9f into main Feb 12, 2024
18 of 19 checks passed
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 21, 2024
@bantonsson bantonsson deleted the ivoanjo/timeline-compressed-data branch March 7, 2024 07:15