
Reduce chunk write queue memory usage 2 #10874

Merged · 3 commits · Jun 29, 2022

Conversation

pstibrany
Contributor

@pstibrany pstibrany commented Jun 16, 2022

This PR reimplements the `chan chunkWriteJob` with a custom buffered queue that should use less memory, because it doesn't preallocate the entire buffer for the maximum queue size at once. Instead, it allocates individual "segments" of smaller size.

As elements are added to the queue, they fill individual segments. When elements are removed from the queue (and its segments), empty segments can be thrown away. This doesn't change the memory usage of the queue when it's full, but it decreases the queue's memory footprint when it's empty (the queue keeps at most one segment in that case).
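The idea described above can be sketched roughly as follows. This is a hypothetical, simplified illustration in Go: the names `segmentedQueue` and `segment`, the `int` element type, and the tiny `segmentSize` are mine, not the PR's actual code, which stores `chunkWriteJob` values and uses much larger segments.

```go
package main

import "fmt"

// segmentSize is deliberately tiny here for illustration; the real queue
// uses segments holding thousands of jobs.
const segmentSize = 4

type segment struct {
	elems       []int
	read, write int // positions consumed / filled within elems
	next        *segment
}

type segmentedQueue struct {
	head, tail *segment
	size       int
}

func newSegment() *segment { return &segment{elems: make([]int, segmentSize)} }

// push appends an element, lazily allocating a new segment only when the
// current tail segment is full (instead of preallocating full capacity).
func (q *segmentedQueue) push(v int) {
	if q.tail == nil {
		s := newSegment()
		q.head, q.tail = s, s
	}
	if q.tail.write == segmentSize {
		s := newSegment()
		q.tail.next = s
		q.tail = s
	}
	q.tail.elems[q.tail.write] = v
	q.tail.write++
	q.size++
}

// pop removes the oldest element; a fully consumed head segment is unlinked
// so its memory can be reclaimed by the garbage collector.
func (q *segmentedQueue) pop() (int, bool) {
	if q.size == 0 {
		return 0, false
	}
	s := q.head
	v := s.elems[s.read]
	s.read++
	q.size--
	if s.read == segmentSize {
		q.head = s.next
		if q.head == nil {
			q.tail = nil
		}
	}
	return v, true
}

func main() {
	q := &segmentedQueue{}
	for i := 0; i < 10; i++ {
		q.push(i)
	}
	for {
		v, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println(v)
	}
}
```

Because consumed head segments are unlinked, an idle queue holds at most one segment, while a full queue still allocates the same total capacity as a preallocated buffer would.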

This PR comes from grafana/mimir-prometheus#247. We use it in Grafana Mimir to reduce memory usage in Mimir clusters with thousands of open TSDBs in a single process.

Related to #10873.

@pstibrany pstibrany requested a review from codesome as a code owner June 16, 2022 08:30
@pstibrany pstibrany changed the title from "Job queue" to "Reduce chunk write queue memory usage" Jun 16, 2022
@pstibrany pstibrany changed the title from "Reduce chunk write queue memory usage" to "Reduce chunk write queue memory usage 2" Jun 16, 2022
@pstibrany
Contributor Author

pstibrany commented Jun 16, 2022

See #10873 (comment) for first part of the story.

After applying the fix from #10873 and then the fix from this PR at ~14:00 to ingester-zone-b, we've got:

[graph omitted]

We can see that applying both fixes (#10873 and the one from this PR) reduced the memory usage of the new chunk mapper to the same values as the old chunk mapper.

(In this case, each zone had 20 TSDBs open altogether.)

@pstibrany pstibrany force-pushed the chunk-mapper-chan branch 2 times, most recently from c7e3bdd to af02c17 Compare June 16, 2022 09:27
@pstibrany
Contributor Author

pstibrany commented Jun 16, 2022

Here is a graph from a much larger Mimir deployment with 5200 open TSDBs in each zone, after enabling the new chunk mapper with both fixes in ingester-zone-a, while zones B and C still use the old chunk mapper. The graph starts with the rollout in zone-a and shows cumulative memory usage for each zone (in GB):

[graph omitted]

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Member

@bwplotka bwplotka left a comment


This looks good, thanks!

However, I don't see the mentioned improvement in the second graph: #10874 (comment) - do you mean those small steps down when the load was potentially smaller?

Anyway, it's great to reduce the allocation when the queue is smaller, so LGTM. Thanks 💪🏽

Just FYI, the only negative I see is that linked lists can generally thrash CPU caches when iterating, and we iterate over jobs to process them. However, given that each chunk write job requires a disk write (so higher I/O), the linked-list overhead probably does not matter. Did you maybe check the impact of this on latency for writing those chunks (queue filling up quicker)? I wonder if we even have a means to measure this currently. 🤔

Review comments (outdated, resolved) on:
tsdb/chunks/chunk_write_queue.go
tsdb/chunks/queue.go
@pstibrany
Contributor Author

pstibrany commented Jun 29, 2022

This looks good, thanks!

Thank you for review!

However, I don't see the mentioned improvement in the second graph: #10874 (comment) - do you mean those small steps down when the load was potentially smaller?

The improvement is in the yellow line. At first you can see it being better (lower) than green (zone-a), but worse than blue (zone-c). After applying the fix from this PR (roughly at 14 UTC on the graph), it got to the same level as blue (zone-c, running the old chunk mapper).

Anyway, it's great to reduce the allocation when the queue is smaller, so LGTM. Thanks 💪🏽

Just FYI, the only negative I see is that linked lists can generally thrash CPU caches when iterating, and we iterate over jobs to process them. However, given that each chunk write job requires a disk write (so higher I/O), the linked-list overhead probably does not matter. Did you maybe check the impact of this on latency for writing those chunks (queue filling up quicker)? I wonder if we even have a means to measure this currently. 🤔

We iterate over a full segment first (a slice of 8192 jobs) before jumping (via the linked list) to the next segment. As you point out, each chunk write job performs I/O. I haven't checked the impact on the latency of writing the chunks, though.
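That access pattern can be sketched like this. This is a hypothetical illustration, not the PR's actual code: the `segment` layout, the `int` element type, and the tiny `segmentSize` are mine, standing in for the real segments of 8192 `chunkWriteJob`s.

```go
package main

import "fmt"

// segmentSize is deliberately tiny for illustration; the PR uses 8192.
const segmentSize = 4

type segment struct {
	elems       []int
	read, write int // positions already consumed / filled within elems
	next        *segment
}

// drain walks the linked list of segments, consuming each segment's slice
// sequentially (cache-friendly within a segment) before following the
// pointer to the next segment.
func drain(head *segment) []int {
	var jobs []int
	for s := head; s != nil; s = s.next {
		for i := s.read; i < s.write; i++ {
			jobs = append(jobs, s.elems[i])
		}
	}
	return jobs
}

func main() {
	// Two hand-built segments: a full one linked to a partially filled one.
	s2 := &segment{elems: []int{8, 9, 0, 0}, write: 2}
	s1 := &segment{elems: []int{4, 5, 6, 7}, write: segmentSize, next: s2}
	fmt.Println(drain(s1)) // jobs come out in insertion order
}
```

The pointer chase happens only once per 8192 jobs, so the cache-unfriendly part of the linked list is amortized over a long run of sequential slice accesses.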

@bwplotka
Member

Restarted windows test flake

@bwplotka
Member

Happy to merge, unless you want to check it @codesome

@codesome
Member

1 less PR to review for me? Yes please!
