
Integrate seastar memory profiling into RP #10562

Merged

Conversation

StephanDollberg (Member)

RP side of integrating the seastar memory sampling. This will allow us to get a sampled view of the live memory set as consumed by RP.

Issue https://github.com/redpanda-data/core-internal/issues/470 / RFC

Seastar PR is here redpanda-data/seastar#46 (with numbers & flamegraphs)

Individual commits should stand on their own. I haven't added the cmake commit to the updated seastar yet so the build will obviously fail.

There are a couple of open questions from my side:

  • Do we really want the periodic printing? Especially with high shard counts it gets quite noisy.
  • If yes, do we want a separate logger?
  • Do we want the sampling rate to be configurable? IMO either we choose one that works or we just turn it off. I can't imagine many scenarios where X would be fine for one case and Y for the other.
  • Do we want it default off/on at this point?

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • Adds an internal Redpanda memory profiler to help find the root cause of OOM situations

@dotnwat (Member) left a comment

looks awesome. just a few minor comments

Review threads (resolved):

  • src/v/redpanda/memory_sampling.cc
  • src/v/redpanda/memory_sampling.h
  • src/v/redpanda/admin_server.cc
  • tests/rptest/tests/memory_sampling_test.py
@StephanDollberg (Member Author)

@BenPope Have updated the fmt usage. Hopefully more idiomatic now.

@StephanDollberg (Member Author)

Decided yesterday with the team that we only want to log at certain points of memory usage. Hence, I have now added a change to log once when we have 20% of available memory left and once when we have 10% left (per shard).
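The log-once-per-watermark behaviour can be sketched roughly like this; `low_watermark_notifier` and all of its member names are hypothetical, not Redpanda's actual implementation:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of "log once per watermark": fire at most once as
// available memory first drops below 20% and then below 10% of the
// shard's total. Only the 20%/10% thresholds come from the PR itself.
struct low_watermark_notifier {
    std::size_t total_memory;
    std::array<bool, 2> fired{{false, false}}; // one flag per watermark

    // Returns true iff this call crossed a not-yet-reported watermark.
    bool check(std::size_t available) {
        constexpr double marks[] = {0.20, 0.10};
        bool crossed = false;
        for (std::size_t i = 0; i < 2; ++i) {
            if (!fired[i] && available < total_memory * marks[i]) {
                fired[i] = true; // would log top-n allocation sites here
                crossed = true;
            }
        }
        return crossed;
    }
};
```

Repeated checks at the same level stay silent, which is the point: no periodic spam, just one report per threshold.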

@StephanDollberg (Member Author)

@ballard26 I have now made the response a bit more JSON-like, as discussed:

```
} catch (const boost::bad_lexical_cast&) {
    throw ss::httpd::bad_param_exception(
      fmt::format("Invalid parameter 'shard_id' value {{{}}}", e));
}
```
Member

Ugh, so much boilerplate but yeah I guess this is how we roll elsewhere in the file 🤷 .
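For what it's worth, a small helper could fold that repetition; this is a dependency-free sketch using `std::from_chars` in place of `boost::lexical_cast`, and `parse_shard_id` is a made-up name:

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>

// Hypothetical helper to shrink the parse-and-throw boilerplate.
// Returns nullopt on any malformed input; the caller would map that to
// ss::httpd::bad_param_exception in one place instead of per handler.
std::optional<unsigned> parse_shard_id(std::string_view s) {
    unsigned v{};
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), v);
    if (ec != std::errc{} || ptr != s.data() + s.size()) {
        return std::nullopt; // not a number, or trailing junk
    }
    return v;
}
```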

@travisdowns (Member) left a comment

Made it to the end, some comments & suggested changes.

Review threads (resolved):

  • src/v/resource_mgmt/memory_sampling.h
  • src/v/resource_mgmt/memory_sampling.cc
  • tests/rptest/tests/memory_sampling_test.py
Comment on lines +310 to +311
```
if (_memory_sampling_service.local_is_initialized()) {
    _memory_sampling_service.local().notify_of_reclaim();
}
```
Member

Not a blocker at all for this PR, but just wanted to mention:

The batch cache is the only subsystem that has a registered reclaimer right now. Ideally we could register a pseudo-reclaimer at the top level (e.g. in resource_mgmt or some other convenient location) and have that call out to sub-systems that are capable of reclaiming, such as the batch cache, and in your case, receiving an upcall. That would reduce coupling a lot (e.g. not having to thread the sampling service down into the storage sub-system).
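The suggested decoupling could look roughly like this; `reclaim_registry` and both method names are illustrative only, not an API that exists in the tree:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of the pseudo-reclaimer idea: one top-level
// registry that the real reclaimer notifies, fanning out to every
// interested sub-system, so e.g. the batch cache never needs to know
// about the sampling service directly.
class reclaim_registry {
public:
    void register_listener(std::function<void()> cb) {
        _listeners.push_back(std::move(cb));
    }
    // Called by the one real reclaimer; fans out to every listener.
    void notify_reclaim() {
        for (auto& cb : _listeners) {
            cb();
        }
    }
private:
    std::vector<std::function<void()>> _listeners;
};
```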

Review thread (resolved): src/v/resource_mgmt/memory_sampling.cc
```
    _feature_table.stop().get();
}
model::offset _base_offset{0};
ss::logger _test_logger{"foreign-test-logger"};
```
Member

Why is the memory sampling service showing up in these apparently unrelated tests?

Does the fixture somehow require it?

Member Author

Yes, exactly: all these tests use storage::api or the log_manager, which own the batch cache.

Member

It's a bit unfortunate: additional complexity for all these additional services and the tests that use them. I guess this was because of my suggestion to have the reclaim process trigger the LWM logger, right?

Probably something like @dotnwat's suggestion of a reclaim API, which could decouple these two, would be a good way forward: you could register a reclaim listener that is notified when reclaim happens, without every reclaimer having to know about every listener.

I don't think we need to do this in this series though.

Member Author

I guess this was because of my suggestion to have the reclaim process trigger the LWM logger, right?

Yeah, one of the reasons I did the timer-based approach before was that I wanted to avoid all these changes. Though that's really just being a bit lazy, and the notify-based approach certainly seems better. The reclaim API suggestion would certainly help.

@travisdowns (Member) left a comment

The only remaining issue is regarding the size/count fields we expose.


dotnwat commented May 30, 2023

ping @StephanDollberg looks like some merge conflicts have cropped up

@StephanDollberg (Member Author)

ping @StephanDollberg looks like some merge conflicts have cropped up

Thanks yeah, I have resolved those locally (it's only small stuff). I will do a rebase plus squash once Travis has given the go-ahead.

travisdowns previously approved these changes May 31, 2023
@travisdowns (Member) left a comment

Thanks for all the changes. This looks good to go.


dotnwat commented Jun 5, 2023

@StephanDollberg looks like some merge conflicts here

@StephanDollberg (Member Author)

Squashed and rebased

@StephanDollberg StephanDollberg force-pushed the support-mem-profiling branch 2 times, most recently from 0e55d36 to a025936 Compare June 10, 2023 08:57

StephanDollberg commented Jun 10, 2023

The above are just minor test fixups. The RP + seastar changes build is green now (minus some flaky ducktape tests and the public build, which will need the SHA update): https://buildkite.com/redpanda/vtools/builds/8044#_

travisdowns previously approved these changes Jun 12, 2023
@travisdowns (Member) left a comment

LGTM, though the seastar-side changes need to go in first, presumably.

@StephanDollberg StephanDollberg force-pushed the support-mem-profiling branch from a025936 to 6b4fd65 Compare June 14, 2023 09:30
@StephanDollberg StephanDollberg requested a review from a team as a code owner June 14, 2023 09:30
@StephanDollberg StephanDollberg requested review from gousteris and removed request for a team June 14, 2023 09:30
@StephanDollberg (Member Author)

Rebased and updated the OSS seastar pointer.

Adds a new memory_sampling service.

The service enables seastar sampled memory profiling.

It hooks into the batch cache, which calls back into the service whenever
a reclaim is run. The service then prints the top-n allocation sites once
we reach the thresholds of 20% and 10% of available memory left.
Print top-10 allocation sites on OOM.

Uses the seastar on-OOM hook to add additional output when we fail to
allocate.

The output looks similar to what we print when reaching the low
watermarks.
Adds an endpoint that collects the current sampled memory profile
from all shards and returns it to the caller.

For now the stack is just stringified but we could make this a proper
json structure all the way down. Keeping it simple for now.

The following (or some form thereof) can be used to get a flamegraph:

```
curl localhost:9644/v1/debug/sampled_memory_profile?shard=3 \
    | jq -r .[0].profile \
    | ./symbolize_mem_dump.py /home/stephan/build/redpanda/vbuild/release/clang/bin/redpanda \
    | flamegraph.pl - > flamegraph.svg
```
Adds a config flag to enable/disable the memory sampling.

Flag can be changed without a restart. However, turning it on after the
fact is not that useful because we will be missing lots of live
allocations.

It's still useful if profiling needs to be turned off for whatever
reason.
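A restart-free toggle of this kind amounts to binding a boolean property to a start/stop action. A minimal sketch, where `toggle_binding` and its members are illustrative and not Redpanda's actual config API:

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: react to a runtime-changeable boolean flag by
// firing a callback only on actual transitions (the callback would
// enable or disable sampling in the real system).
struct toggle_binding {
    bool enabled = false;
    std::function<void(bool)> on_change;

    void set(bool v) {
        if (v == enabled) {
            return; // ignore redundant updates from repeated config pushes
        }
        enabled = v;
        if (on_change) {
            on_change(v); // e.g. start or stop sampling here
        }
    }
};
```

Firing only on transitions keeps repeated writes of the same value from restarting the sampler and losing the live-allocation history mentioned above.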
Match vtools seastar pointer and pull in the memory sampling changes.
@StephanDollberg StephanDollberg force-pushed the support-mem-profiling branch from 6b4fd65 to ec26919 Compare June 14, 2023 21:54
@travisdowns (Member) left a comment

Re-approved after rebase to fix conflicts.

@piyushredpanda piyushredpanda merged commit 01ac116 into redpanda-data:dev Jun 15, 2023