
Reduce the amount of IO that LedgerCleanupService performs #29239

Merged · 1 commit merged on Jan 23, 2023

Conversation

steviez (Contributor) commented on Dec 13, 2022

Problem

Currently, the cleanup service counts the number of shreds in the database by iterating the entire SlotMeta column and reading the number of received shreds for each slot. This gives us a fairly accurate count at the expense of performing a good amount of IO.
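
For illustration, here is a rough sketch of what that per-slot counting looks like. This is simplified and hedged, not the service's exact code; it assumes the `Blockstore::slot_meta_iterator` API and the `SlotMeta::received` field from `solana-ledger`:

```rust
use solana_ledger::blockstore::Blockstore;

// Hedged sketch, not the service's exact code: walk every SlotMeta entry
// and add up the shreds each slot reports as received. Every cleanup pass
// ends up reading the whole SlotMeta column from disk.
fn count_shreds_via_slot_meta(blockstore: &Blockstore) -> u64 {
    blockstore
        .slot_meta_iterator(0) // start at slot 0, i.e. iterate the full column
        .expect("failed to create SlotMeta iterator")
        .map(|(_slot, slot_meta)| slot_meta.received)
        .sum()
}
```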

Summary of Changes

Instead of counting the individual slots, use the live_files() rust-rocksdb entrypoint that we expose in Blockstore. This API allows us to get the number of entries (shreds) in the data shred column family by reading file metadata. This is much more efficient from an IO perspective.

Fixes #28403
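
For contrast, a minimal sketch of the metadata-based counting, written against rust-rocksdb directly rather than the Blockstore wrapper; the "data_shred" column family name and the overall shape are assumptions here, not the PR's exact code:

```rust
use rocksdb::{Error, DB};

// Hedged sketch, not the PR's exact code: sum the per-SST entry counts that
// RocksDB tracks in its file metadata for the data shred column family.
// No keys or values are read, so the IO cost is negligible.
fn estimate_data_shred_count(db: &DB) -> Result<u64, Error> {
    let count = db
        .live_files()? // metadata for every live SST file
        .iter()
        .filter(|file| file.column_family_name == "data_shred")
        .map(|file| file.num_entries)
        .sum();
    Ok(count)
}
```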

steviez (Contributor, Author) commented on Dec 13, 2022

I have a node running right now. It was previously running the tip of master and got to a "full" ledger, such that LedgerCleanupService actually needed to do something. At the moment, I have about 1 day of runtime with the new change; the graphs below show 4 days of runtime total, so the new behavior kicks in at Dec 12, 10:00.

The first graph here shows the returned shred count (the old approach counted via SlotMeta; the new one uses metadata from the RocksDB API). The new reported number has a little more variation; I think some of this can be attributed to the API returning data about SSTs only (and not the memtables). This is just an observation, not a problem in my eyes.
[graph: reported shred count, old SlotMeta counting vs. new RocksDB file metadata]

The second graph shows the disk utilization before (pink) and after (blue) cleanup runs. There is a little bit of a dip around the crossover point, but there is also some variation in the graph before I started using the new behavior. The y-axis scale is also pretty small, so we're talking about only a couple of GB here. As the comment in the code calls out, the new behavior is marginally more aggressive in cleaning, so I would expect a vertical offset of a couple of GB.
[graph: disk utilization before (pink) and after (blue) cleanup runs]

@steviez steviez requested a review from yhchiang-sol December 13, 2022 10:16
@steviez steviez marked this pull request as ready for review December 13, 2022 10:16
@steviez steviez force-pushed the lcs_file_meta branch 2 times, most recently from f9be872 to 3cb3baf on December 16, 2022 07:09
@github-actions github-actions bot added the stale label on Jan 2, 2023
@steviez steviez removed the stale label on Jan 3, 2023
@steviez steviez force-pushed the lcs_file_meta branch 2 times, most recently from 83b7d6f to 35533ef on January 17, 2023 00:20
yhchiang-sol (Contributor) left a comment

The PR looks good! Only minor comments.

Btw, do you happen to have updated numbers from your experiments? Or are the previous numbers already from the updated PR?

core/src/ledger_cleanup_service.rs (4 resolved review threads)
steviez (Contributor, Author) commented on Jan 18, 2023

Btw, do you happen to have updated numbers from your experiments? Or are the previous numbers already from the updated PR?

I was planning on re-running after rebasing on latest, utilizing the nodes I had been using for testing the SlotMeta change. I consider getting another data point a hard requirement before shipping.

It'd be nice to get a graph with I/O isolated to the blockstore (like we did on the GCP nodes when we put everything on separate drives), but with the dev servers having everything on one drive, it is harder to get a clean measurement. I don't consider this a hard requirement for shipping this PR though; from inspection, it is very obvious that we are saving IO by not reading the SlotMeta column.

yhchiang-sol previously approved these changes on Jan 18, 2023
yhchiang-sol (Contributor) left a comment
The PR looks good! Thanks for adding tests! (The test is especially tricky to write since there are things that are still in memtables; a manual flush in the tests is a good workaround.)

I don't consider this a hard requirement for shipping this PR though; from inspection, it is very obvious that we are saving IO by not reading the SlotMeta column.

Definitely not a hard requirement, but it would be great if we could explicitly say how much this PR improves things.
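
To illustrate the flush-before-count trick mentioned above, here is a minimal standalone sketch using plain rust-rocksdb (plus the tempfile crate). It is not the PR's actual test, just the general idea that live_files() only sees flushed SSTs, not the memtable:

```rust
use rocksdb::{Options, DB};
use tempfile::TempDir;

fn main() {
    let dir = TempDir::new().unwrap();
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, dir.path()).unwrap();

    // Write some keys; they land in the memtable, not in any SST file yet,
    // so live_files() reports nothing for them.
    for i in 0u64..100 {
        db.put(i.to_be_bytes(), b"payload").unwrap();
    }
    assert!(db.live_files().unwrap().is_empty());

    // Force the memtable into an SST file; now the file metadata reflects
    // the written entries.
    db.flush().unwrap();
    let entries: u64 = db
        .live_files()
        .unwrap()
        .iter()
        .map(|file| file.num_entries)
        .sum();
    assert_eq!(entries, 100);
}
```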

steviez (Contributor, Author) commented on Jan 18, 2023

Definitely not a hard requirement, but it would be great if we could explicitly say how much this PR improves things.

Agreed. We do know how large the SlotMeta column will be, and we do know how often the scans occur, so we can quantify how much and how often we would be reading. I did the math for that in the GH issue that this PR will close.
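
The shape of that back-of-the-envelope math looks roughly like the sketch below; the inputs are placeholders, not the actual figures from the linked issue:

```rust
// Hedged sketch: how much SlotMeta data the old approach reads over a day.
// All three inputs are placeholders for the numbers worked out in the issue.
fn slot_meta_bytes_scanned_per_day(
    slots_retained: u64,      // slots kept before cleanup kicks in
    avg_slot_meta_bytes: u64, // serialized size of one SlotMeta entry
    scans_per_day: u64,       // how often LedgerCleanupService does its count
) -> u64 {
    slots_retained * avg_slot_meta_bytes * scans_per_day
}
```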

@mergify mergify bot dismissed yhchiang-sol’s stale review January 20, 2023 21:41

Pull request has been modified.

Currently, the cleanup service counts the number of shreds in the
database by iterating the entire SlotMeta column and reading the number
of received shreds for each slot. This gives us a fairly accurate count
at the expense of performing a good amount of IO.

Instead of counting the individual slots, use the live_files()
rust-rocksdb entrypoint that we expose in Blockstore. This API allows us
to get the number of entries (shreds) in the data shred column family by
reading file metadata. This is much more efficient from an IO perspective.
steviez (Contributor, Author) commented on Jan 23, 2023

I had my test node running over the weekend. I let it get the ledger up to capacity on the tip of master, and then updated the node to use this branch on 2023-01-22 @ 19:00 UTC. Here is the disk usage for the weekend; it shows the ramp-up as well as the fact that total disk utilization is pretty flat before/after the change:
[graph: total disk usage over the weekend, before/after switching to this branch]

There is just a slight increase in total space over this 24-hour period; however, that is a function of block size, as a control node (the purple trace) shows a similar upward trend:
[graph: total disk usage, test node vs. control node (purple)]

The total number of shreds found before cleanup is slightly noisier, which is an expected result of using a much cheaper estimate instead of an exact count of shreds:
[graph: total shreds found before cleanup]

However, the variation observed here is an extra 250k shreds, which is 0.125% of the 200M-shred default ledger size. The inconsequential nature of this variation shows in the previous graphs, where our total ledger size is still pretty similar.

So, the data still looks good! I improved the error handling slightly (for a case that should never exist, where highest_slot < lowest_slot from the Blockstore functions), and additionally promoted the warns I added to errors. I'm restarting my validator and letting this run through CI again; these changes are minor enough that I'll push if the validator looks good with the change reflected.
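
For reference, a hypothetical sketch of the kind of guard described above; the names and structure are illustrative, not the PR's actual code:

```rust
use log::error;

// Hedged sketch: treat an inconsistent answer (highest_slot < lowest_slot)
// as an empty span instead of underflowing, and shout about it in the logs.
fn slot_span(lowest_slot: u64, highest_slot: u64) -> u64 {
    match highest_slot.checked_sub(lowest_slot) {
        Some(diff) => diff + 1,
        None => {
            error!(
                "highest_slot {highest_slot} is less than lowest_slot {lowest_slot}; \
                 treating the slot span as empty"
            );
            0
        }
    }
}
```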

Successfully merging this pull request may close these issues.

LedgerCleanupService reads the entire SlotMeta column with frequency