Remove daily rewrite/compaction of each ledger file #27571
The results on mainnet-beta indicate that, with the default --limit-ledger-size, we can still keep the ledger disk size within 500GB without the daily rewrite/compaction. This means we can rely solely on #26651 to free up ledger disk space!
Have you baked this any? I think it is likely good-to-go as is, but probably wouldn't hurt to get some runtime on it.
I had one concern about column families with small kv-pair size "accumulating" given how small they are compared to our target SST file-size. However, I see from metrics that these column families are getting flushed (likely due to your insight from DMs about total WAL size getting hit). This means the SSTs are broken up with reasonable ranges such that they'll get picked up by delete_file_in_range_cf(). Thus, no accumulation and not a problem! (Even if there was accumulation, it probably would have been fairly small, and we probably could have lived with it).
I think the PR should be good to go. I am trying to collect more data points to show its performance benefit, although the ledger disk size already shows that this change is safe to ship, since we can still keep the ledger size small with just #26651.
I had a non-GCE node that was otherwise idle; I kicked it off with the tip of master + this PR. Let's let it run over the weekend and push once the ledger has hit capacity; this should probably happen by Monday. I was trying to grab a second (comparable machine) to run master without the
My experimental results on mainnet-beta comparing this PR against master show that removing the daily rewrite (this PR, orange line) reduces ledger disk write-bytes-per-second by ~17% compared to master (green line) once the validator has run for longer than a day, the point at which the daily compaction would trigger without this PR.
I have another node that collects data points for this. I probably need a few more days to collect more data. So far the trend starts only after a day, when the daily compaction triggers on master.
Aghh, yep. I always forget the account_index. I should separate it out next time.
Here are more data points from my previous run with only this PR (unfortunately the master instance died during that time, so I only have data points for this PR). So the current observation (and some of my predictions) from both data points is that:
After discussing the experiments offline with @steviez, it looks like they are affected by the recent issue #27740, which occasionally prevents the validator from making new roots. This might explain why it has been difficult for any of my validators in the experiments to stay healthy for more than 5 days. Below is an example error.
Rebased to include #27752, which fixes the issue.
Here are the new data points! They show the effectiveness of this PR in reducing disk-write-bytes-per-second, and the results also match my previous hypotheses. The instance with this PR (the green one) shows ~10% to 15% lower disk-write-bytes-per-second. Note that the actual percentage might also change depending on the
The disk space usage of the two instances is also the same, indicating that delete_files_in_range() alone, without the daily compaction, is effective enough to free up ledger disk space!
Just leaving ourselves a note in case we look in the future: this PR reverts us back to allowing RocksDB to choose a default value for when to go through compaction. That default value is still 30 days (as it was at the time that the comment being deleted in this PR was written).
Thanks @steviez for the additional comment.
Just adding some extra notes here. Since delete_files_in_range() from #26651 will purge any file older than --limit-ledger-size, unless --limit-ledger-size is configured to keep ledger data for more than 30 days, we will not see any compactions triggered by the 30-day policy (but we will still have regular compactions triggered when a level is full or its size hits the compaction trigger).
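The interaction between the retention window and the 30-day default can be sketched as a simple comparison. This is only an illustration of the reasoning above: `periodic_compaction_can_trigger` is a hypothetical helper, and expressing --limit-ledger-size as a time duration is an assumption made for the example (the flag itself is not a time value).

```rust
use std::time::Duration;

/// RocksDB's default periodic-compaction age as described above (30 days).
const DEFAULT_PERIODIC_COMPACTION: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// Hypothetical helper: `retention` is roughly how long ledger data survives
/// before file-range deletion purges it. A file can only reach the periodic
/// compaction age if it is retained for longer than that age.
fn periodic_compaction_can_trigger(retention: Duration) -> bool {
    retention > DEFAULT_PERIODIC_COMPACTION
}

fn main() {
    let two_days = Duration::from_secs(2 * 24 * 60 * 60);
    let forty_days = Duration::from_secs(40 * 24 * 60 * 60);
    println!("{}", periodic_compaction_can_trigger(two_days));
    println!("{}", periodic_compaction_can_trigger(forty_days));
}
```

Level-based compactions are unaffected by this check; they still run whenever a level fills up.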
Periodic compaction was previously disabled on all columns in #27571 in favor of the delete_file_in_range() approach that #26651 introduced. However, several columns still rely on periodic compaction to reclaim storage, namely the TransactionStatus and AddressSignatures columns, as these columns contain a slot in their key, but as a non-primary index. The result of periodic compaction not running on these columns is that no storage space is being reclaimed from them. This is obviously bad and would lead to a node eventually running out of storage space and crashing. This PR reintroduces periodic compaction, but only for the columns that need it. (cherry picked from commit d73fa1b)
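A toy model helps show why slot-based file deletion cannot reclaim space when the slot is not the leading key component. The sketch below is not the blockstore code: the two-field keys, the `droppable_files` helper, and the fixed "file" size are all hypothetical simplifications. It sorts keys the way an SST would, splits them into files, and counts the files that contain only slots below a cutoff.

```rust
/// Sorts keys, splits them into fixed-size "files", and counts the files in
/// which every key belongs to a slot below `cutoff` (droppable as a whole).
fn droppable_files<K: Ord>(
    mut keys: Vec<K>,
    slot_of: fn(&K) -> u64,
    file_size: usize,
    cutoff: u64,
) -> usize {
    keys.sort();
    keys.chunks(file_size)
        .filter(|file| file.iter().all(|k| slot_of(k) < cutoff))
        .count()
}

fn main() {
    let slots = 0u64..8;
    // Slot as the leading key component: (slot, id).
    let slot_first: Vec<(u64, u64)> = slots.clone().map(|s| (s, (s * 5) % 8)).collect();
    // Slot as a trailing key component: (id, slot). The id is uncorrelated
    // with the slot, so old and new slots interleave after sorting.
    let slot_last: Vec<(u64, u64)> = slots.map(|s| ((s * 5) % 8, s)).collect();

    // With the slot first, the two oldest files hold only slots below 4.
    println!("{}", droppable_files(slot_first, |k: &(u64, u64)| k.0, 2, 4)); // prints 2
    // With the slot last, every file mixes old and new slots.
    println!("{}", droppable_files(slot_last, |k: &(u64, u64)| k.1, 2, 4)); // prints 0
}
```

In the second layout no file is ever entirely stale, which is why such columns need compaction (with the compaction filter) to drop old entries.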
Problem
Before #26651, our LedgerCleanupService needed RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which file to pick,
we relied on the 1-day compaction period, which forces each file to be compacted
once a day, so that we could reclaim ledger disk space in time. The downside is that
each ledger file is rewritten once per day.
Summary of Changes
As #26651 makes LedgerCleanupService actively delete those files whose entire slot-range
is older than both --limit-ledger-size and the current root, we can remove the 1-day compaction
period and get rid of the daily ledger file rewrite.
The results on mainnet-beta show that this PR reduces write-bytes-per-second by ~20%
and read-bytes-per-second by ~50% on the ledger disk.
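The deletion rule in the summary can be sketched as follows. This is a minimal illustration, not the code from #26651: the `SstFile` struct, its fields, and the `deletable_files` helper are hypothetical, and `lowest_retained_slot` stands in for the cutoff implied by --limit-ledger-size.

```rust
/// Hypothetical stand-in for one ledger SST file and the slot range it covers.
#[derive(Debug)]
struct SstFile {
    name: &'static str,
    min_slot: u64,
    max_slot: u64,
}

/// Returns the names of files whose entire slot range is older than both the
/// retention cutoff and the current root. Files that straddle the boundary
/// are kept and left to regular compaction.
fn deletable_files(files: &[SstFile], lowest_retained_slot: u64, root: u64) -> Vec<&'static str> {
    let cutoff = lowest_retained_slot.min(root);
    files
        .iter()
        .filter(|f| f.min_slot <= f.max_slot && f.max_slot < cutoff)
        .map(|f| f.name)
        .collect()
}

fn main() {
    let files = [
        SstFile { name: "a.sst", min_slot: 0, max_slot: 99 },
        SstFile { name: "b.sst", min_slot: 100, max_slot: 250 },
        SstFile { name: "c.sst", min_slot: 251, max_slot: 400 },
    ];
    // Retention cutoff at slot 200, root at slot 300: only a.sst is fully stale.
    println!("{:?}", deletable_files(&files, 200, 300)); // prints ["a.sst"]
}
```

Because whole files are dropped rather than rewritten, this path avoids the daily rewrite cost that the 1-day compaction period imposed.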