Remove daily rewrite/compaction of each ledger file #27571

Merged
merged 1 commit into solana-labs:master on Sep 16, 2022
Conversation

yhchiang-sol
Contributor

@yhchiang-sol yhchiang-sol commented Sep 2, 2022

Problem

Before #26651, our LedgerCleanupService relied on RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which files to pick,
we relied on a 1-day periodic compaction period so that each file is forced through
compaction once a day and ledger disk space is reclaimed in time. The downside is that
each ledger file is rewritten once per day.
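
For context, a minimal sketch of how such a forced 1-day compaction period can be configured with the `rocksdb` crate is shown below; the function name and values are illustrative assumptions, not the validator's actual code.

```rust
// Illustrative sketch only (not the validator's actual code): roughly how a
// forced 1-day compaction period can be set with the `rocksdb` crate, which is
// what caused every SST file to be rewritten about once per day.
use rocksdb::{Options, DB};

fn open_with_daily_compaction(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Force each file through compaction roughly every 24 hours so the custom
    // CompactionFilter gets a chance to drop expired ledger data.
    opts.set_periodic_compaction_seconds(60 * 60 * 24);
    DB::open(&opts, path)
}
```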

Summary of Changes

Since #26651 makes LedgerCleanupService actively delete files whose entire slot range
is older than both --limit-ledger-size and the current root, we can remove the 1-day compaction
period and get rid of the daily ledger-file rewrite.
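
A hedged sketch of the #26651-style file deletion follows; the function name, column-family handling, and key layout are assumptions for illustration, not the actual LedgerCleanupService implementation.

```rust
// Minimal sketch of the #26651-style cleanup, assuming keys are prefixed with
// a big-endian slot number; the function and column names here are
// placeholders, not the actual LedgerCleanupService code.
use rocksdb::DB;

fn purge_files_below_slot(db: &DB, cf_name: &str, oldest_slot_to_keep: u64) -> Result<(), rocksdb::Error> {
    let cf = db.cf_handle(cf_name).expect("column family must exist");
    // [0, oldest_slot_to_keep) is a contiguous key range when keys start with
    // a big-endian slot, so whole SST files below the cutoff can be dropped
    // outright -- no key scanning and no rewrite, just file metadata checks.
    db.delete_file_in_range_cf(cf, 0u64.to_be_bytes(), oldest_slot_to_keep.to_be_bytes())
}
```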

Results on mainnet-beta show that this PR reduces write-bytes-per-second by ~20%
and read-bytes-per-second by ~50% on the ledger disk.

@yhchiang-sol
Contributor Author

The results on mainnet-beta indicate that we are still able to keep the ledger disk size within 500GB with the default --limit-ledger-size, even without the daily rewrite/compaction. This means we can rely solely on #26651 to free up ledger disk space!

Screen Shot 2022-09-08 at 9 33 44 PM

@yhchiang-sol yhchiang-sol requested a review from steviez September 9, 2022 04:38
Contributor

@steviez steviez left a comment

Have you baked this any? I think it is likely good-to-go as is, but it probably wouldn't hurt to get some runtime on it.

I had one concern about column families with small kv-pair sizes "accumulating", given how small they are compared to our target SST file size. However, I see from metrics that these column families are getting flushed (likely due to your insight from DMs about the total WAL size limit getting hit). This means the SSTs are broken up with reasonable ranges such that they'll get picked up by delete_file_in_range_cf(). Thus, no accumulation and not a problem! (Even if there were accumulation, it probably would have been fairly small, and we probably could have lived with it.)
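
As a hedged illustration of how one might verify this, the sketch below lists live SST files and their key ranges via the `rocksdb` crate's live_files(); the LiveFile field names are assumed to match recent crate versions and may differ in yours.

```rust
// Hedged sketch: dump live SST files and their key ranges to sanity-check
// that small column families still get split into files with narrow ranges
// that delete_file_in_range_cf() can drop.
use rocksdb::DB;

fn dump_sst_ranges(db: &DB) -> Result<(), rocksdb::Error> {
    for f in db.live_files()? {
        // Files covering narrow slot ranges will be eligible for deletion even
        // if the files themselves are small.
        println!(
            "cf={} file={} level={} size={}B keys={:?}..{:?}",
            f.column_family_name, f.name, f.level, f.size, f.start_key, f.end_key
        );
    }
    Ok(())
}
```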

@yhchiang-sol
Contributor Author

Have you baked this any? I think it is likely good-to-go as is, but it probably wouldn't hurt to get some runtime on it.

I think the PR should be good to go. I am trying to collect more data points to show its performance benefit, although the ledger disk size already shows this change is safe to ship, since we can still keep the ledger size small with just #26651.

@steviez
Contributor

steviez commented Sep 10, 2022

I had a non-GCE node that was otherwise idle; I kicked it off with the tip of master plus this PR. Let's let it run over the weekend and push once the ledger has hit capacity; this should probably happen by Monday. I was trying to grab a second (comparable) machine to run master without the delete_files_in_range() change so that we could have a direct side-by-side, but couldn't find one.

although the ledger disk size already shows this change is safe to ship, since we can still keep the ledger size small with just #26651.

Technically, any testing you did for that PR would still have had periodic compactions at a more frequent interval, right? Again, I'm probably being paranoid, but the node is already running, so let's just let it soak over the weekend if that is good with you.

@yhchiang-sol
Contributor Author

The experimental results on my side, comparing this PR and master on mainnet-beta, show that removing the daily rewrite (with this PR, orange line) reduces ledger disk write-bytes-per-second by ~17% compared to master (green line) once the validator has run longer than a day, which is when the daily compaction would trigger without this PR.

Screen Shot 2022-09-11 at 9 46 28 PM

Screen Shot 2022-09-11 at 9 44 19 PM

@steviez
Contributor

steviez commented Sep 12, 2022

The experimental results on my side, comparing this PR and master on mainnet-beta, show that removing the daily rewrite (with this PR, orange line) reduces ledger disk write-bytes-per-second by ~17% compared to master (green line) once the validator has run longer than a day, which is when the daily compaction would trigger without this PR.

Screen Shot 2022-09-11 at 9 46 28 PM Screen Shot 2022-09-11 at 9 44 19 PM

My node ended up dying over the weekend from unrelated issues; but nice. For what duration was the lower rate sustained? If it is still going, do you still see that trend?

Also, assuming you're just looking at the disk rate for the entire disk, there is likely some I/O for accounts_index in there. I don't think there is a need to separate it out; just pointing out that, percentage-wise, the Rocks-specific I/O might have dropped by a higher percentage.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 12, 2022

My node ended up dying over the weekend from unrelated issues; but nice. For what duration was the lower rate sustained? If it is still going, do you still see that trend?

I have another node that collects data points for this. It will probably need a few more days to gather more data. So far the trend only starts after a day, when the daily compaction triggers on master.

Also, assuming you're just looking at the disk rate for the entire disk, there is likely some I/O for accounts_index in there. I don't think there is a need to separate it out; just pointing out that, percentage-wise, the Rocks-specific I/O might have dropped by a higher percentage.

Ahh, yep. I always forget about accounts_index. I should separate it out next time.

@yhchiang-sol
Contributor Author

Here are more data points from my previous run with only this PR (unfortunately, the master node died somehow during that time, so I only have data points for this PR).

So the current observations (and some of my predictions) from both sets of data points are:

  • within ~2 days:

    • this PR: no increase in write-bytes-per-second.
    • master: slightly higher write-bytes-per-second after a day.
  • after ~2 days:

    • this PR: write-bytes-per-second starts to increase as delete_files_in_range() kicks in.
    • master (no data points yet, but predicted): delete_files_in_range() kicks in as well, plus ~2 days' worth of SST files going through the daily rewrite/compaction. Collecting this will probably take another week, as I found one of my GCE nodes isn't stable and I already killed the instance.

Screen Shot 2022-09-12 at 2 51 30 PM

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 13, 2022

Discussed offline with @steviez about the experiments. It looks like the experiment was affected by the recent issue #27740, which makes the validator occasionally unable to make new roots; this might explain why it has been difficult for all of my validators in the experiments to run healthy for more than 5 days. Below is an example error.

[2022-09-13T17:43:55.707521368Z WARN  solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot 150355423 with hash Hzbuefkt7NtzLMCXpzzZKY8HAxLNeiTvBZJVkPvBq7ue, but our version has hash 2fL2Vh9rr6SikQPDPxeJvmbGVU2Mms5kRb7XfDrmZ5Yc

@yhchiang-sol
Contributor Author

Rebased to include #27752, which fixes the issue.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 16, 2022

Here are the new data points! They show the effectiveness of this PR in reducing disk-write-bytes-per-second, and the results match my previous hypotheses. The instance with this PR (the green one) shows ~10% to 15% lower disk-write-bytes-per-second. Note that the actual percentage may also change depending on the --limit-ledger-size setting (a larger --limit-ledger-size gives a bigger win).

  • within ~2 days:
    • this PR: no increase in write-bytes-per-second.
    • master: slightly higher write-bytes-per-second after a day.
  • after ~2 days:
    • this PR: write-bytes-per-second starts to increase as delete_files_in_range() kicks in.
    • master (no data points yet, but predicted): delete_files_in_range() kicks in as well, plus ~2 days' worth of SST files going through the daily rewrite/compaction. Collecting this will probably take another week, as I found one of my GCE nodes isn't stable and I already killed the instance.

Screen Shot 2022-09-15 at 8 17 59 PM

Disk space usage of the two instances is also the same, indicating that delete_files_in_range() alone, without the daily compaction, is effective enough to free up ledger disk space!

Screen Shot 2022-09-15 at 8 25 54 PM

Contributor

@steviez steviez left a comment

LGTM! I think it is also worth calling attention to the fact that the read rate shows improvements as well, since the compaction filter involved scanning all keys, whereas delete_files_in_range() only needs to scan the metadata for each SST.

[screenshot]

@yhchiang-sol yhchiang-sol merged commit 9831e4d into solana-labs:master Sep 16, 2022
@steviez
Contributor

steviez commented Sep 16, 2022

Just leaving ourselves a note in case we look back in the future: this PR reverts us to letting RocksDB choose the default value for when files go through periodic compaction. That default value is still 30 days (as it was at the time the comment being deleted in this PR was written):
https://github.com/facebook/rocksdb/blob/0f91c72adc977c3895f8320b13ae4ef2b8633756/include/rocksdb/advanced_options.h#L854-L865

@yhchiang-sol
Contributor Author

Thanks @steviez for the additional comment.

Just leaving ourselves a note in case we look back in the future: this PR reverts us to letting RocksDB choose the default value for when files go through periodic compaction. That default value is still 30 days

Just adding some extra notes here. Since delete_files_in_range() from #26651 purges any file older than --limit-ledger-size, we will not see any compactions triggered by the 30-day policy unless --limit-ledger-size is configured to keep ledger data for more than 30 days (but we will still have regular compactions triggered when a level is full or its size hits the compaction trigger).
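
To make the resulting behavior concrete, here is an illustrative sketch under assumed values (not the validator's actual settings): periodic compaction is left at RocksDB's default, while size/level-triggered compactions keep running.

```rust
// Illustrative only (assumed values, not the validator's actual settings):
// with the 1-day override removed, periodic compaction falls back to RocksDB's
// default (~30 days), while regular size/level-triggered compactions still run.
use rocksdb::Options;

fn options_without_daily_rewrite() -> Options {
    let mut opts = Options::default();
    // Not calling set_periodic_compaction_seconds() keeps RocksDB's default.
    // Regular compactions are still driven by triggers such as this one, once
    // level 0 accumulates enough files.
    opts.set_level_zero_file_num_compaction_trigger(4);
    opts
}
```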

steviez added a commit to steviez/solana that referenced this pull request Jul 19, 2023
Periodic compaction was previously disabled on all columns in solana-labs#27571 in
favor of the delete_file_in_range() approach that solana-labs#26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but not as the primary index.

The result of periodic compaction not running on these columns is that
no storage space was being reclaimed from them. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
steviez added a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
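
A minimal sketch of that follow-up approach, with hypothetical column names and option plumbing (not the follow-up PR's actual code), might look like this:

```rust
// Minimal sketch, assuming hypothetical column names and per-column option
// plumbing: re-enable periodic compaction only for columns whose keys are not
// primarily indexed by slot, since slot-range file deletion cannot reclaim them.
use rocksdb::Options;

const PERIODIC_COMPACTION_SECONDS: u64 = 60 * 60 * 24; // once per day

fn cf_options(cf_name: &str) -> Options {
    let mut opts = Options::default();
    if matches!(cf_name, "transaction_status" | "address_signatures") {
        // Stale entries in these columns must still be dropped via compaction.
        opts.set_periodic_compaction_seconds(PERIODIC_COMPACTION_SECONDS);
    }
    opts
}
```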
mergify bot pushed a commit that referenced this pull request Jul 20, 2023, with the same message
(cherry picked from commit d73fa1b)