Remove daily rewrite/compaction of each ledger file #27571

Merged
merged 1 commit into solana-labs:master on Sep 16, 2022
Conversation

yhchiang-sol
Contributor

@yhchiang-sol yhchiang-sol commented Sep 2, 2022

Problem

Before #26651, our LedgerCleanupService relied on RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which files to pick,
we relied on a 1-day periodic compaction period so that each file is forced through
compaction once a day and ledger disk space is reclaimed in time. The downside is that
each ledger file is rewritten once per day.
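
For context, a minimal sketch of how such a forced 1-day compaction period can be configured with the `rocksdb` crate is shown below; the function name and values are illustrative assumptions, not the validator's actual code.

```rust
// Illustrative sketch only (not the validator's actual code): roughly how a
// forced 1-day compaction period can be set with the `rocksdb` crate, which is
// what caused every SST file to be rewritten about once per day.
use rocksdb::{Options, DB};

fn open_with_daily_compaction(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Force each file through compaction roughly every 24 hours so the custom
    // CompactionFilter gets a chance to drop expired ledger data.
    opts.set_periodic_compaction_seconds(60 * 60 * 24);
    DB::open(&opts, path)
}
```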

Summary of Changes

Since #26651 makes LedgerCleanupService actively delete files whose entire slot range
is older than both --limit-ledger-size and the current root, we can remove the 1-day compaction
period and get rid of the daily ledger-file rewrite.
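
A hedged sketch of the #26651-style file deletion follows; the function name, column-family handling, and key layout are assumptions for illustration, not the actual LedgerCleanupService implementation.

```rust
// Minimal sketch of the #26651-style cleanup, assuming keys are prefixed with
// a big-endian slot number; the function and column names here are
// placeholders, not the actual LedgerCleanupService code.
use rocksdb::DB;

fn purge_files_below_slot(db: &DB, cf_name: &str, oldest_slot_to_keep: u64) -> Result<(), rocksdb::Error> {
    let cf = db.cf_handle(cf_name).expect("column family must exist");
    // [0, oldest_slot_to_keep) is a contiguous key range when keys start with
    // a big-endian slot, so whole SST files below the cutoff can be dropped
    // outright -- no key scanning and no rewrite, just file metadata checks.
    db.delete_file_in_range_cf(cf, 0u64.to_be_bytes(), oldest_slot_to_keep.to_be_bytes())
}
```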

Results on mainnet-beta show that this PR reduces write-bytes-per-second by ~20%
and read-bytes-per-second by ~50% on the ledger disk.

@yhchiang-sol
Contributor Author

The results on mainnet-beta indicate that we are still able to keep the ledger disk size within 500GB with the default --limit-ledger-size, even without the daily rewrite/compaction. This means we can rely solely on #26651 to free up ledger disk space!

Screen Shot 2022-09-08 at 9 33 44 PM

@yhchiang-sol yhchiang-sol requested a review from steviez September 9, 2022 04:38
Contributor

@steviez steviez left a comment

Have you baked this any? I think it is likely good-to-go as is, but it probably wouldn't hurt to get some runtime on it.

I had one concern about column families with small kv-pair sizes "accumulating", given how small they are compared to our target SST file size. However, I see from metrics that these column families are getting flushed (likely due to your insight from DMs about the total WAL size limit getting hit). This means the SSTs are broken up with reasonable ranges such that they'll get picked up by delete_file_in_range_cf(). Thus, no accumulation and not a problem! (Even if there were accumulation, it probably would have been fairly small, and we probably could have lived with it.)
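
As a hedged illustration of how one might verify this, the sketch below lists live SST files and their key ranges via the `rocksdb` crate's live_files(); the LiveFile field names are assumed to match recent crate versions and may differ in yours.

```rust
// Hedged sketch: dump live SST files and their key ranges to sanity-check
// that small column families still get split into files with narrow ranges
// that delete_file_in_range_cf() can drop.
use rocksdb::DB;

fn dump_sst_ranges(db: &DB) -> Result<(), rocksdb::Error> {
    for f in db.live_files()? {
        // Files covering narrow slot ranges will be eligible for deletion even
        // if the files themselves are small.
        println!(
            "cf={} file={} level={} size={}B keys={:?}..{:?}",
            f.column_family_name, f.name, f.level, f.size, f.start_key, f.end_key
        );
    }
    Ok(())
}
```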

@yhchiang-sol
Contributor Author

Have you baked this any? I think it is likely good-to-go as is, but it probably wouldn't hurt to get some runtime on it.

I think the PR should be good to go. I am trying to collect more data points to show its performance benefit, although the ledger disk size already shows this change is safe to ship, since we can still keep the ledger size small with just #26651.

@steviez
Contributor

steviez commented Sep 10, 2022

I had a non-GCE node that was otherwise idle; I kicked it off with the tip of master plus this PR. Let's let it run over the weekend and push once the ledger has hit capacity; this should probably happen by Monday. I was trying to grab a second (comparable) machine to run master without the delete_files_in_range() change so that we could have a direct side-by-side, but couldn't find one.

although the ledger disk size already shows this change is safe to ship, since we can still keep the ledger size small with just #26651.

Technically, any testing you did for that PR would still have had periodic compactions at a more frequent interval, right? Again, I'm probably being paranoid, but the node is already running, so let's just let it soak over the weekend if that is good with you.

@yhchiang-sol
Contributor Author

The experimental results on my side, comparing this PR and master on mainnet-beta, show that removing the daily rewrite (with this PR, orange line) reduces ledger disk write-bytes-per-second by ~17% compared to master (green line) once the validator has run longer than a day, which is when the daily compaction would trigger without this PR.

Screen Shot 2022-09-11 at 9 46 28 PM

Screen Shot 2022-09-11 at 9 44 19 PM

@steviez
Contributor

steviez commented Sep 12, 2022

The experimental results on my side, comparing this PR and master on mainnet-beta, show that removing the daily rewrite (with this PR, orange line) reduces ledger disk write-bytes-per-second by ~17% compared to master (green line) once the validator has run longer than a day, which is when the daily compaction would trigger without this PR.

Screen Shot 2022-09-11 at 9 46 28 PM Screen Shot 2022-09-11 at 9 44 19 PM

My node ended up dying over the weekend from unrelated issues; but nice. For what duration was the lower rate sustained? If it is still going, do you still see that trend?

Also, assuming you're just looking at the disk rate for the entire disk, there is likely some I/O for accounts_index in there. I don't think there is a need to separate it out; just pointing out that, percentage-wise, the Rocks-specific I/O might have dropped by a higher percentage.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 12, 2022

My node ended up dying over the weekend from unrelated issues; but nice. For what duration was the lower rate sustained? If it is still going, do you still see that trend?

I have another node that collects data points for this. It will probably need a few more days to gather more data. So far the trend only starts after a day, when the daily compaction triggers on master.

Also, assuming you're just looking at the disk rate for the entire disk, there is likely some I/O for accounts_index in there. I don't think there is a need to separate it out; just pointing out that, percentage-wise, the Rocks-specific I/O might have dropped by a higher percentage.

Ahh, yep. I always forget about accounts_index. I should separate it out next time.

@yhchiang-sol
Contributor Author

Here are more data points from my previous run with only this PR (unfortunately, the master node died somehow during that time, so I only have data points for this PR).

So the current observations (and some of my predictions) from both sets of data points are:

  • within ~2 days:

    • this PR: no increase in write-bytes-per-second.
    • master: slightly higher write-bytes-per-second after a day.
  • after ~2 days:

    • this PR: write-bytes-per-second starts to increase as delete_files_in_range() kicks in.
    • master (no data points yet, but predicted): delete_files_in_range() kicks in as well, plus ~2 days' worth of SST files going through the daily rewrite/compaction. Collecting this will probably take another week, as I found one of my GCE nodes isn't stable and I already killed the instance.

Screen Shot 2022-09-12 at 2 51 30 PM

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 13, 2022

Discussed offline with @steviez about the experiments. It looks like the experiment was affected by the recent issue #27740, which makes the validator occasionally unable to make new roots; this might explain why it has been difficult for all of my validators in the experiments to run healthy for more than 5 days. Below is an example error.

[2022-09-13T17:43:55.707521368Z WARN  solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot 150355423 with hash Hzbuefkt7NtzLMCXpzzZKY8HAxLNeiTvBZJVkPvBq7ue, but our version has hash 2fL2Vh9rr6SikQPDPxeJvmbGVU2Mms5kRb7XfDrmZ5Yc

@yhchiang-sol
Contributor Author

Rebased to include #27752, which fixes the issue.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Sep 16, 2022

Here are the new data points! They show the effectiveness of this PR in reducing disk-write-bytes-per-second, and the results match my previous hypotheses. The instance with this PR (the green one) shows ~10% to 15% lower disk-write-bytes-per-second. Note that the actual percentage may also change depending on the --limit-ledger-size setting (a larger --limit-ledger-size gives a bigger win).

  • within ~2 days:
    • this PR: no increase in write-bytes-per-second.
    • master: slightly higher write-bytes-per-second after a day.
  • after ~2 days:
    • this PR: write-bytes-per-second starts to increase as delete_files_in_range() kicks in.
    • master (no data points yet, but predicted): delete_files_in_range() kicks in as well, plus ~2 days' worth of SST files going through the daily rewrite/compaction. Collecting this will probably take another week, as I found one of my GCE nodes isn't stable and I already killed the instance.

Screen Shot 2022-09-15 at 8 17 59 PM

Disk space usage of the two instances is also the same, indicating that delete_files_in_range() alone, without the daily compaction, is effective enough to free up ledger disk space!

Screen Shot 2022-09-15 at 8 25 54 PM

Contributor

@steviez steviez left a comment

LGTM! I think it is also worth calling attention to the fact that the read rate shows improvements as well, since the compaction filter involved scanning all keys, whereas delete_files_in_range() only needs to scan the metadata for each SST.

[screenshot]

@yhchiang-sol yhchiang-sol merged commit 9831e4d into solana-labs:master Sep 16, 2022
@steviez
Contributor

steviez commented Sep 16, 2022

Just leaving ourselves a note in case we look back in the future: this PR reverts us to letting RocksDB choose the default value for when files go through periodic compaction. That default value is still 30 days (as it was at the time the comment being deleted in this PR was written):
https://github.com/facebook/rocksdb/blob/0f91c72adc977c3895f8320b13ae4ef2b8633756/include/rocksdb/advanced_options.h#L854-L865

@yhchiang-sol
Contributor Author

Thanks @steviez for the additional comment.

Just leaving ourselves a note in case we look back in the future: this PR reverts us to letting RocksDB choose the default value for when files go through periodic compaction. That default value is still 30 days

Just adding some extra notes here. Since delete_files_in_range() from #26651 purges any file older than --limit-ledger-size, we will not see any compactions triggered by the 30-day policy unless --limit-ledger-size is configured to keep ledger data for more than 30 days (but we will still have regular compactions triggered when a level is full or its size hits the compaction trigger).
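
To make the resulting behavior concrete, here is an illustrative sketch under assumed values (not the validator's actual settings): periodic compaction is left at RocksDB's default, while size/level-triggered compactions keep running.

```rust
// Illustrative only (assumed values, not the validator's actual settings):
// with the 1-day override removed, periodic compaction falls back to RocksDB's
// default (~30 days), while regular size/level-triggered compactions still run.
use rocksdb::Options;

fn options_without_daily_rewrite() -> Options {
    let mut opts = Options::default();
    // Not calling set_periodic_compaction_seconds() keeps RocksDB's default.
    // Regular compactions are still driven by triggers such as this one, once
    // level 0 accumulates enough files.
    opts.set_level_zero_file_num_compaction_trigger(4);
    opts
}
```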

steviez added a commit to steviez/solana that referenced this pull request Jul 19, 2023
Periodic compaction was previously disabled on all columns in solana-labs#27571 in
favor of the delete_file_in_range() approach that solana-labs#26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but not as the primary index.

The result of periodic compaction not running on these columns is that
no storage space was being reclaimed from them. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
steviez added a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
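
A minimal sketch of that follow-up approach, with hypothetical column names and option plumbing (not the follow-up PR's actual code), might look like this:

```rust
// Minimal sketch, assuming hypothetical column names and per-column option
// plumbing: re-enable periodic compaction only for columns whose keys are not
// primarily indexed by slot, since slot-range file deletion cannot reclaim them.
use rocksdb::Options;

const PERIODIC_COMPACTION_SECONDS: u64 = 60 * 60 * 24; // once per day

fn cf_options(cf_name: &str) -> Options {
    let mut opts = Options::default();
    if matches!(cf_name, "transaction_status" | "address_signatures") {
        // Stale entries in these columns must still be dropped via compaction.
        opts.set_periodic_compaction_seconds(PERIODIC_COMPACTION_SECONDS);
    }
    opts
}
```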
mergify bot pushed a commit that referenced this pull request Jul 20, 2023, with the same message
(cherry picked from commit d73fa1b)