Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): handle inconsistency between data and commit in ReadRecoveryPoints #545

Merged
merged 2 commits into from
Jan 2, 2024

Conversation

ijsong
Copy link
Member

@ijsong ijsong commented Jul 25, 2023

What this PR does

Previously the method internal/(*storage).ReadRecoveryPoints returned an error when it found an inconsistency between data and commit for log entries, which could happen when there was no data for a commit. It should not have happened when we used the unified database in storage, so the method returned an error.
However, using separate databases in storage and turning off the WAL sync option can happen; for instance, the data for a log entry can be lost, although the commit for that log entry is synced.

This PR makes the ReadRecoveryPoints not return an error for the inconsistency between data and commit for a log entry. It tries to find the first and last log entries that have no inconsistency. If there are no valid first and last, it returns nil for them and lets them be resolved through synchronization.

@ijsong ijsong force-pushed the storage_refactor_recoverypoints branch from 03e72df to 284c477 Compare July 25, 2023 03:48
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from 7048516 to 11f87a6 Compare July 25, 2023 03:48
@ijsong ijsong self-assigned this Jul 25, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jul 25, 2023

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (aa6789e) 61.19% compared to head (8832c06) 61.40%.

Files Patch % Lines
internal/storage/recovery_points.go 89.28% 6 Missing and 3 partials ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #545      +/-   ##
==========================================
+ Coverage   61.19%   61.40%   +0.20%     
==========================================
  Files         144      144              
  Lines       19213    19270      +57     
==========================================
+ Hits        11757    11832      +75     
+ Misses       6867     6851      -16     
+ Partials      589      587       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return nil, nil, err
last = s.getLastLogSequenceNumber(cit, dit, first)
if last == nil {
s.logger.Warn("the last must exist but could not be found.", zap.Stringer("first", first))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a case that could happen? If it happend, rp.CommittedLogEntry.Last[L33] is nil and it means maybe trimmed[internal/storagenode/logstream/executor.go:L650]. Are there any side effects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen because the first log entry already exists. The first log entry is also the last if only a single log entry exists. Therefore, this if block is unnecessary but just added for debugging in weird situations.

These similar anomalies, including this if block, can be fixed through synchronization. But as you said, we need to know whether this replica has been trimmed. I think the global low watermark can be helpful.

@ijsong
Copy link
Member Author

ijsong commented Jul 28, 2023

@hungryjang, I will recheck this PR since it can be affected by #492.
Usually, I assumed that all writes in the storage module are executed in a fsync way and unified data and commit db. But it was wrong. Separated data and commit db and the non-fsync writes can make subtle situations.

@ijsong ijsong force-pushed the storage_refactor_recoverypoints branch from 284c477 to ece04d6 Compare July 28, 2023 10:25
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from 11f87a6 to 5957cad Compare July 28, 2023 10:25
Base automatically changed from storage_refactor_recoverypoints to main July 28, 2023 11:03
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from 5957cad to 47deb7d Compare July 31, 2023 00:48
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch 2 times, most recently from e33a700 to d2dd322 Compare October 4, 2023 10:26
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from d2dd322 to aca606d Compare October 24, 2023 01:39
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from aca606d to 0f02a73 Compare November 16, 2023 01:04
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch 2 times, most recently from 73aefd7 to 2859a89 Compare December 1, 2023 08:07
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from 2859a89 to b76599c Compare December 28, 2023 05:51
@ijsong
Copy link
Member Author

ijsong commented Dec 28, 2023

@hungryjang, I will recheck this PR since it can be affected by #492. Usually, I assumed that all writes in the storage module are executed in a fsync way and unified data and commit db. But it was wrong. Separated data and commit db and the non-fsync writes can make subtle situations.

@hungryjang, can you take a look at this PR? I will finish this issue with this PR and make a new follow-up PR if a new issue arises. Since it passes a long time, I can't remember other issues about those. 🥲

…coveryPoints

Previously the method internal/(*storage).ReadRecoveryPoints returned an error when it found an
inconsistency between data and commit for log entries, which could happen when there was no data for
a commit. It should not have happened when we used the unified database in storage, so the method
returned an error.
However, using separate databases in storage and turning off the WAL sync option can happen; for
instance, the data for a log entry can be lost, although the commit for that log entry is synced.

This PR makes the ReadRecoveryPoints not return an error for the inconsistency between data and
commit for a log entry. It tries to find the first and last log entries that have no inconsistency.
If there are no valid first and last, it returns nil for them and lets them be resolved through
synchronization.
@ijsong ijsong force-pushed the storage_readrecoverypoints_inconsistency branch from b76599c to 8832c06 Compare January 2, 2024 00:19
@ijsong
Copy link
Member Author

ijsong commented Jan 2, 2024

Merge activity

  • Jan 1, 8:00 PM: @ijsong started a stack merge that includes this pull request via Graphite.
  • Jan 1, 8:01 PM: @ijsong merged this pull request with Graphite.

@ijsong ijsong merged commit 208c113 into main Jan 2, 2024
16 checks passed
@ijsong ijsong deleted the storage_readrecoverypoints_inconsistency branch January 2, 2024 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants