Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto resume the DB from Retryable IO Error #6765

Closed
wants to merge 8 commits into from

Conversation

zhichao-cao
Copy link
Contributor

In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Test plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao Thanks for the PR! I've partially reviewed and added some comments. Will continue to review.

db/error_handler.cc Show resolved Hide resolved
db/error_handler.cc Outdated Show resolved Hide resolved
db/error_handler.cc Show resolved Hide resolved
db/error_handler.cc Outdated Show resolved Hide resolved
db/error_handler.h Show resolved Hide resolved
db/error_handler.cc Show resolved Hide resolved
db/db_impl/db_impl.cc Outdated Show resolved Hide resolved
db/compaction/compaction_job.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just had a couple of comments.

db/error_handler.cc Outdated Show resolved Hide resolved
db/error_handler.cc Outdated Show resolved Hide resolved
db/compaction/compaction_job.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @zhichao-cao!

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@@ -1144,6 +1144,22 @@ struct DBOptions {
// not be used for recovery if best_efforts_recovery is true.
// Default: false
bool best_efforts_recovery = false;

// It defines How many times call db resume when retryable BGError happens.
Copy link
Contributor

@ajkr ajkr Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does somebody want to tune these knobs? How about always making infinite recovery attempts with a one second gap?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good idea. We probably just need an option to enable/disable the functionality in case of bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anand1976 You mean, I can remove the options of max_bgerror_resume_count and bgerror_resume_retry_interval, set value directly in the resume function. Instead, we give a option for user to enable or disable auto resume? The resume will be infinite attempts as suggested by Andrew?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao Thinking about this some more, I think there's value in allowing the retry interval to be adjusted. The flushes initiated by the recovery threads share the same background thread pool as regular flushes, so if you have a mix of good and bad DBs, changing the retry interval will allow a user to control resource consumption for recovery. The retry count is debatable. We could set the default to infinite and user can set it to 0 to disable. I don't know if there would be a reason to choose some value in between. Again, maybe to limit resource consumption.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

1 similar comment
@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for diligently fixing the tests!

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao merged this pull request in a10f12e.

jay-zhuang added a commit to jay-zhuang/rocksdb that referenced this pull request Jul 17, 2020
facebook-github-bot pushed a commit that referenced this pull request Jul 18, 2020
Summary:
Remove the 3 testing cases that cause the time out in linux build by #6765 . Will fix them later.

Pull Request resolved: #7141

Test Plan: make asan_check, buck run

Reviewed By: ajkr

Differential Revision: D22593831

Pulled By: zhichao-cao

fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants