Auto resume the DB from Retryable IO Error #6765

zhichao-cao · 2020-04-28T05:11:32Z

In the current codebase, if a retryable IO error happens in the write path, SetBGError is called. The retryable IO error is converted to a hard error and the DB enters read-only mode; the user or application needs to resume it. With this PR, if a retryable IO error happens in a DB, SetBGError creates a new thread that calls Resume (auto resume). options.max_bgerror_resume_count controls whether auto resume is enabled (if max_bgerror_resume_count <= 0, auto resume is disabled). options.bgerror_resume_retry_interval controls how long to wait before calling Resume again when the previous resume failed with a retryable IO error. If a non-retryable error happens during resume, auto resume terminates.
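The retry behavior described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the code in this PR: `ResumeResult`, `AutoResumeOptions`, and `AutoResume` are hypothetical names, the two fields merely mirror the option names from the description, and the loop runs synchronously rather than in the separate thread that SetBGError creates.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of one resume attempt (not a RocksDB type).
enum class ResumeResult { kOk, kRetryableError, kFatalError };

// Hypothetical stand-ins for the two options named in the description.
struct AutoResumeOptions {
  int max_bgerror_resume_count;  // <= 0 means auto resume is disabled
  std::chrono::microseconds bgerror_resume_retry_interval;
};

// Calls `resume` until it succeeds, returns a non-retryable error, or the
// attempt budget is used up. Returns true only if a resume succeeded.
// `attempts_made`, if non-null, receives the number of calls to `resume`.
bool AutoResume(const AutoResumeOptions& opts,
                const std::function<ResumeResult()>& resume,
                int* attempts_made = nullptr) {
  assert(opts.bgerror_resume_retry_interval.count() >= 0);
  int attempts = 0;
  bool resumed = false;
  // With a count <= 0 the loop body never runs, so nothing is attempted.
  while (attempts < opts.max_bgerror_resume_count) {
    ++attempts;
    const ResumeResult result = resume();
    if (result == ResumeResult::kOk) {
      resumed = true;
      break;
    }
    if (result == ResumeResult::kFatalError) {
      break;  // non-retryable error: stop retrying
    }
    // Retryable error: wait before the next attempt, if any remain.
    if (attempts < opts.max_bgerror_resume_count) {
      std::this_thread::sleep_for(opts.bgerror_resume_retry_interval);
    }
  }
  if (attempts_made != nullptr) {
    *attempts_made = attempts;
  }
  return resumed;
}
```

A caller would pass a callable that wraps the actual resume operation and maps its status to one of the three outcomes; how that mapping is done is left open here.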

Test plan: Added unit test cases to error_handler_fs_test; passes make asan_check.

db/compaction/compaction_job.cc

anand1976

@zhichao-cao Thanks for the PR! I've partially reviewed and added some comments. Will continue to review.

db/error_handler.cc

db/error_handler.h

db/error_handler.cc

db/db_impl/db_impl.cc

db/compaction/compaction_job.cc

anand1976

Looks great! Just had a couple of comments.

db/error_handler.cc

db/compaction/compaction_job.cc

anand1976

LGTM. Thanks @zhichao-cao!

db/error_handler.cc

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ajkr · 2020-06-08T23:55:40Z

include/rocksdb/options.h

@@ -1144,6 +1144,22 @@ struct DBOptions {
  // not be used for recovery if best_efforts_recovery is true.
  // Default: false
  bool best_efforts_recovery = false;
+
+  // It defines How many times call db resume when retryable BGError happens.


Does somebody want to tune these knobs? How about always making infinite recovery attempts with a one second gap?

Yes, that's a good idea. We probably just need an option to enable/disable the functionality in case of bugs.

@anand1976 You mean I can remove the max_bgerror_resume_count and bgerror_resume_retry_interval options and set the values directly in the resume function, and instead give the user an option to enable or disable auto resume? The resume would then make infinite attempts, as suggested by Andrew?

@zhichao-cao Thinking about this some more, I think there's value in allowing the retry interval to be adjusted. The flushes initiated by the recovery threads share the same background thread pool as regular flushes, so if you have a mix of good and bad DBs, changing the retry interval will allow a user to control resource consumption for recovery. The retry count is debatable. We could set the default to infinite and user can set it to 0 to disable. I don't know if there would be a reason to choose some value in between. Again, maybe to limit resource consumption.
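To make the resource-consumption point concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each failing DB starts one resume attempt immediately and then at most one more per retry interval, and that each attempt itself takes negligible time; `MaxResumeAttempts` is a hypothetical helper, not part of RocksDB.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Upper bound on the number of resume attempts started within `window`
// when `bad_dbs` databases each retry at most once per `retry_interval`.
// Assumes attempts take negligible time (an idealization).
std::int64_t MaxResumeAttempts(std::int64_t bad_dbs,
                               std::chrono::microseconds retry_interval,
                               std::chrono::microseconds window) {
  assert(bad_dbs >= 0);
  assert(retry_interval.count() > 0);
  assert(window.count() >= 0);
  // One immediate attempt, plus one for every full interval in the window.
  const std::int64_t per_db = 1 + window.count() / retry_interval.count();
  return bad_dbs * per_db;
}
```

Under these assumptions, 4 failing DBs with a one-second interval could start up to 244 attempts in a minute, while a ten-second interval caps that at 28, which is the kind of trade-off the retry interval lets a user make.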

facebook-github-bot · 2020-07-06T22:06:56Z

@zhichao-cao has updated the pull request. Re-import the pull request

facebook-github-bot · 2020-07-06T23:11:37Z

@zhichao-cao has updated the pull request. Re-import the pull request

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-07-11T02:37:21Z

@zhichao-cao has updated the pull request. Re-import the pull request

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-07-11T02:43:40Z

@zhichao-cao has updated the pull request. Re-import the pull request

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

anand1976

LGTM. Thanks for diligently fixing the tests!

facebook-github-bot · 2020-07-14T21:51:55Z

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-07-15T00:02:28Z

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2020-07-15T00:03:52Z

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-07-15T18:38:37Z

@zhichao-cao merged this pull request in a10f12e.

This reverts commit a10f12e.

Summary: Remove the 3 test cases added by #6765 that cause timeouts in the Linux build. Will fix them later.
Pull Request resolved: #7141
Test Plan: make asan_check, buck run
Reviewed By: ajkr
Differential Revision: D22593831
Pulled By: zhichao-cao
fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e

zhichao-cao requested review from siying and anand1976 April 28, 2020 05:11

facebook-github-bot added the CLA Signed label Apr 28, 2020

zhichao-cao force-pushed the auto_resume branch from cd5e8b0 to 8f46055 Compare April 28, 2020 17:54

zhichao-cao commented Apr 29, 2020

db/compaction/compaction_job.cc (outdated, resolved)

zhichao-cao force-pushed the auto_resume branch from 8f46055 to 0cfdc6b Compare April 29, 2020 22:12

anand1976 reviewed May 9, 2020

zhichao-cao force-pushed the auto_resume branch from 0cfdc6b to f0e16cb Compare May 18, 2020 23:59

anand1976 reviewed May 20, 2020

db/error_handler.cc (outdated, resolved)

db/error_handler.cc (outdated, resolved)

db/compaction/compaction_job.cc (outdated, resolved)

zhichao-cao force-pushed the auto_resume branch from f0e16cb to b24a84b Compare May 30, 2020 00:40

zhichao-cao requested a review from anand1976 June 3, 2020 00:03

anand1976 approved these changes Jun 4, 2020

zhichao-cao force-pushed the auto_resume branch from b24a84b to 4fbcf61 Compare June 4, 2020 23:38

riversand963 reviewed Jun 5, 2020

db/error_handler.cc (resolved)

riversand963 reviewed Jun 5, 2020

db/error_handler.cc (resolved)

facebook-github-bot reviewed Jun 6, 2020

ajkr reviewed Jun 8, 2020

zhichao-cao force-pushed the auto_resume branch from 4fbcf61 to d99113f Compare July 6, 2020 22:06

facebook-github-bot reviewed Jul 6, 2020

zhichao-cao force-pushed the auto_resume branch from 3841416 to 77f042f Compare July 11, 2020 02:37

facebook-github-bot reviewed Jul 11, 2020

zhichao-cao force-pushed the auto_resume branch from 77f042f to a64838b Compare July 11, 2020 02:43

facebook-github-bot reviewed Jul 11, 2020

anand1976 approved these changes Jul 14, 2020

zhichao-cao force-pushed the auto_resume branch from a64838b to 0e88d27 Compare July 14, 2020 21:51

facebook-github-bot reviewed Jul 14, 2020

zhichao-cao force-pushed the auto_resume branch from 0e88d27 to 1939381 Compare July 15, 2020 00:02

zhichao-cao added 8 commits July 14, 2020 17:03

inital change, pass the make, cannot run the error_handler_test, maybe deadlock

e11dd30

Find the missing IO Status pass cases, write the inital testing code for auto resume, code works

d34d61d

Added the test for each step, initial version

dfb9492

Cleanup the code

1951b05

Address review comments

36cdc51

Added explanation to HISTORY.md

500607b

Remove compaction from auto resume

8a139f8

Added the test case for compaction recover

e2b5396

zhichao-cao force-pushed the auto_resume branch from 1939381 to e2b5396 Compare July 15, 2020 00:03

facebook-github-bot reviewed Jul 15, 2020

facebook-github-bot closed this in a10f12e Jul 15, 2020

facebook-github-bot added the Merged label Jul 15, 2020

zhichao-cao mentioned this pull request Jul 17, 2020

Remove time out testing cases in error_handler_fs_test #7141

Closed

jay-zhuang added a commit to jay-zhuang/rocksdb that referenced this pull request Jul 17, 2020

Revert "Auto resume the DB from Retryable IO Error (facebook#6765)"

cd8bbbc

This reverts commit a10f12e.

karelrooted mentioned this pull request Apr 21, 2021

Upgrade rocksdb to latest tag v6.19.3 apache/kvrocks#226

Merged
