Online DDL: improve retry of vreplication errors with vitess `ALTER TABLE` migrations #12323
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
So on the flip side, the Gonna have to do some thinking here. |
… error' and can be retried Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
All right, now trying to differentiate and identify those workflows which |
Not good enough. For example, the
And |
Error code
So I'm not sure why VReplication itself did not completely bail out? |
Depends on #12327 |
LGTM! I had some minor comments that you can resolve as you think best.
      tickReentranceFlag            int64
      reviewedRunningMigrationsFlag bool

      ticks  *timer.Timer
    - isOpen bool
    + isOpen int64
FWIW, v16 requires go 1.19 and it has atomic bool support: https://pkg.go.dev/sync/atomic@go1.19#Bool
But using that would prevent us from potentially back porting beyond v16.
Oh interesting, I did not realize that!
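For reference, the two options discussed in this thread can be sketched as below. This is a minimal illustration, not the PR's code: the `executor` type and its methods are hypothetical stand-ins, and only the `isOpen int64` field mirrors the diff. The calls used (`atomic.LoadInt64`, `atomic.StoreInt64`) are from the standard `sync/atomic` package.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// executor is a hypothetical stand-in; only the isOpen field mirrors the diff.
type executor struct {
	// 0 means closed, 1 means open. Accessed only through sync/atomic,
	// since readers and writers may run on different goroutines.
	isOpen int64
}

// Open marks the executor as open.
func (e *executor) Open() { atomic.StoreInt64(&e.isOpen, 1) }

// Close marks the executor as closed.
func (e *executor) Close() { atomic.StoreInt64(&e.isOpen, 0) }

// IsOpen reports the current state without a data race.
func (e *executor) IsOpen() bool { return atomic.LoadInt64(&e.isOpen) > 0 }

func main() {
	e := &executor{}

	// Concurrent readers: safe because every access goes through sync/atomic.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = e.IsOpen()
		}()
	}

	e.Open()
	wg.Wait()
	fmt.Println("open:", e.IsOpen())
}
```

On Go 1.19 and later, the same field could be declared as `atomic.Bool` and accessed with its `Load()` and `Store()` methods; the `int64` form above also compiles on older toolchains, which is the backporting concern raised in the comment.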
This did seem to cause some valid test failures/breakage in the onlineddl_vrepl_suite workflows, so we'll need to fix that.
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Yes, as mentioned in #12323 (comment) and #12323 (comment); the errors will resolve when we merge #12327.
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
lgtm
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Merged #12327; now this PR should pass all tests.
I was unable to backport this Pull Request to the following branches:
I will cherry pick this PR manually, along with #12327.
Description
Per #12322, Online DDL gives up too soon in some scenarios of `vitess` migrations; the scenarios are where there's a temporary, recoverable error on `vreplication`'s side; Online DDL chooses to terminate the migration rather than wait for `vreplication` to recover.

The problem is that it's hard for Online DDL to know when a VReplication error is terminal or recoverable.

The main change in this PR is for the Online DDL scheduler to allow retry/timeout on `VReplication` errors, by using `LastError`, refactored in #12321. This in effect allows up to `10min` of retries (with at least one check per minute) before Online DDL gives up on the vreplication stream.

So instead of analyzing specific errors, we just wait/retry on `vreplication` up to a timeout.

Other changes in this PR, while we're here:
- `ownedRunningMigrations` upon `Open()`. This may have also contributed to `onlineddl_vrepl` errors, though I'm not sure, and the cleanup presented here is required either way.
- `isOpen` from `bool` to `int64` and use with `atomic.LoadInt64()/StoreInt64()`, as the reads/writes to this variable may be concurrent.
- `gcArtifacts`
Related Issue(s)
Fixes #12322
Extends #12321 (not merged yet)
Depends on #12327
Checklist
Deployment Notes