Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Online DDL: improve retry of vreplication errors with vitess ALTER TABLE migrations #12323

Merged

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Feb 12, 2023

Description

PEr #12322, Online DDL gives up too soon in some scenarios of vitess migrations; the scenarios are where there's a temporary, recoverable error on vreplication's side; Online DDL chooses to terminate the migration rather than wait for vreplication to recover.

The problem is that it's hard for Online DDL to know when a VReplication error is terminal or recoverable.

The main change in this PR is for the Online DDL scheduler to allow retry/timeout on VReplication errors, by using LastError, refactored in #12321. This in effect allows up to 10min of retries (with at least one check per minute) before Online DDL gives up on the vreplication stream.

So instead of analyzing specific errors, we just wait/retry on vreplication up to a timeout.

Other changes in this PR, while we're here:

  • Proper cleanup of ownedRunningMigrations upon Open(). This may have also contributed on onlineddl_vrepl errors, though I'm not sure, and the cleanup presented here is required either way.
  • Change isOpen from bool to int64 and use with atomic.LoadInt64()/StoreInt64() as the reads/writes to this variables may be concurrent.
  • More information in the logs when adopting a migration from a previous failed/demoted tablet
  • More information in the logs when running gcArtifacts

Related Issue(s)

Fixes #12322
Extends #12321 (not merged yet)
Depends on #12327

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Feb 12, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 12, 2023
@deepthi deepthi added Backport to: release-16.0 and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 12, 2023
@shlomi-noach
Copy link
Contributor Author

So on the flip side, the onlineddl_vrepl_suite tests fail. The failures are for the fail-* tests; that is, these are tests that expect to fail. But the new wait/retry logic just... keeps waiting. The failure truly are terminal, but Online DDL waits for some 10min to be sure. However, the tests cannot each wait for 10min.

Gonna have to do some thinking here.

… error' and can be retried

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

All right, now trying to differentiate and identify those workflows which vreplication already admits to be completely failed.

@shlomi-noach
Copy link
Contributor Author

Not good enough. For example, the fail-cannot-be-null test causes this scenario in VReplication:


*************************** 1. row ***************************
                   id: 1
             workflow: 6a363eb7_ab59_11ed_b253_0a43f95f28a3
               source: keyspace:"commerce" shard:"0" filter:{rules:{match:"_6a363eb7_ab59_11ed_b253_0a43f95f28a3_20230213044645_vrepl" filter:"select `id` as `id`, `val` as `val`, `ts` as `ts` from `onlineddl_test`" source_unique_key_columns:"id" target_unique_key_columns:"id" sour
ce_unique_key_target_columns:"id"}}
                  pos: MySQL56/27770cc8-ab59-11ed-b694-0a43f95f28a3:1-68
             stop_pos: NULL
              max_tps: 9223372036854775807
  max_replication_lag: 9223372036854775807
                 cell:
         tablet_types: in_order:REPLICA,PRIMARY
         time_updated: 1676263623
transaction_timestamp: 1676263623
                state: Copying
              message: task error: failed inserting rows: Column 'val' cannot be null (errno 1048) (sqlstate 23000) during query: insert into _6a363eb7_ab59_11ed_b253_0a43f95f28a3_20230213044645_vrepl(id,val,ts) values (1,2,'2023-02-13 04:46:16'), (2,3,'2023-02-13 04:46:16'), (3,nu
ll,'2023-02-13 04:46:16')
              db_name: vt_commerce
          rows_copied: 0
                 tags:
       time_heartbeat: 1676263623
        workflow_type: 5
       time_throttled: 0
  component_throttled:
    workflow_sub_type: 0
 defer_secondary_keys: 0

And vreplication still does not see this as terminal. Looking further.

@shlomi-noach
Copy link
Contributor Author

Error code 1048 is already listed as "unrecoverable error" in vreplication, here:

So I'm not sure why VReplication itself did not completely bail out?

@shlomi-noach
Copy link
Contributor Author

see #12326 ; this PR will continue to fail CI until #12326 is resolved.

@shlomi-noach
Copy link
Contributor Author

Depends on #12327

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I had some minor comments that you can resolve as you think best.

tickReentranceFlag int64
reviewedRunningMigrationsFlag bool

ticks *timer.Timer
isOpen bool
isOpen int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, v16 requires go 1.19 and it has atomic bool support: https://pkg.go.dev/sync/atomic@go1.19#Bool

But using that would prevent us from potentially back porting beyond v16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, I did not realize that!

go/vt/vttablet/onlineddl/executor.go Outdated Show resolved Hide resolved
go/vt/vttablet/onlineddl/vrepl.go Show resolved Hide resolved
go/vt/vttablet/onlineddl/vrepl.go Outdated Show resolved Hide resolved
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did seem to cause some valid test failures/breakage in the onlineddl_vrepl_suite workflows so we'll need to fix that.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

This did seem to cause some valid test failures/breakage in the onlineddl_vrepl_suite workflows so we'll need to fix that.

Yes, as mentioned in #12323 (comment) and #12323 (comment) ; the errors will resolve when we merge #12327

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@frouioui frouioui mentioned this pull request Feb 16, 2023
51 tasks
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

merged #12327 ; now this PR should pass all tests

@shlomi-noach
Copy link
Contributor Author

Upgrade Downgrade tests are flaky and taking very long; I'm going to merge this PR now.

@shlomi-noach shlomi-noach merged commit 8eacec8 into vitessio:main Feb 22, 2023
@shlomi-noach shlomi-noach deleted the onlineddl-last-error-vitess-migrations branch February 22, 2023 09:59
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Feb 22, 2023

I was unable to backport this Pull Request to the following branches: release-16.0.

@shlomi-noach
Copy link
Contributor Author

I will cherry pick this PR manually, along with #12327

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Online DDL scheduler gives up too early on VReplication error (for vitess migrations)
4 participants