Online DDL: support for semi-idempotent migration via -combine-duplicate-ddl flag #8209
Conversation
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…bines repetitive migrations
I haven't looked at the code yet, but I've read the PR description. For Online DDL, it seems like there are two different operations, either of which can be (or not be) idempotent.

In the synchronous stage, a non-singleton Online DDL request is trivially idempotent, since all it has to do is queue the migration and return a UUID, and that can be done independent of the current state of the queue. For the synchronous stage, when we say idempotent, we mean that the operation can be re-sent any amount of time after the first one (including immediately after), and the client should be capable of receiving a successful response back with some valid UUID, regardless of the status of the first request. Under no circumstances should the original synchronous request somehow block the second synchronous request from succeeding (save perhaps for something like very short-lived locks to prevent dangerous concurrency between those synchronous requests).

On the other hand, the asynchronous stage is not necessarily idempotent. (It's also not retry-able even if the first request fully succeeds. This is arguably less important than the partial-failure case, but is still necessary to be fully idempotent. And it's still important, because this can be critical in the case where a network error occurs between the client and the VTGate, preventing the client from receiving the success response + UUID from the first request.) There were some proposals in #7825 about how to handle this.

I also understand that this proposal wouldn't be part of this PR #8209. But just wanted to clarify that I don't see this PR addressing it.
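The synchronous-stage property described above can be sketched in a few lines. This is a minimal illustration, not Vitess code; the `scheduler` type and its `submit` method are hypothetical stand-ins for the VTGate/tablet request path:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scheduler is a hypothetical stand-in for the migration queue. The point of
// the sketch: a synchronous submission only enqueues and returns a fresh UUID,
// independent of what is already queued, so re-sending is always safe.
type scheduler struct {
	mu      sync.Mutex // short-lived lock, only to guard the queue itself
	queue   []string
	counter uint64
}

// submit never inspects or blocks on earlier requests; re-sending the same
// DDL at any time (including immediately) yields a new, valid UUID.
func (s *scheduler) submit(ddl string) string {
	uuid := fmt.Sprintf("uuid-%d", atomic.AddUint64(&s.counter, 1))
	s.mu.Lock()
	s.queue = append(s.queue, ddl)
	s.mu.Unlock()
	return uuid
}

func main() {
	s := &scheduler{}
	first := s.submit("ALTER TABLE t ADD COLUMN c INT")
	second := s.submit("ALTER TABLE t ADD COLUMN c INT") // immediate retry
	fmt.Println(first, second, len(s.queue))
}
```

Even if the first response is lost to a network error, the client can safely re-send and obtain a valid UUID for the retry.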
Correct, it doesn't. This PR is specifically targeted at a narrower use case. I completely understand your argument, though.

I think this makes sense. I'll look into it in a different PR.

I agree.
Looked just at the main code, not the tests. Main code LGTM with some nitpick comments / questions.
```go
// getLastCompletedMigrationOnTable returns the UUID of the last successful migration to complete on given table, if any
func (e *Executor) getLastCompletedMigrationOnTable(ctx context.Context, table string) (found bool, uuid string, err error) {
```
Nitpick: function name is "last completed", but docstring says "last successful", which don't necessarily mean the same thing, depending on your terminology.
EDIT: Looks like your success status is named "complete". So I guess this function name is in line with the terminology you've already chosen. But I think the fact that you used the wording "last successful" in the docstring betrays the fact that "complete" might not be clear enough :P
go/vt/vttablet/onlineddl/executor.go (Outdated)

```go
pendingKeyspace := row["keyspace"].ToString()
pendingTable := row["mysql_table"].ToString()

if pendingKeyspace == e.keyspace && pendingTable == table {
```
Here you are filtering in-memory, but for `sqlSelectCompleteMigrationsOnTable` you are filtering in the SQL query. Why the difference? Concerns about inefficient use of the MySQL index if we were to do this one in SQL?
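To make the contrast concrete, here is a minimal sketch of the in-memory style being asked about (the `row` type is a hypothetical stand-in for a `_vt.schema_migrations` result row; the SQL shown in the comment is illustrative, not the actual Vitess query):

```go
package main

import "fmt"

// row is a hypothetical stand-in for one _vt.schema_migrations result row.
type row struct {
	Keyspace string
	Table    string
	UUID     string
}

// filterInMemory mirrors the in-memory approach: fetch all pending rows,
// then keep only those matching the keyspace and table in Go.
func filterInMemory(rows []row, keyspace, table string) []string {
	var uuids []string
	for _, r := range rows {
		if r.Keyspace == keyspace && r.Table == table {
			uuids = append(uuids, r.UUID)
		}
	}
	return uuids
}

// The alternative is to push the predicate into the query itself, e.g.
// (illustrative statement only):
//   SELECT migration_uuid FROM _vt.schema_migrations
//   WHERE keyspace=%a AND mysql_table=%a AND migration_status='queued'
// so MySQL filters (and can potentially use an index) before rows reach Go.

func main() {
	rows := []row{
		{"commerce", "orders", "u1"},
		{"commerce", "customers", "u2"},
		{"inventory", "orders", "u3"},
	}
	fmt.Println(filterInMemory(rows, "commerce", "orders"))
}
```

The trade-off is the usual one: SQL-side filtering reduces data transferred and can use an index, while in-memory filtering keeps the query simple when the pending set is known to be tiny.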
Is there any max size for `_vt.schema_migrations` before we start to garbage collect old rows? Or is it only time based? Or does it grow forever without GC?
Good question. There's no particular reasoning. I'm gonna refactor.
go/vt/vttablet/onlineddl/executor.go (Outdated)

```go
// This migration runs on some table. Let's see if the last migration to complete on the same table,
// has the exact same SQL statement as this one.
// If so, and because this migration is flagged with -combine-duplicate-ddl, we implicitly mark it as complete.
sameSQLasLastCompletedMigration := func() (same bool, completedUUID string, err error) {
```
Nitpick: Should this function be lifted out to a top-level function, to make `executeMigration` shorter and easier to read?
Yes! Refactoring
go/vt/vttablet/onlineddl/executor.go (Outdated)

```go
if len(pendingUUIDs) > 0 {
	return false, "", nil
}
found, completedUUID, err := e.getLastCompletedMigrationOnTable(ctx, onlineDDL.Table)
```
Here is where it would possibly be beneficial to name this method as "last successful".
If we interpret "completed" as "completed with any status, e.g. successful, failure, cancelled", then this code would seem to do the wrong thing, since we shouldn't skip the duplicate if the last migration wasn't successful.
The state `complete` is now set in stone, and it is synonymous with "successfully complete", as opposed to `failed` or `cancelled`. Sorry if the terminology is confusing, but there is only one interpretation of `complete` in this context.
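The terminology being pinned down here can be made explicit in code. This is an illustrative sketch only (the constant and function names are hypothetical, not the actual Vitess definitions), but it captures the invariant: `complete` always means success, never failure or cancellation:

```go
package main

import "fmt"

// migrationStatus sketches the terminal and non-terminal states discussed
// above. Names mirror the conversation; this is not the Vitess source.
type migrationStatus string

const (
	statusQueued    migrationStatus = "queued"
	statusRunning   migrationStatus = "running"
	statusComplete  migrationStatus = "complete" // always means *successfully* complete
	statusFailed    migrationStatus = "failed"
	statusCancelled migrationStatus = "cancelled"
)

// isSuccessful makes the invariant explicit: only 'complete' counts as
// success, never 'failed' or 'cancelled'.
func isSuccessful(s migrationStatus) bool {
	return s == statusComplete
}

func main() {
	for _, s := range []migrationStatus{statusComplete, statusFailed, statusCancelled} {
		fmt.Printf("%s -> successful=%v\n", s, isSuccessful(s))
	}
}
```

Any code that means "completed with any terminal status" would need a separate predicate, which is exactly why the docstring wording "last successful" was worth flagging.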
…SameSQLAsLastCompletedMigration function
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
I'm still stalling with this PR because I'm still unsure we've found the best path to address the issue of mass deployments. I don't want to introduce a syntax that will turn into a "backwards compatibility" liability. Giving this more time for thought.
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed.
This PR was closed because it has been stale for 7 days with no activity.
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
Description
This PR is the result of a lengthy discussion on #7825 (comment) with @jmoldow
The idea: the user has 100 shards, they run some online DDL. It succeeds on 82 shards, and either fails or is unknown for 18 shards. What does the user do next? They want to try the migration again. How?
But this requires the user to take notes and figure out exactly which shards failed, then run the `ALTER` statement on the remaining 18 shards. The alternative is the `-combine-duplicate-ddl` flag, offered in this PR. Details follow.

In the #7825 discussion, we were trying to figure out a heuristic for Vitess to say "well, it looks like the user is retrying a migration that already passed on some shards, therefore I'll be smart about it and only apply it where it hasn't passed yet". If you follow that discussion, you'll see that we end up in a complex decision tree where heuristically it sometimes makes sense to make that call, but sometimes not.
What I've decided to do here (and per conclusion from that discussion) is to not burden Vitess with that decision. The user knows better.
The user can now run a migration with `@@ddl_strategy='<whichever> -combine-duplicate-ddl'`, where `<whichever>` stands for either `online`, `gh-ost`, or `pt-osc`.

When a migration's turn to run arrives, and the migration happens to be flagged with `-combine-duplicate-ddl`, then Vitess checks: … then, we implicitly mark this migration as `complete`.
Notes:

- With or without the `-combine-duplicate-ddl` flag, the migration gets its own UUID. The fact that the migration has duplicate migration statement text does not matter. It's a new migration with its own identity and tracking.
- A migration is never marked as `complete` retroactively. Only going forward.

Looking for feedback.
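The decision described above can be sketched as a small pure function. This is illustrative only, not the actual executor code; the function and parameter names are hypothetical, but the conditions mirror the description: the flag must be set, no migration may be pending on the table, and the last completed migration must carry the exact same statement:

```go
package main

import "fmt"

// shouldImplicitlyComplete sketches (illustratively, not as Vitess code) the
// -combine-duplicate-ddl decision: a flagged migration is implicitly marked
// 'complete' only when nothing is pending on the table and the last
// successfully completed migration ran the exact same statement.
func shouldImplicitlyComplete(flagged bool, pendingUUIDs []string, lastCompletedSQL, thisSQL string) bool {
	if !flagged {
		return false // without the flag, always run the migration normally
	}
	if len(pendingUUIDs) > 0 {
		return false // something is still queued or running on this table
	}
	if lastCompletedSQL == "" {
		return false // no completed migration to compare against
	}
	return lastCompletedSQL == thisSQL
}

func main() {
	sql := "ALTER TABLE t ADD COLUMN c INT"
	fmt.Println(shouldImplicitlyComplete(true, nil, sql, sql))           // duplicate of last success
	fmt.Println(shouldImplicitlyComplete(true, []string{"u1"}, sql, sql)) // pending migration exists
	fmt.Println(shouldImplicitlyComplete(false, nil, sql, sql))          // flag not set
}
```

In the 100-shard scenario from the description, this is what lets a blanket retry be cheap: the 82 shards where the identical statement already completed return immediately, while the remaining 18 run the migration for real.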
Related Issue(s)
Checklist