
Online DDL: support for semi-idempotent migration via -combine-duplicate-ddl flag #8209

Conversation

shlomi-noach
Contributor

@shlomi-noach shlomi-noach commented May 31, 2021

Description

This PR is the result of a lengthy discussion on #7825 (comment) with @jmoldow

The idea: the user has 100 shards, they run some online DDL. It succeeds on 82 shards, and either fails or is unknown for 18 shards. What does the user do next? They want to try the migration again. How?

  • Online DDL: explicit target shards via ddl_strategy #7825 (draft) lets the user explicitly specify the shards they want to run the migration on.
    But this requires the user to take notes and figure out exactly which shards failed.
  • The user can use declarative migrations, which long term we may encourage them to do; but the user may not have used declarative migrations yet, and they still just want to run their ALTER statement on the remaining 18 shards.
  • Or, the user can use -combine-duplicate-ddl flag, offered in this PR. Details follow.

In the #7825 discussion, we were trying to figure out a heuristic for Vitess to say "well, it looks like the user is retrying a migration that already passed on some shards, therefore I'll be smart about it and only apply it where it hasn't passed yet". If you follow that discussion, you'll see that we end up with a complex decision tree where the heuristic sometimes makes the right call, but sometimes does not.

What I've decided to do here (and per conclusion from that discussion) is to not burden Vitess with that decision. The user knows better.

The user can now run a migration with @@ddl_strategy='<whichever> -combine-duplicate-ddl'. <whichever> stands for either online, gh-ost, or pt-osc.
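For illustration, a retry under this flag might look like the following (the table and column names are hypothetical; the strategy string follows the syntax above):

```sql
-- Pick one of online / gh-ost / pt-osc and append the flag:
set @@ddl_strategy='online -combine-duplicate-ddl';

-- Re-issuing this exact statement after a partial failure lets Vitess
-- mark it complete on shards where the identical ALTER already succeeded:
alter table customer add column notes varchar(100);
```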

When a migration's turn to run arrives, and the migration happens to be flagged with -combine-duplicate-ddl flag, then Vitess checks:

  • This migration runs on some table T
  • If there is no other migration pending on T
  • and if the last successful migration to complete on T is identical (alter statement text comparison) to this migration

...then, we implicitly mark this migration as complete.
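For illustration, the check above can be sketched in Go. All types and names here are hypothetical simplifications; the actual executor logic works against _vt.schema_migrations and differs in detail:

```go
package main

import "fmt"

// migration is a hypothetical, simplified stand-in for a row in
// _vt.schema_migrations; the real executor's fields differ.
type migration struct {
	table  string
	sql    string
	status string // "queued", "running", "complete", "failed", ...
}

// shouldImplicitlyComplete sketches the -combine-duplicate-ddl check:
// mark the incoming migration complete without running it if no other
// migration is pending on its table, and the last successful migration
// on that table has the exact same statement text. The history slice is
// assumed to exclude the incoming migration itself.
func shouldImplicitlyComplete(m migration, history []migration) bool {
	var lastCompletedSQL string
	var found bool
	for _, h := range history {
		if h.table != m.table {
			continue
		}
		switch h.status {
		case "queued", "running":
			// Another migration is still pending on this table: don't combine.
			return false
		case "complete":
			lastCompletedSQL, found = h.sql, true
		}
	}
	return found && lastCompletedSQL == m.sql
}

func main() {
	history := []migration{
		{table: "t", sql: "ALTER TABLE t ADD COLUMN c INT", status: "complete"},
	}
	dup := migration{table: "t", sql: "ALTER TABLE t ADD COLUMN c INT", status: "queued"}
	fmt.Println(shouldImplicitlyComplete(dup, history)) // true: identical to last success
}
```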

Notes:

  • We only run this logic for a single migration at a time
  • And only when it's the migration's turn to execute
  • The user can submit the same migration 10 times with the -combine-duplicate-ddl flag
    • the 1st one will complete
    • the 2nd one will be evaluated, then marked as complete
    • the 3rd one will be evaluated, then marked as complete
    • ...and so forth...
    • so we don't pre-evaluate all 10 migrations ahead of time, even if they are all in the queue
  • Whether with the -combine-duplicate-ddl flag or without, the migration gets its own UUID. The fact that the migration has duplicate statement text does not matter: it is a new migration with its own identity and tracking.
  • We don't mark migrations as complete retroactively, only going forward.
    • So if you run the same migration twice, and you want to validate that it's complete over your 100 shards, you should check the result of your 2nd execution rather than the 1st.
  • If the user runs a different migration on table T in between, then that breaks the sequence.

Looking for feedback.

Related Issue(s)

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature) labels May 31, 2021
@jmoldow
Contributor

jmoldow commented Jun 2, 2021

I haven't looked at the code yet, but I've read the PR description.

The description doesn't mention -singleton, and in particular -singleton -declarative. That doesn't necessarily mean this proposal fails to address the concern I had about -singleton -declarative, but it doesn't seem like it does. Let's explore this more.

For Online DDL, it seems like there are two different operations, either of which can be (or not be) idempotent.

  • The synchronous client request to schedule an Online DDL, which returns either a UUID or an error.
  • The asynchronous execution of the Online DDL, which can either succeed, or fail, or fail-then-retry-and-succeed, or be cancelled, etc.

In the synchronous stage, a non-singleton Online DDL request is trivially idempotent, since all it has to do is queue the migration and return a UUID, and that can be done independent of the current state of the queue.

Whereas in the asynchronous stage, a -declarative Online DDL migration is trivially idempotent.

For the synchronous stage, when we say idempotent, we mean that the operation can be re-sent any amount of time after the first one (including immediately after), and the client should be capable of receiving a successful response back with some valid UUID, regardless of the status of the first request. Under no circumstances should the original synchronous request somehow block the second synchronous request from succeeding (save perhaps for something like very short-lived locks to prevent dangerous concurrency between those synchronous requests).

-singleton is a property that really only matters at the synchronous scheduling time, so it's not an interesting or meaningful question to ask whether -singleton migrations in particular can be made to be more idempotent in the asynchronous stage.

On the other hand, -declarative is a broadly interesting property, and it is an interesting and meaningful question to ask whether -declarative migrations in particular can be made to be always idempotent in the synchronous stage.

A non-singleton -declarative migration is the best of both worlds (in terms of idempotence, at least), since it is trivially idempotent in both stages.

But -singleton -declarative migrations are not trivially idempotent in the synchronous stage, which was what I was originally harping on in my comments on #7825. It violates the definition of idempotent outlined above. The first synchronous request can succeed in queueing on some shards, but not on others. If this occurs, the -singleton retry will be unable to queue on the shards that succeeded the first time, meaning that the retry is guaranteed to fail from the POV of the client, who will be unable to retrieve a success response with a UUID.

(It's also not retry-able even if the first request fully succeeds. This is arguably less important than the partial-failure case, but is still necessary to be fully idempotent. And it's still important, because this can be critical in the case where a network error occurs between the client and the VTGate, preventing the client from receiving the success response + UUID from the first request)

From the description of this PR, it sounds like -combine-duplicate-ddl is meant to make non-declarative migrations semi-idempotent in the asynchronous stage, by deferring the de-duplication until the migration is finally attempted (which only happens when all previous migrations have been completed). If that's all it does, then it has no benefits for making -singleton -declarative migrations more idempotent.

That's not necessarily a problem. As described in this PR, -combine-duplicate-ddl sounds like a good idea, and a simple one to implement and maintain. I'll take a look at the code later. I think it's fine that it doesn't apply to -singleton -declarative, since I don't think -singleton -declarative should require a new flag in order to be made idempotent.

But I would still argue that -singleton -declarative should be made idempotent, and that making it idempotent is arguably more important than -combine-duplicate-ddl. There isn't necessarily an expectation that non-declarative migrations should be idempotent; it's more of a nice-to-have UX improvement. Whereas -declarative comes with an expectation of idempotence, and I don't think that -singleton -declarative should lessen that expectation.

There were some proposals in #7825 about how to handle -singleton -declarative. They had some extra features that aren't necessary, so I'm going to restate just the core of the proposal:

  • If a tablet has no running/scheduled migrations, -declarative is trivially schedule-able (-singleton or otherwise).
  • If the tablet has only one unique running/scheduled DDL (there might be multiple rows with multiple UUIDs, but they all have identical DDL text), and the tablet detects that the incoming -declarative migration (-singleton or otherwise) is exactly identical to that one DDL, it allows the new one to be enqueued even if it or one of the already queued migrations was a -singleton migration. This only applies if it is exactly identical and it is -declarative. If it isn't, normal -singleton rules apply.

This makes all -declarative and -singleton -declarative migrations fully idempotent.
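The core of that relaxed rule can be sketched in Go (all names here are hypothetical illustrations of the proposal, not Vitess code):

```go
package main

import "fmt"

// queuedDDL is a hypothetical stand-in for a queued migration.
type queuedDDL struct {
	sql       string
	singleton bool
}

// canEnqueue sketches the proposed relaxation: a -declarative submission
// that is textually identical to every queued DDL may enqueue even when
// -singleton is in play. Otherwise normal -singleton rules apply: any
// singleton involved blocks enqueueing onto a non-empty queue.
func canEnqueue(incomingSQL string, incomingDeclarative, incomingSingleton bool, queue []queuedDDL) bool {
	if len(queue) == 0 {
		return true // empty queue: trivially schedulable
	}
	identical := true
	queuedSingleton := false
	for _, q := range queue {
		if q.sql != incomingSQL {
			identical = false
		}
		if q.singleton {
			queuedSingleton = true
		}
	}
	if incomingDeclarative && identical {
		return true // exact duplicate of the single unique queued DDL
	}
	return !incomingSingleton && !queuedSingleton
}

func main() {
	queue := []queuedDDL{{sql: "ALTER TABLE t ADD COLUMN c INT", singleton: true}}
	// Identical -singleton -declarative retry is allowed:
	fmt.Println(canEnqueue("ALTER TABLE t ADD COLUMN c INT", true, true, queue)) // true
	// A different -singleton DDL is still blocked:
	fmt.Println(canEnqueue("ALTER TABLE t DROP COLUMN c", true, true, queue)) // false
}
```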

I also understand that this proposal wouldn't be part of this PR #8209 . But just wanted to clarify that I don't see -combine-duplicate-ddl as being the entirety of the answer to #7825 .

@shlomi-noach
Contributor Author

Doesn't mean this proposal doesn't address the concern I had about -singleton -declarative,

Correct, it doesn't. This PR is specifically targeted at users who do not run -declarative.

I completely understand your argument for -singleton -declarative. This PR does not address it. I thought our discussion on #7825 was about multiple things. I gather now that for you the most important discussion was around -singleton -declarative - and I think I lost that and focused on something else.

If the tablet has only one unique running/scheduled DDL (there might be multiple rows with multiple UUIDs, but they all have identical DDL text), and the tablet detects that the incoming -declarative migration (-singleton or otherwise) is exactly identical to that one DDL, it allows the new one to be enqueued even if it or one of the already queued migrations was a -singleton migration. This only applies if it is exactly identical and it is -declarative. If it isn't, normal -singleton rules apply.

I think this makes sense. I'll look into it in a different PR.

@shlomi-noach
Contributor Author

But just wanted to clarify that I don't see -combine-duplicate-ddl as being the entirety of the answer to #7825 .

I agree.

Contributor

@jmoldow jmoldow left a comment


Looked just at the main code, not the tests. Main code LGTM with some nitpick comments / questions.

}

// getLastCompletedMigrationOnTable returns the UUID of the last successful migration to complete on given table, if any
func (e *Executor) getLastCompletedMigrationOnTable(ctx context.Context, table string) (found bool, uuid string, err error) {
Contributor


Nitpick: function name is "last completed", but docstring says "last successful", which don't necessarily mean the same thing, depending on your terminology.

EDIT: Looks like your success status is named "complete". So I guess this function name is in line with the terminology you've already chosen. But I think the fact that you used the wording "last successful" in the docstring betrays the fact that "complete" might not be clear enough :P

pendingKeyspace := row["keyspace"].ToString()
pendingTable := row["mysql_table"].ToString()

if pendingKeyspace == e.keyspace && pendingTable == table {
Contributor


Here you are filtering in-memory, but for sqlSelectCompleteMigrationsOnTable you are filtering in the SQL query. Why the difference? Concerns about inefficient use of the MySQL index if we were to do this one in SQL?

Is there any max size for _vt.schema_migrations before we start to garbage collect old rows? Or is it only time based? Or does it grow forever without GC?

Contributor Author


Good question. There's no particular reasoning. I'm gonna refactor.

// This migration runs on some table. Let's see if the last migration to complete on the same table,
// has the exact same SQL statement as this one.
// If so, and because this migration is flagged with -combine-duplicate-ddl, we implicitly mark it as complete.
sameSQLasLastCompletedMigration := func() (same bool, completedUUID string, err error) {
Contributor


Nitpick: Should this function be lifted out to a top-level function, to make executeMigration shorter and easier to read?

Contributor Author


Yes! Refactoring

if len(pendingUUIDs) > 0 {
return false, "", nil
}
found, completedUUID, err := e.getLastCompletedMigrationOnTable(ctx, onlineDDL.Table)
Contributor


Here is where it would possibly be beneficial to name this method as "last successful".

If we interpret "completed" as "completed with any status, e.g. successful, failure, cancelled", then this code would seem to do the wrong thing, since we shouldn't skip the duplicate if the last migration wasn't successful.

Contributor Author


The state 'complete' is now set in stone, and it is synonymous with "successfully completed", as opposed to 'failed' or 'cancelled'. Sorry if the terminology is confusing, but there is only one interpretation of 'complete' in this context.

@shlomi-noach
Contributor Author

I'm still stalling on this PR because I'm not yet sure we've found the best way to address the issue of mass deployments. I don't want to introduce syntax that will turn into a backwards-compatibility liability. Giving this more time for thought.

@shlomi-noach
Contributor Author

@deepthi Together with #7825, let's keep this open for a while longer; we want to solve the problem, but we're not sure yet what the best way is.

@github-actions
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jul 17, 2022
@github-actions
Contributor

This PR was closed because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Jul 24, 2022
@shlomi-noach shlomi-noach reopened this Jul 24, 2022
@vitess-bot
Contributor

vitess-bot bot commented Jul 24, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@shlomi-noach shlomi-noach removed the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jul 24, 2022
@github-actions
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Aug 24, 2022
@github-actions
Contributor

This PR was closed because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Aug 31, 2022