Feature Request: VTOrc should change tablet type of tablets that have errant GTIDs on them #13872

GuptaManan100 · 2023-08-29T11:26:09Z

Feature Description

Description

VTOrc should change the type of any tablet that has errant GTIDs on it to DRAINED. This way we prevent these tablets from getting promoted down the line and breaking replication for the rest of the cluster.

Use Case(s)

If a tablet ends up with errant GTIDs (by whatever means), and we don't remove it from the topology, there is a slight chance that it ends up getting promoted. When that happens, replication breaks on all the other tablets, leading to downtime. Having VTOrc demote a tablet with errant GTIDs would fix this problem.
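For readers less familiar with the term, here is a minimal sketch of what "errant" means: transactions the replica has executed that the primary never did. The `GTIDSet` type and `errantGTIDs` function are hypothetical and deliberately simplified (a plain set of sequence numbers per server UUID); this is not how VTOrc or MySQL represent GTID sets internally. MySQL exposes the same kind of comparison as `GTID_SUBTRACT()`.

```go
package main

import "fmt"

// GTIDSet is a simplified, hypothetical stand-in for a MySQL GTID set:
// source server UUID -> set of transaction sequence numbers.
type GTIDSet map[string]map[int64]bool

// errantGTIDs returns every transaction present on the replica but
// absent from the primary, i.e. replica minus primary.
func errantGTIDs(replica, primary GTIDSet) GTIDSet {
	errant := GTIDSet{}
	for uuid, seqs := range replica {
		for seq := range seqs {
			// Reading a missing key from a nil inner map yields false.
			if primary[uuid][seq] {
				continue
			}
			if errant[uuid] == nil {
				errant[uuid] = map[int64]bool{}
			}
			errant[uuid][seq] = true
		}
	}
	return errant
}

func main() {
	primary := GTIDSet{"uuid-a": {1: true, 2: true, 3: true}}
	// The replica lags on uuid-a (fine) but has a local write under uuid-b (errant).
	replica := GTIDSet{"uuid-a": {1: true, 2: true}, "uuid-b": {1: true}}

	errant := errantGTIDs(replica, primary)
	fmt.Println("has errant GTIDs:", len(errant) > 0)
}
```

A replica that is merely behind the primary produces an empty result; only transactions the primary has never seen count as errant.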

shlomi-noach · 2023-08-30T05:11:07Z

The risk is with exhausting the entire replica fleet, so that you end up with no REPLICA or RDONLY server at all. I think you must never change the type of the last available replica, because:

- you'd end up having nothing to promote
- a single replica is promotable even if it has an errant GTID, because no one else is there to complain about it

GuptaManan100 · 2023-08-31T03:21:24Z

@shlomi-noach I don't know, I have mixed feelings about that too...

I agree it won't be ideal that we end up getting rid of all the REPLICA and RDONLY tablets. That can cause the primary to be stuck on semi-sync and essentially take down the entire cluster.
At the same time, I don't want to keep a REPLICA tablet with errant GTIDs around just because we have no other tablets. If we do end up promoting that REPLICA and the errant GTIDs are old enough that they have been purged, even the previous primary won't be able to replicate, and the cluster would essentially be in the same broken state as before. Also, the data from the errant GTIDs would show up in the customer's SELECT queries once we promote the REPLICA, and going back would be troublesome...

Is there a good way to handle these situations?
I am inclined to say that there are no fixed steps that VTOrc can take in these situations, because the remedy is going to be dependent on the situation.
So, what should we get VTOrc to do? @shlomi-noach @deepthi

shlomi-noach · 2023-08-31T04:41:58Z

> even the previous primary won't be able to replicate and the cluster would essentially be in the same broken state as before.

Unless you take a backup from this server and use it to seed the rest of the tablets.

I think that the suggested approach is super opinionated and that different OSS users will have different opinions. If you can make this configurable - that's good. I'd tell you that in a production environment, I'd prefer having proper alerting on errant GTID, along with tooling to fix the errant GTID, rather than have some automation purge replicas from my cluster to the point of leaving the PRIMARY all by itself. I feel like that's just too risky.

GuptaManan100 · 2023-09-04T05:58:57Z

Alright, I think that can be done.

I'll put this functionality of changing tablet type of tablets with errant GTIDs behind a flag. As far as alerting goes, we already have that, so I think just making this feature optional should be a good addition.

timvaillancourt · 2023-09-05T12:15:41Z

Great feature request! I'm expecting mixed feedback based on use cases here, but I'm adding my perspective below.

> I'd tell you that in a production environment, I'd prefer having proper alerting on errant GTID, along with tooling to fix the errant GTID, rather than have some automation purge replicas from my cluster to the point of leaving the PRIMARY all by itself. I feel like that's just too risky.

I (personally) agree with this statement 👍

> If a tablet ends up with errant GTID (by whatever way), and if we don't remove it from the topology, there is a slight chance that it can end up getting promoted. When that happens, it breaks the replication on all the other tablets, leading to down time.

Replication being broken is very bad, but having no REPLICAs to read from at all would probably be worse for a lot of production systems I've worked on. For apps that specifically target REPLICAs, having no available replicas would result in hard errors, which could have a higher impact than stale results (due to broken replication). I'd also argue that broken replication isn't necessarily "down time", as queries should still return (correct me if I'm wrong here).

> I think you must never change the type of the last available replica

I feel this approach (keep at least N x REPLICAs regardless of errant GTID) could be a happy medium between availability and consistency. That feels similar to the "min replicas" feature for replication lag: lagging replicas are ignored, but not to the point where there is no capacity left.
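To make the "keep at least N" idea concrete, here is a rough sketch in Go. Everything in it is hypothetical: the `Tablet` struct, the `tabletsToDrain` function and the `minReplicas` parameter are names made up for illustration, not VTOrc code or an existing setting. It only shows the guard: errant tablets are drained while doing so still leaves at least `minReplicas` replica-type tablets in place, and the rest are left alone.

```go
package main

import "fmt"

// Tablet is a hypothetical, minimal view of a replica-type tablet.
type Tablet struct {
	Alias         string
	HasErrantGTID bool
}

// tabletsToDrain returns the aliases of errant tablets that can be
// drained while keeping at least minReplicas tablets in the fleet,
// whether or not the remaining ones have errant GTIDs.
func tabletsToDrain(replicas []Tablet, minReplicas int) []string {
	// budget is how many tablets we can afford to take out of rotation.
	budget := len(replicas) - minReplicas
	var drain []string
	for _, t := range replicas {
		if budget <= 0 {
			break
		}
		if t.HasErrantGTID {
			drain = append(drain, t.Alias)
			budget--
		}
	}
	return drain
}

func main() {
	fleet := []Tablet{
		{Alias: "zone1-101", HasErrantGTID: true},
		{Alias: "zone1-102", HasErrantGTID: true},
	}
	// With minReplicas = 1, only one of the two errant tablets is drained.
	fmt.Println(tabletsToDrain(fleet, 1))
}
```

With `minReplicas` set to 1 this reduces to the "never change the type of the last available replica" rule quoted above; a larger value trades consistency for read capacity.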

shlomi-noach · 2023-09-05T14:22:49Z

@timvaillancourt: updating that in #13873, there's a new configurable behavior (default: false), --convert-tablets-with-errant-gtids. So it's an opt-in behavior.

GuptaManan100 added the labels "Type: Feature" and "Component: VTorc" (Vitess Orchestrator integration) on Aug 29, 2023

GuptaManan100 mentioned this issue Aug 29, 2023

VTOrc converts a tablet to DRAINED type if it detects errant GTIDs on it #13873

Merged

GuptaManan100 closed this as completed in #13873 Sep 12, 2023
