Fix `RemoveTablet` during `TabletExternallyReparented` causing connection issues #16371
…o tablets are marked as primary. Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
This ensures that we can never end up in a situation where there is more than one healthy primary tablet for a given shard. Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
    @@            Coverage Diff             @@
    ##             main   #16371      +/-   ##
    ==========================================
    + Coverage   68.69%   69.05%   +0.35%
    ==========================================
      Files        1547     1556       +9
      Lines      198297   202409    +4112
    ==========================================
    + Hits       136228   139781    +3553
    - Misses      62069    62628     +559
Added the code that fixes this issue, and updated the PR body description accordingly.
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
    <-resultChan
    <-resultChan

    hc.RemoveTablet(thirdTablet)
Why does removing an unrelated tablet trigger the bug?
Oh, I see this:
> When `RemoveTablet` was called, the list of healthy tablets was recomputed, but the logic did not account for the fact that there can only be a single entry in the list of healthy primary tablets.
go/vt/discovery/healthcheck_test.go
Outdated
    hc.RemoveTablet(thirdTablet)

    // tablet 1 is the primary now
This comment is wrong, it should be:
    - // tablet 1 is the primary now
    + // `secondTablet` should be the primary now
go/vt/discovery/healthcheck.go
Outdated
    hc.recomputeHealthy(key)
    if tabletType == topodata.TabletType_PRIMARY {
    	// If tablet type is primary, we should only have one tablet in the healthy list.
    	hc.recomputeHealthyPrimary(key)
This is an interesting bug. We have safeguards in `updateHealth` which make sure that there is exactly one element in `healthy` for the `PRIMARY` tablet type, but the recomputation here had no such checks.
While the fix seems correct, what do you think about pushing it into `recomputeHealthy` instead of creating a new function call here?
Right now this is the only call site where `recomputeHealthy` is called for the `PRIMARY` tablet type, but it seems more compact and future-proof to change `recomputeHealthy` to maintain the invariant.
> While the fix seems correct, what do you think about pushing it into `recomputeHealthy` instead of creating a new function call here?
I feel like having two separate methods helps highlight the difference between the recomputation logic for primaries vs. any other replica type more clearly.
I was thinking that having a `recomputeHealthy` method that then calls out to either `recomputeHealthyReplicas` or `recomputeHealthyPrimaries` would probably be best for clarity (and the Go compiler should just inline all of these calls anyway, so it would not be a performance issue either).
But then I noticed that pushing the tablet type check down into `recomputeHealthy` would require passing down both `key` and `tabletType` as arguments, which seems redundant because `key` already contains the tablet type, just as a string. Not passing down `key` would require passing down other arguments to be able to build the key, and then things start to become messy.
As I mentioned in Slack, I'm very interested in giving the healthcheck code a polishing pass separately from this fix, so I'd suggest we punt on this for now. 🙇‍♂️ What do you think?
I can see that instead of passing the key, you would now have to pass the target. When this was written, it just seemed convenient to use the key because it had already been computed. So we have
    hc.recomputeHealthy(targetKey)
However, later on, we do a key computation for the second call:
    oldTargetKey := KeyFromTarget(prevTarget)
    hc.recomputeHealthy(oldTargetKey)
It does not seem too invasive to change `recomputeHealthy` to be
    func (hc *HealthCheckImpl) recomputeHealthy(target *query.Target) {
    	key := KeyFromTarget(target)
    	if target.TabletType == ..._PRIMARY {
    		...
    	}
    }
…municate the intent. Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
…/fix-delete-tablet-inconsistency Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Love it. Nice targeted fix to a long-standing issue.
@arthurschreiber can you edit the description to match the final implementation?
Thanks, @arthurschreiber! I only had a few minor questions and a nit.
I'll let you adjust the title and merge.
    // cell or cell alias. It also performs filtering of tablets based on replication lag,
    // if configured to do so.
    //
    // This should not be called for primary tablets.
Should we not allow it?
Unfortunately, `key` is a string that concatenates a bunch of values, so checking whether the type is `PRIMARY` is not straightforward.
@deepthi and I discussed various options for refactoring this, but I feel like that shouldn't be part of this PR.
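To make the point about the string key concrete, here is a minimal sketch. The `Target` struct, the `keyFromTarget` helper, the `.` separator, and the lowercase type names are all assumptions made for illustration; the real key format may differ. The sketch only shows why a check on a typed target is a plain field comparison, while a check on the flattened key has to depend on the key's string layout.

```go
package main

import (
	"fmt"
	"strings"
)

// Target is a hypothetical, simplified stand-in for a query target.
type Target struct {
	Keyspace   string
	Shard      string
	TabletType string
}

// keyFromTarget flattens a target into a string key. The "." separator and
// the field order are assumptions made for this sketch.
func keyFromTarget(tgt Target) string {
	return strings.Join([]string{tgt.Keyspace, tgt.Shard, tgt.TabletType}, ".")
}

// isPrimaryTarget is a simple field comparison on the typed value.
func isPrimaryTarget(tgt Target) bool {
	return tgt.TabletType == "primary"
}

// isPrimaryKey has to rely on how the key string happens to be laid out.
func isPrimaryKey(key string) bool {
	return strings.HasSuffix(key, ".primary")
}

func main() {
	tgt := Target{Keyspace: "commerce", Shard: "-80", TabletType: "primary"}
	key := keyFromTarget(tgt)
	fmt.Println(key)                  // prints commerce.-80.primary
	fmt.Println(isPrimaryTarget(tgt)) // prints true
	fmt.Println(isPrimaryKey(key))    // prints true
}
```

The string-based check works only as long as the key format never changes, which is one reason a refactoring toward passing the target itself was discussed.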
    // clear the healthy list for the primary.
    //
    // See the logic in `updateHealth` for more details.
    alias := tabletAliasString(topoproto.TabletAliasString(healthy[0].Tablet.Alias))
Should we check and handle cases where healthy has a length > 1 just to be safe? I'm not sure, but the thought came up.
I think maintaining the invariant that `healthy` only ever contains a single entry for `PRIMARY` (and verifying this in the tests) seems "good enough" to me? I'm also not sure what the correct behaviour would be if we detect more than one entry in the list, because that means the invariant is broken and all guarantees are invalid.
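One way a test could verify this invariant is with a small helper like the one below. This is only a sketch: `checkPrimaryInvariant` and the simplified `TabletHealth` type are hypothetical names, and the PR's actual tests may assert the condition differently. The helper reports an error rather than trying to repair the list, in line with the point that there is no obviously correct recovery once the invariant is broken.

```go
package main

import "fmt"

// TabletHealth is a hypothetical, simplified stand-in for a health record.
type TabletHealth struct {
	Alias string
}

// checkPrimaryInvariant returns an error if the healthy list for a primary
// key holds more than one entry. It does not attempt any repair.
func checkPrimaryInvariant(key string, healthy []*TabletHealth) error {
	if len(healthy) > 1 {
		return fmt.Errorf("invariant violated for %s: %d healthy primaries", key, len(healthy))
	}
	return nil
}

func main() {
	one := []*TabletHealth{{Alias: "zone1-101"}}
	two := []*TabletHealth{{Alias: "zone1-100"}, {Alias: "zone1-101"}}

	fmt.Println(checkPrimaryInvariant("ks.-80.primary", one)) // prints <nil>
	fmt.Println(checkPrimaryInvariant("ks.-80.primary", two)) // prints invariant violated for ks.-80.primary: 2 healthy primaries
}
```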
    // See the logic in `updateHealth` for more details.
    alias := tabletAliasString(topoproto.TabletAliasString(healthy[0].Tablet.Alias))
    if alias == tabletAlias {
    	hc.healthy[key] = []*TabletHealth{}
I'm curious why we do this rather than removing the key? Not implying it's somehow wrong. 🙂
Just mirroring what happens in `updateHealth`. 🤷
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
…tion issues (#16371) Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
…tion issues (vitessio#16371) Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
Description
This pull request fixes an issue in the health checking logic where an external primary failover in a keyspace/shard can cause two primaries to be marked as healthy. This can occur when a tablet is removed during an external reparenting operation.
This bug can lead to queries being sent to the old primary or, if the demoted primary's vttablet process is restarted, to `vttablet: Connection closed` errors (which are mostly invisible because queries will be retried automatically on the "actual" primary).
When `RemoveTablet` was called, the list of healthy tablets was recomputed, but the logic did not account for the fact that there can only be a single entry in the list of healthy primary tablets. This PR changes the logic in `deleteTablet` to mirror the special handling for `PRIMARY` tablets that also exists in `updateHealth`. This ensures that we can never end up in a situation where there is more than one healthy primary tablet for a given shard.
Related Issue(s)
Fixes: #16373
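The shape of the fix described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual Vitess code: the `healthCheck` struct, its two maps, the `removeTablet` method, and the `isPrimary` flag are hypothetical stand-ins for the real data structures. The idea it illustrates is the one from the description: for the primary key, the healthy list is never rebuilt from all records; it is only cleared when the removed tablet is the current healthy primary.

```go
package main

import "fmt"

// TabletHealth is a hypothetical, simplified stand-in for a health record.
type TabletHealth struct {
	Alias   string
	Serving bool
}

// healthCheck is a hypothetical, reduced model of the health check state.
type healthCheck struct {
	healthData map[string]map[string]*TabletHealth // key -> alias -> record
	healthy    map[string][]*TabletHealth          // key -> healthy list
}

// removeTablet drops a tablet's record and updates the healthy list for key.
func (hc *healthCheck) removeTablet(key, alias string, isPrimary bool) {
	delete(hc.healthData[key], alias)

	healthy := hc.healthy[key]
	if len(healthy) == 0 {
		return
	}
	if isPrimary {
		// Primary key: at most one healthy entry is allowed, so only clear
		// the list if the removed tablet is that entry. Never rebuild it.
		if healthy[0].Alias == alias {
			hc.healthy[key] = []*TabletHealth{}
		}
		return
	}
	// Any other tablet type: rebuild the list from the serving records.
	rebuilt := make([]*TabletHealth, 0, len(hc.healthData[key]))
	for _, th := range hc.healthData[key] {
		if th.Serving {
			rebuilt = append(rebuilt, th)
		}
	}
	hc.healthy[key] = rebuilt
}

func main() {
	oldPrimary := &TabletHealth{Alias: "zone1-100", Serving: true}
	newPrimary := &TabletHealth{Alias: "zone1-101", Serving: true}
	hc := &healthCheck{
		healthData: map[string]map[string]*TabletHealth{
			"primary": {"zone1-100": oldPrimary, "zone1-101": newPrimary},
		},
		healthy: map[string][]*TabletHealth{"primary": {newPrimary}},
	}

	// Removing an unrelated third tablet must not resurrect the old primary.
	hc.removeTablet("primary", "zone1-102", true)
	fmt.Println(len(hc.healthy["primary"]), hc.healthy["primary"][0].Alias) // prints 1 zone1-101
}
```

Clearing the list to an empty slice instead of deleting the map entry follows the choice discussed in the review thread, where it is described as mirroring `updateHealth`.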
Checklist
Deployment Notes