Bug Report: VReplication workflows don't auto recover when source tablet's mysqld fails #13519

Closed
mattlord opened this issue Jul 17, 2023 · 0 comments · Fixed by #13582

mattlord commented Jul 17, 2023

Overview of the Issue

When a source tablet's mysqld fails, the VReplication workflow never recovers: it keeps trying to replicate from the unhealthy tablet.

Second, when the tablet selection preference always chooses e.g. REPLICA tablets when available (the default value of --tablet_types="in_order:REPLICA,PRIMARY" does this) and none of the REPLICA tablets in the shard are healthy (each one has a down mysqld), we never attempt to use one of the fallback tablet types (in this case PRIMARY).

Reproduction Steps

Test case:

git checkout main && make build

cd examples/local

./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh

# 10# forces base 10 so that bash does not parse the zero-padded UID as octal
tablet_uid=$((10#$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2))); mysqlctl --tablet_uid=${tablet_uid} shutdown

# observe that it never recovers
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

Note that stopping and starting the workflow in this test case also does not help:

vtctlclient Workflow -- customer.commerce2customer stop; vtctlclient Workflow -- customer.commerce2customer start

# observe that it still never recovers
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

This is a TabletPicker issue. Because the tablet types are set to the default of --tablet_types="in_order:REPLICA,PRIMARY", we weed out all non-REPLICA tablets before we look at tablet health. On top of that, the only thing we test when considering tablet health is whether we can make a gRPC call to the tablet, NOT whether the tablet actually reports itself as healthy and serving.
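To make the ordering problem concrete, here is a minimal Go sketch. It is not the actual TabletPicker code and not necessarily how the fix was implemented. The `Tablet` struct, its `Reachable` and `Serving` fields, and both picker functions are hypothetical names invented for illustration. The first function settles on a tablet type before looking at health and treats gRPC reachability as the only health signal. The second requires a tablet to be reachable and serving, and falls through to the next preferred type otherwise.

```go
package main

import "fmt"

// Tablet is a hypothetical, simplified stand-in for a tablet record.
type Tablet struct {
	Alias     string
	Type      string // e.g. "PRIMARY", "REPLICA"
	Reachable bool   // assumption: a gRPC call to the tablet succeeds
	Serving   bool   // assumption: the tablet reports itself healthy and serving
}

// pickTypeFirst sketches the behavior described in this issue: the first
// preferred type that has any tablets wins, and only reachability is checked.
func pickTypeFirst(tablets []Tablet, inOrder []string) (Tablet, bool) {
	for _, want := range inOrder {
		var candidates []Tablet
		for _, t := range tablets {
			if t.Type == want {
				candidates = append(candidates, t)
			}
		}
		if len(candidates) == 0 {
			continue // no tablets of this type exist, so try the next type
		}
		// The type is settled before health is considered, so later
		// types in the preference list are never tried.
		for _, t := range candidates {
			if t.Reachable {
				return t, true
			}
		}
		return Tablet{}, false
	}
	return Tablet{}, false
}

// pickHealthFirst sketches one possible alternative: a tablet qualifies only
// if it is reachable and serving, and the search falls through to the next
// preferred type when no tablet of the current type qualifies.
func pickHealthFirst(tablets []Tablet, inOrder []string) (Tablet, bool) {
	for _, want := range inOrder {
		for _, t := range tablets {
			if t.Type == want && t.Reachable && t.Serving {
				return t, true
			}
		}
	}
	return Tablet{}, false
}

func main() {
	// The REPLICA's mysqld is down: still reachable over gRPC, but not serving.
	tablets := []Tablet{
		{Alias: "zone1-0000000100", Type: "PRIMARY", Reachable: true, Serving: true},
		{Alias: "zone1-0000000101", Type: "REPLICA", Reachable: true, Serving: false},
	}
	order := []string{"REPLICA", "PRIMARY"}

	stuck, _ := pickTypeFirst(tablets, order)
	recovered, _ := pickHealthFirst(tablets, order)
	fmt.Println("type first:  ", stuck.Alias)
	fmt.Println("health first:", recovered.Alias)
}
```

With the sample data in `main`, the type-first picker keeps returning the non-serving REPLICA, which matches the stuck workflow in the reproduction, while the health-first picker falls back to the PRIMARY.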

Binary Version

Version: 18.0.0-SNAPSHOT (Git revision 98918326587815d8e934711b817fd10630643772 branch 'main') built on Mon Jul 17 16:02:39 EDT 2023 by matt@pslord.local using go1.20.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

N/A