Overview of the Issue
When a source tablet's mysqld fails, the VReplication workflow never recovers: it keeps trying to replicate from the unhealthy tablet.
Secondly, when the tablet selection preference always chooses REPLICA tablets when available (the default value of --tablet_types="in_order:REPLICA,PRIMARY" does this) and none of the REPLICA tablets in the shard are healthy (each one has a down mysqld), we never attempt to use one of the secondary tablet types (in this case PRIMARY).
Reproduction Steps
Test case:
git checkout main && make build
cd examples/local
./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh
# Force base 10 so that bash does not read the zero-padded UID from the tablet alias as octal
tablet_uid=$((10#$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2))); mysqlctl --tablet_uid=${tablet_uid} shutdown
# see it never recover
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done
Note that stopping and starting the workflow in this test case also does not help:
vtctlclient Workflow -- customer.commerce2customer stop; vtctlclient Workflow -- customer.commerce2customer start
# see it still never recover
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done
This is a TabletPicker issue with two parts. First, because the tablet types are set to the default of --tablet_types="in_order:REPLICA,PRIMARY", we weed out all non-REPLICA tablets before we look at tablet health. Second, the only thing we test when considering tablet health is whether we can make a gRPC call to the tablet, NOT whether the tablet actually reports itself as healthy and serving.
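To illustrate the behavior this report is asking for, here is a minimal sketch of in-order selection that checks the serving state before settling on a tablet type. It is not the Vitess TabletPicker code: the `Tablet` and `TabletType` types, the `Serving` field, and the `pickInOrder` function are all hypothetical names invented for this example, and the tablet aliases are placeholders.

```go
package main

import "fmt"

// TabletType and Tablet are hypothetical, simplified stand-ins for
// illustration only; they are not the Vitess topo or TabletPicker types.
type TabletType string

const (
	Replica TabletType = "REPLICA"
	Primary TabletType = "PRIMARY"
)

type Tablet struct {
	Alias   string
	Type    TabletType
	Serving bool // assumed: the tablet reports itself as healthy and serving
}

// pickInOrder walks the preferred tablet types in order and returns the
// first tablet of the most-preferred type that is actually serving.
// Because health is checked per type, it falls through to the next type
// when every tablet of the preferred type is unhealthy.
func pickInOrder(tablets []Tablet, order []TabletType) (Tablet, bool) {
	for _, want := range order {
		for _, t := range tablets {
			if t.Type == want && t.Serving {
				return t, true
			}
		}
	}
	return Tablet{}, false
}

func main() {
	// The only REPLICA has a down mysqld, so the PRIMARY should be chosen.
	tablets := []Tablet{
		{Alias: "replica-a", Type: Replica, Serving: false},
		{Alias: "primary-a", Type: Primary, Serving: true},
	}
	t, ok := pickInOrder(tablets, []TabletType{Replica, Primary})
	fmt.Println(t.Alias, ok)
}
```

The point of the sketch is the ordering of the two checks: filtering by type first and only then testing reachability is what leaves the workflow stuck on an unhealthy REPLICA, whereas checking the serving state inside the per-type loop allows the fallback to PRIMARY.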
Binary Version
Version: 18.0.0-SNAPSHOT (Git revision 98918326587815d8e934711b817fd10630643772 branch 'main') built on Mon Jul 17 16:02:39 EDT 2023 by matt@pslord.local using go1.20.5 darwin/arm64
Operating System and Environment details
N/A
Log Fragments
N/A