Bug Report: VReplication workflows don't auto recover when source tablet's mysqld fails #13519

mattlord · 2023-07-17T20:41:09Z

Overview of the Issue

When a source tablet's mysqld fails, the VReplication workflow never recovers; it keeps trying to replicate from the unhealthy tablet.

Secondly, consider a tablet selection preference where REPLICA tablets are always chosen when available (the default value of --tablet_types="in_order:REPLICA,PRIMARY" does this). If none of the REPLICA tablets in the shard are healthy (each one has a down mysqld), we never attempt to use the next tablet type in the list (in this case PRIMARY).

Reproduction Steps

Test case:

git checkout main && make build

cd examples/local

./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh

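# Find the commerce REPLICA's tablet UID and shut down its mysqld. The 10# prefix forces base 10 so the zero-padded UID is not read as octal in bash.
tablet_uid=$((10#$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2))); mysqlctl --tablet_uid=${tablet_uid} shutdown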

# watch the workflow status: it never recovers
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

Note that stopping and starting the workflow in this test case also does not help:

vtctlclient Workflow -- customer.commerce2customer stop; vtctlclient Workflow -- customer.commerce2customer start

# watch the workflow status again: it still never recovers
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

This is a TabletPicker issue with two parts. First, we weed out all non-REPLICA tablets (because the tablet types are set to the default of --tablet_types="in_order:REPLICA,PRIMARY") before we look at tablet health. Second, the only thing we test when considering tablet health is whether we can make a gRPC call to the tablet, NOT whether the tablet actually reports itself as healthy and serving.
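To make the two failure modes concrete, here is a minimal Go sketch. It is an illustrative model of the behavior described in this report, not the actual Vitess TabletPicker code: the Tablet struct, its Reachable and Serving fields, the tablet aliases, and both function names are hypothetical. The second function shows one possible shape of a health-aware, in-order picker and is not necessarily how the fix was implemented.

```go
package main

import "fmt"

// Tablet is a hypothetical, simplified tablet record (not a Vitess type).
type Tablet struct {
	Alias     string
	Type      string // "PRIMARY", "REPLICA", ...
	Reachable bool   // a gRPC call to the tablet succeeds
	Serving   bool   // the tablet reports itself healthy and serving
}

// pickReachableOnly models the reported behavior: narrow the candidates to
// the first preferred type that has any tablets, then accept a tablet as
// soon as it is reachable, even if it is not serving.
func pickReachableOnly(tablets []Tablet, inOrder []string) (Tablet, bool) {
	for _, want := range inOrder {
		found := false
		for _, t := range tablets {
			if t.Type != want {
				continue
			}
			found = true
			if t.Reachable {
				return t, true
			}
		}
		if found {
			// Tablets of this type exist, so later types are never tried.
			return Tablet{}, false
		}
	}
	return Tablet{}, false
}

// pickServing requires a tablet to be reachable AND serving, and falls
// through to the next tablet type when the preferred type has no such tablet.
func pickServing(tablets []Tablet, inOrder []string) (Tablet, bool) {
	for _, want := range inOrder {
		for _, t := range tablets {
			if t.Type == want && t.Reachable && t.Serving {
				return t, true
			}
		}
	}
	return Tablet{}, false
}

func main() {
	tablets := []Tablet{
		{Alias: "zone1-100", Type: "PRIMARY", Reachable: true, Serving: true},
		// vttablet still answers gRPC, but its mysqld is down.
		{Alias: "zone1-101", Type: "REPLICA", Reachable: true, Serving: false},
	}
	order := []string{"REPLICA", "PRIMARY"}

	a, _ := pickReachableOnly(tablets, order)
	b, _ := pickServing(tablets, order)
	fmt.Println(a.Alias, b.Alias) // prints "zone1-101 zone1-100"
}
```

In this model the reachability-only picker keeps returning the REPLICA whose mysqld is down, which matches the workflow never recovering, while the serving-aware picker moves on to the PRIMARY.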

Binary Version

Version: 18.0.0-SNAPSHOT (Git revision 98918326587815d8e934711b817fd10630643772 branch 'main') built on Mon Jul 17 16:02:39 EDT 2023 by matt@pslord.local using go1.20.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

N/A


mattlord added Type: Bug Component: VReplication labels Jul 17, 2023

mattlord self-assigned this Jul 17, 2023

mattlord closed this as completed Jul 18, 2023

mattlord reopened this Jul 18, 2023

mattlord mentioned this issue Jul 24, 2023 in VReplication: Make Source Tablet Selection More Robust #13582 (merged)

mattlord closed this as completed in #13582 Jul 27, 2023
