consumer: ABA concurrency between shard watchdog & transition handler #314

jgraettinger · 2022-01-28T19:39:33Z

We (Estuary) run a shard watchdog job in our cluster which periodically removes the assignments of FAILED shards.

Here's a scenario we recently observed:

1. A shard failed and was marked as FAILED.
2. Some time later, the shard-watchdog job noticed it was FAILED and removed its assignment.
3. The consumer coordinator noticed the assignment removal and immediately re-assigned the shard to the same pod where it had been running.
4. The pod's Etcd keyspace watch loop (which has a deliberate Nagle-like delay of ~30ms) never observed that the assignment was removed and re-created. Specifically, no watch loop iteration ran while the key was absent, so the pod could not act on the removal; by the time the loop did run, the key already existed again.
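
To make the failure mode concrete, here is a minimal sketch of why an existence-only comparison misses the event. The `snapshot` type, the `removedKeys` helper, and the key name are hypothetical and are not taken from the gazette code; the sketch only assumes that the delayed watch loop compares the keys it saw before and after the delay.

```go
package main

import "fmt"

// snapshot is a hypothetical, simplified view of the assignment keys a pod
// observes at one watch-loop iteration (key -> present).
type snapshot map[string]bool

// removedKeys reports keys present in prev but absent in next.
// This existence-only comparison is an assumption made for illustration.
func removedKeys(prev, next snapshot) []string {
	var out []string
	for k := range prev {
		if !next[k] {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	before := snapshot{"shard-a/pod-1": true}

	// Between two iterations the watchdog deleted the key and the coordinator
	// re-created it. The delayed loop only sees the end state.
	after := snapshot{"shard-a/pod-1": true}

	fmt.Println(len(removedKeys(before, after))) // prints 0: the removal is invisible
}
```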

Solution: The resolver's updateLocalShards function needs to account for the assignment's Etcd creation revision. If the prior and current creation revisions differ, that's an implicit delete-then-create operation.
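
A rough sketch of that check, not the actual updateLocalShards implementation: the `assignment` struct and `wasRecreated` helper below are hypothetical, and `CreateRevision` stands in for the create revision Etcd records on each key, which changes when a key is deleted and created again but not when it is merely modified.

```go
package main

import "fmt"

// assignment is a hypothetical stand-in for an assignment key as seen in the
// pod's keyspace. CreateRevision mirrors Etcd's per-key create revision.
type assignment struct {
	Key            string
	CreateRevision int64
	ModRevision    int64
}

// wasRecreated reports whether next is a re-creation of prev: the same key,
// but with a different creation revision. Illustrative helper only.
func wasRecreated(prev, next assignment) bool {
	return prev.Key == next.Key && prev.CreateRevision != next.CreateRevision
}

func main() {
	prev := assignment{Key: "shard-a/pod-1", CreateRevision: 100, ModRevision: 120}

	// The key was updated in place: same creation revision.
	updated := assignment{Key: "shard-a/pod-1", CreateRevision: 100, ModRevision: 130}

	// The key was deleted and re-created between watch iterations.
	recreated := assignment{Key: "shard-a/pod-1", CreateRevision: 140, ModRevision: 140}

	fmt.Println(wasRecreated(prev, updated))   // prints false
	fmt.Println(wasRecreated(prev, recreated)) // prints true
}
```

When the check reports a re-creation, the caller would presumably treat it the same as an observed removal followed by a fresh assignment.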

Fixes gazette/core#314

jgraettinger added the bug label Jan 28, 2022

jgraettinger mentioned this issue Feb 2, 2022

consumer: shard transitions account for delete-then-create race #315

Merged

jgraettinger added a commit to estuary/flow that referenced this issue Feb 2, 2022

go.mod: bump gazette pin to bring in shard creation race fix

6548c2b

Fixes gazette/core#314

jgraettinger added a commit to estuary/flow that referenced this issue Feb 2, 2022

go.mod: bump gazette pin to bring in shard creation race fix

904fd86

Fixes gazette/core#314

jgraettinger mentioned this issue Feb 2, 2022

runtime: fix five bugs estuary/flow#354

Merged

jgraettinger closed this as completed in 0087160 Feb 2, 2022

jgraettinger added a commit to estuary/flow that referenced this issue Feb 3, 2022

go.mod: bump gazette pin to bring in shard creation race fix

0a4aeb5

Fixes gazette/core#314
