consumer: ABA concurrency between shard watchdog & transition handler #314

Closed
jgraettinger opened this issue Jan 28, 2022 · 0 comments · Fixed by #315

We (Estuary) run a shard watchdog job in our cluster which periodically removes the assignments of FAILED shards.

Here's a scenario we recently observed:

  • A shard failed and was marked as FAILED.
  • Some time later, the shard-watchdog job noticed it was FAILED and removed the assignment.
  • The consumer coordinator noticed the assignment removal and immediately re-assigned the shard to the same pod where it had been running.
  • The pod's Etcd keyspace watch loop (which has a deliberate Nagle-like delay of ~30ms) didn't notice that the assignment had been removed and re-created. No watch loop iteration ran while the key was absent, so the loop never had a chance to act on the removal. By the time it did run, the key already existed again.

Solution: The resolver's updateLocalShards function needs to account for the assignment's Etcd creation revision. If the prior and current creation revisions differ, the assignment was implicitly deleted and then re-created, and should be handled that way.
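
The check could look roughly like the sketch below. This is an illustration only, not the actual updateLocalShards code or the change that fixed this issue: the `assignment` type, the `action` constants, and `diffAssignments` are all hypothetical names. The only real concept assumed is that each Etcd key carries a creation revision, and that this revision changes when the key is deleted and created again.

```go
package main

// assignment is a hypothetical, simplified view of a shard assignment key
// as seen by one iteration of the keyspace watch loop.
type assignment struct {
	ShardID        string
	CreateRevision int64 // Etcd revision at which the key was created.
}

// action is a hypothetical description of what the local pod should do.
type action int

const (
	actionStart   action = iota + 1 // Key is new: start the shard.
	actionStop                      // Key is gone: stop the shard.
	actionRestart                   // Key was deleted and re-created between iterations.
)

// diffAssignments compares the assignments seen by the prior watch loop
// iteration against the current ones. Shards which are unchanged are
// omitted from the result.
func diffAssignments(prior, current map[string]assignment) map[string]action {
	var out = make(map[string]action)

	for id, p := range prior {
		c, ok := current[id]
		switch {
		case !ok:
			out[id] = actionStop
		case c.CreateRevision != p.CreateRevision:
			// Same key, different creation revision: an implicit
			// delete-then-create which no iteration observed directly.
			out[id] = actionRestart
		}
	}
	for id := range current {
		if _, ok := prior[id]; !ok {
			out[id] = actionStart
		}
	}
	return out
}
```

Comparing only key presence would report no change for an assignment that was removed and re-created within one ~30ms window. Comparing creation revisions makes that case visible as a restart.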
