We (Estuary) run a shard watchdog job in our cluster which removes assignments for FAILED shards periodically.
Here's a scenario we recently observed:
1. A shard failed and was marked as FAILED.
2. Some time later, the shard-watchdog job noticed the FAILED status and removed the assignment.
3. The consumer coordinator noticed the removal and immediately re-created the assignment on the same pod where the shard had been running.
4. The pod's Etcd keyspace watch loop (which has a deliberate Nagle-like delay of ~30ms) never observed the removal and re-creation as separate events. There was no watch loop iteration in which the key was absent, so the pod could not take action on the delete; by the time the loop ran, the key already existed again.
Solution: The resolver's updateLocalShards function needs to account for the assignment's Etcd creation revision. If the prior and current creation revisions differ, the assignment was implicitly deleted and then re-created, and should be handled that way.
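
A minimal sketch of that comparison, assuming a simplified model: the `assignment` type, the `action` values, and `diffAssignment` are hypothetical names for illustration and are not the resolver's actual code. The only fact relied on is that Etcd records a creation revision for every key, which changes when the key is deleted and re-created.

```go
package main

import "fmt"

// assignment is a hypothetical, simplified view of a shard assignment key.
type assignment struct {
	ShardID        string
	CreateRevision int64 // Etcd creation revision of the assignment key.
}

// action is a hypothetical description of what the local pod should do.
type action int

const (
	actionNone    action = iota // Assignment unchanged.
	actionStart                 // Assignment newly appeared.
	actionStop                  // Assignment disappeared.
	actionRestart               // Assignment was deleted and re-created between observations.
)

// diffAssignment compares the assignment seen on the previous watch loop
// iteration (prior) with the one seen now (current). A nil value means the
// key was absent in that iteration.
func diffAssignment(prior, current *assignment) action {
	switch {
	case prior == nil && current == nil:
		return actionNone
	case prior == nil:
		return actionStart
	case current == nil:
		return actionStop
	case prior.CreateRevision != current.CreateRevision:
		// The key exists in both iterations, but it is not the same key
		// instance: treat this as an implicit delete-then-create.
		return actionRestart
	default:
		return actionNone
	}
}

func main() {
	prior := &assignment{ShardID: "shard-a", CreateRevision: 100}
	recreated := &assignment{ShardID: "shard-a", CreateRevision: 105}

	fmt.Println(diffAssignment(prior, recreated) == actionRestart)
	fmt.Println(diffAssignment(prior, prior) == actionNone)
}
```

A check based only on key presence would return `actionNone` for the re-created assignment above, which matches the behavior observed in the scenario.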