Shard state transitions should be edge-triggered rather than level-triggered #82185
Labels
:Distributed Coordination/Allocation
All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>enhancement
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
>tech debt
Today when each node receives a cluster state update it compares the states of its shards in the new routing table to their expected states, and triggers a
shard-started
orshard-failed
transition if they don't match. We then capture the transition and suppress it if a duplicate request is already in flight (#31313 forshard-failed
transitions, #82089 forshard-started
ones).This is pretty ugly. These transitions may be a long way down the master's queue so we may trigger (and then suppress) many duplicate requests. I think the reasons for this mechanism date back to a time when cluster state updates could occasionally be lost, but these problems are fixed today so we should move to a system that triggers the state update request only at the shard state transition and then relies on the fact that this request will eventually complete (possibly unsuccessfully, requiring a retry).
The text was updated successfully, but these errors were encountered: