[Segment Replication & Remote Translog] Back-pressure and Recovery for lagging replica copies #4478
@Bukhtawar Have put together a POC that applies some backpressure as replicas fall behind, based on some arbitrary doc count limit. Would like your thoughts.

At a high level, we can update the write path to return data from each replica on its current status. I'm leaning toward a seqNo indicating a replica's "visible" checkpoint. The primary can then store this data and use it to determine if backpressure should be applied, based on some configurable setting. We also include another setting to configure when pressure is applied based on the percentage of replicas that are 'stale'. I'm treating a replica as 'stale' if its latest visible seqNo, or, if it is actively syncing, the seqNo it is syncing to, is more than n checkpoints behind the primary. The primary then updates its local state after each copy event completes. I prefer n checkpoints behind to an arbitrary doc count, given that the size of checkpoints can vary.

Some additional thoughts:
If going the seqNo route, some detailed steps/tasks: update IndexShard's ReplicationTracker to store what I'm calling a "visibleCheckpoint" to capture the latest seqNo that a replica has received.
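To make the idea concrete, a rough sketch of the primary-side bookkeeping this would amount to. The class and method names below are placeholders for illustration, not the actual ReplicationTracker API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical primary-side bookkeeping for segment replication lag.
 * Each replica reports the latest seqNo it has made "visible" (i.e. the
 * checkpoint it has finished syncing to) on the write response, and the
 * primary records it here after each copy event completes.
 */
class VisibleCheckpointTracker {

    private final Map<String, Long> visibleSeqNoPerReplica = new ConcurrentHashMap<>();

    /** Called when a replica acks a write or finishes a segment copy event. */
    void updateVisibleCheckpoint(String replicaAllocationId, long visibleSeqNo) {
        visibleSeqNoPerReplica.merge(replicaAllocationId, visibleSeqNo, Math::max);
    }

    /** seqNo distance between the primary's latest refreshed checkpoint and a replica. */
    long seqNoLag(String replicaAllocationId, long primaryCheckpointSeqNo) {
        long visible = visibleSeqNoPerReplica.getOrDefault(replicaAllocationId, -1L);
        return Math.max(0L, primaryCheckpointSeqNo - visible);
    }
}
```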
I've put up #6520 to start capturing metrics we can use for applying pressure. This includes checkpoints behind, bytes behind, and some average replication times. I'm leaning toward a combination of breaching thresholds for all three.

I'm debating whether we should apply pressure globally to a node or to individual replication groups. I think rejecting on the entire node could be useful if driven by the sum of current bytes replicas are behind, given all of them will consume shared resources to copy segments. Though this would mean heavier indices would impact those that are lighter. I'm leaning toward rejecting within a replication group, given that the purpose of rejection is to allow replicas to catch up, and relying on existing throttle settings & index backpressure to preserve node resources. @dreamer-89, @Bukhtawar curious what you two think here.
Am thinking of a simple algorithm for applying pressure in SR based on replica state.
I think this is a reasonable best effort to prevent replicas from falling too far behind until we have a streaming API where we can control our ingestion rate based on these metrics.
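For illustration, a minimal sketch of what such a check could boil down to per replication group, assuming hypothetical thresholds and stat names rather than the actual settings:

```java
import java.util.List;

/**
 * Illustrative check (not the actual implementation) combining the three lag
 * signals mentioned above: checkpoints behind, bytes behind, and recent average
 * replication time. All thresholds and names here are assumptions.
 */
class SegmentReplicationPressure {

    // Hypothetical thresholds; in practice these would be dynamic cluster/index settings.
    private final int maxCheckpointsBehind = 4;
    private final long maxBytesBehind = 512L * 1024 * 1024; // 512 MB
    private final double staleReplicaFractionLimit = 0.5;   // fraction of replicas that must be stale

    /** A replica counts as stale only if it breaches all three lag signals. */
    boolean isStale(ReplicaLagStats stats) {
        return stats.checkpointsBehind() > maxCheckpointsBehind
            && stats.bytesBehind() > maxBytesBehind
            && stats.avgReplicationTimeMillis() > stats.avgCheckpointIntervalMillis();
    }

    /** Reject new writes for this replication group once enough replicas are stale. */
    boolean shouldRejectWrites(List<ReplicaLagStats> replicas) {
        if (replicas.isEmpty()) {
            return false;
        }
        long stale = replicas.stream().filter(this::isStale).count();
        return (double) stale / replicas.size() >= staleReplicaFractionLimit;
    }

    /** Per-replica lag stats, roughly along the lines of what #6520 captures. */
    record ReplicaLagStats(int checkpointsBehind,
                           long bytesBehind,
                           long avgReplicationTimeMillis,
                           long avgCheckpointIntervalMillis) {
    }
}
```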
From what I understand, we are starting with failing the shard first rather than applying backpressure, i.e. disallowing primaries from taking in more write requests and allowing lagging replicas a cool-off. If backpressure isn't helping alleviate the pain within a bounded interval, we can fail the shard copy, considering it to be the reason for the bottleneck and knowing that we cannot allow replicas to fall too far behind without blocking incoming writes.
@Bukhtawar I think we are on the same page here, though your first sentence is the opposite: we will apply pressure first, and if the replica is not able to cool off & catch up within a bounded interval, it will be failed.
Will add one more issue here, linked to this one, to actually fail replicas. I suggest we have a background task that periodically (every 30s or so) fetches the stats introduced with #6520 and fails any replicas in the RG that are behind. @Bukhtawar wdyt?
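Roughly something like the following, where the scheduler wiring, stats supplier, and fail hook are all placeholders for illustration rather than the real shard-failure path:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Sketch of the proposed background task: every ~30s, pull the per-replica lag
 * stats (as introduced in #6520) and fail replicas that have stayed behind
 * beyond the bounded cool-off interval, i.e. backpressure did not help them
 * catch up.
 */
class LaggingReplicaMonitor {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(Supplier<List<ReplicaLag>> statsSupplier,
               Consumer<String> failReplica,
               long maxAllowedLagMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            for (ReplicaLag lag : statsSupplier.get()) {
                // Only fail replicas that have remained behind past the allowed interval.
                if (lag.millisBehindPrimary() > maxAllowedLagMillis) {
                    failReplica.accept(lag.allocationId());
                }
            }
        }, 30, 30, TimeUnit.SECONDS);
    }

    /** Placeholder lag record keyed by the replica's allocation id. */
    record ReplicaLag(String allocationId, long millisBehindPrimary) {
    }
}
```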
Closing this issue as the last pending task of failing lagging/stale replicas has been merged with this PR: #6850
Is your feature request related to a problem? Please describe.
Once we enable segment-based replication for an index, we wouldn't be indexing any operation on the replica (just writing to the translog for durability). Just by virtue of having a successful write to the translog, we would assume that the replica is caught up. However, since no indexing operation is applied on replicas other than installing the segments on checkpoint refresh, it's possible that a replica that hasn't successfully processed a checkpoint for a while, due to shard overload or slow I/O, would still be serving reads.
Currently there are no additional mechanisms (once the translog has been written on the replica) to apply back-pressure on the primary if the replica is slow in processing checkpoints. This would be aggravated with remote translog, since there wouldn't be any I/O on the replica at all: remote translog writes on the primary will handle durability altogether.
Describe the solution you'd like
Need to support mechanisms to apply back-pressure and, as a last resort, fail the replica copy if it's unable to process any further checkpoints beyond a threshold.
Describe alternatives you've considered
Additional context