[Segment Replication & Remote Translog] Back-pressure and Recovery for lagging replica copies #4478
@Bukhtawar Have put together a POC that applies some backpressure as replicas fall behind, based on some arbitrary doc count limit. Would like your thoughts.

At a high level, we can update the write path to return data from each replica on its current status. I'm leaning toward a seqNo indicating a replica's "visible" checkpoint. The primary can then store this data and use it to determine if backpressure should be applied, based on some configurable setting. We also include another setting to configure when pressure is applied based on the percentage of replicas that are 'stale'. I'm treating a replica as 'stale' if its latest visible seqNo, or, if it is actively syncing, the seqNo it is syncing to, is more than n checkpoints behind the primary. The primary then updates its local state after each copy event completes. I prefer n checkpoints behind to an arbitrary doc count, given that the size of checkpoints can vary.

Some additional thoughts:
If going the seqNo route, some detailed steps/tasks: update IndexShard's ReplicationTracker to store what I'm calling a "visibleCheckpoint" to capture the latest seqNo that a replica has received.
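To make the idea concrete, a rough sketch of the primary-side bookkeeping this would amount to. The class and method names below are placeholders for illustration, not the actual ReplicationTracker API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical primary-side bookkeeping for segment replication lag.
 * Each replica reports the latest seqNo it has made "visible" (i.e. the
 * checkpoint it has finished syncing to) on the write response, and the
 * primary records it here after each copy event completes.
 */
class VisibleCheckpointTracker {

    private final Map<String, Long> visibleSeqNoPerReplica = new ConcurrentHashMap<>();

    /** Called when a replica acks a write or finishes a segment copy event. */
    void updateVisibleCheckpoint(String replicaAllocationId, long visibleSeqNo) {
        visibleSeqNoPerReplica.merge(replicaAllocationId, visibleSeqNo, Math::max);
    }

    /** seqNo distance between the primary's latest refreshed checkpoint and a replica. */
    long seqNoLag(String replicaAllocationId, long primaryCheckpointSeqNo) {
        long visible = visibleSeqNoPerReplica.getOrDefault(replicaAllocationId, -1L);
        return Math.max(0L, primaryCheckpointSeqNo - visible);
    }
}
```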
I've put up #6520 to start capturing metrics we can use for applying pressure. This includes checkpoints behind, bytes behind, and some average replication times. I'm leaning toward a combination of breaching thresholds for all three.

I'm debating whether we should apply pressure globally to a node or to individual replication groups. I think rejecting on the entire node could be useful if driven by the sum of current bytes replicas are behind, given all of them will consume shared resources to copy segments. Though this would mean heavier indices would impact those that are lighter. I'm leaning toward rejecting within a replication group, given that the purpose of rejection is to allow replicas to catch up, and relying on existing throttle settings & index backpressure to preserve node resources. @dreamer-89, @Bukhtawar curious what you two think here.
Am thinking of a simple algorithm for applying pressure in SR based on replica state.
I think this is a reasonable best effort to prevent replicas from falling too far behind until we have a streaming API where we can control our ingestion rate based on these metrics.
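For illustration, a minimal sketch of what such a check could boil down to per replication group, assuming hypothetical thresholds and stat names rather than the actual settings:

```java
import java.util.List;

/**
 * Illustrative check (not the actual implementation) combining the three lag
 * signals mentioned above: checkpoints behind, bytes behind, and recent average
 * replication time. All thresholds and names here are assumptions.
 */
class SegmentReplicationPressure {

    // Hypothetical thresholds; in practice these would be dynamic cluster/index settings.
    private final int maxCheckpointsBehind = 4;
    private final long maxBytesBehind = 512L * 1024 * 1024; // 512 MB
    private final double staleReplicaFractionLimit = 0.5;   // fraction of replicas that must be stale

    /** A replica counts as stale only if it breaches all three lag signals. */
    boolean isStale(ReplicaLagStats stats) {
        return stats.checkpointsBehind() > maxCheckpointsBehind
            && stats.bytesBehind() > maxBytesBehind
            && stats.avgReplicationTimeMillis() > stats.avgCheckpointIntervalMillis();
    }

    /** Reject new writes for this replication group once enough replicas are stale. */
    boolean shouldRejectWrites(List<ReplicaLagStats> replicas) {
        if (replicas.isEmpty()) {
            return false;
        }
        long stale = replicas.stream().filter(this::isStale).count();
        return (double) stale / replicas.size() >= staleReplicaFractionLimit;
    }

    /** Per-replica lag stats, roughly along the lines of what #6520 captures. */
    record ReplicaLagStats(int checkpointsBehind,
                           long bytesBehind,
                           long avgReplicationTimeMillis,
                           long avgCheckpointIntervalMillis) {
    }
}
```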
From what I understand, we are starting with failing the shard first rather than applying backpressure, i.e. disallowing primaries from taking in more write requests and allowing lagging replicas a cool-off. If backpressure isn't helping alleviate the pain within a bounded interval, we can fail the shard copy, considering it to be the reason for the bottleneck and knowing that we cannot allow replicas to fall too far behind without blocking incoming writes.
@Bukhtawar I think we are on the same page here, though your first sentence is the opposite: we will apply pressure first, and if the replica is not able to cool off & catch up within a bounded interval, it will be failed.
Will add one more issue here, linked to this one, to actually fail replicas. I suggest we have a background task that periodically (every 30s or so) fetches the stats introduced with #6520 and fails any replicas in the RG that are behind. @Bukhtawar wdyt?
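Roughly something like the following, where the scheduler wiring, stats supplier, and fail hook are all placeholders for illustration rather than the real shard-failure path:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Sketch of the proposed background task: every ~30s, pull the per-replica lag
 * stats (as introduced in #6520) and fail replicas that have stayed behind
 * beyond the bounded cool-off interval, i.e. backpressure did not help them
 * catch up.
 */
class LaggingReplicaMonitor {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(Supplier<List<ReplicaLag>> statsSupplier,
               Consumer<String> failReplica,
               long maxAllowedLagMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            for (ReplicaLag lag : statsSupplier.get()) {
                // Only fail replicas that have remained behind past the allowed interval.
                if (lag.millisBehindPrimary() > maxAllowedLagMillis) {
                    failReplica.accept(lag.allocationId());
                }
            }
        }, 30, 30, TimeUnit.SECONDS);
    }

    /** Placeholder lag record keyed by the replica's allocation id. */
    record ReplicaLag(String allocationId, long millisBehindPrimary) {
    }
}
```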
Closing this issue as the last pending task of failing lagging/stale replicas has been merged with this PR: #6850
Is your feature request related to a problem? Please describe.
Once we enable segment-based replication for an index, we wouldn't be indexing any operation on the replica (just writing to the translog for durability). Just by virtue of having a successful write to the translog, we would assume that the replica is caught up. However, since no indexing operation is applied on replicas other than installing the segments on checkpoint refresh, it's possible that a replica that hasn't successfully processed a checkpoint for a while, due to shard overload or slow I/O, would still be serving reads.
Currently there are no additional mechanisms (once the translog has been written on the replica) to apply back-pressure on the primary if the replica is slow in processing checkpoints. This would be aggravated with remote translog, since there wouldn't be any I/O on the replica at all: remote translog writes on the primary will handle durability altogether.
Describe the solution you'd like
Need to support mechanisms to apply back-pressure and, as a last resort, fail the replica copy if it's unable to process any further checkpoints beyond a threshold.
Describe alternatives you've considered
Additional context