M3 mirrored placement algorithm should support concurrent replaces for instances #2850

Closed
andrewmains12 opened this issue Nov 6, 2020 · 5 comments
Labels
area:aggregator All issues pertaining to aggregator

Comments

@andrewmains12
Contributor

Currently, the M3 mirrored placement algorithm blocks concurrent replace operations, even when the two operations are independent.

Say you have a placement like:

i1 shardSetID 1, 4 shards available, 0 init, 0 leaving
i2 shardSetID 1, 4 shards available, 0 init, 0 leaving

i3 shardSetID 2, 4 shards available, 0 init, 0 leaving
i4 shardSetID 2, 4 shards available, 0 init, 0 leaving

with i1, i2 in a shardset pair and i3, i4 in a shardset pair.

If you try to do 2 consecutive replaces, the second replace will block until the first replace finishes, e.g.:

R1: replace i1 with i5
R2: replace i3 with i6   # this replace will block until R1 finishes

Since replaces can take a long time to complete for large tile sizes (e.g. 1 hour), this is not ideal.

The reason this ends up blocking is that we call MarkAllShardsAvailable before doing the replacement, which means that all shards in the placement have to pass the IsCutoverFn and IsCutoffFn checks. For hour tiles, this means the second replace has to wait an additional hour or so to go through.

The point of this call (if I understand correctly) is to make sure that the placement is in a clean state before doing any shard movement. That is, if shards are already moving between nodes, you shouldn't perform a replace on those nodes.
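
Roughly, the gating behaves like the sketch below (a minimal Go sketch with hypothetical types and helper names, not the actual m3 code): the check spans every instance in the placement, so shards that an earlier replace left initializing or leaving keep failing the cutover/cutoff checks for up to a tile, and the next replace cannot start.

```go
// A minimal sketch (hypothetical types and helper names, not the actual m3
// code) of the gating behavior described above.
package placementsketch

import (
	"fmt"
	"time"
)

type Shard struct {
	ID          uint32
	State       string // "available", "initializing", or "leaving"
	CutoverTime time.Time
	CutoffTime  time.Time
}

type Instance struct {
	ID         string
	ShardSetID uint32
	Shards     []Shard
}

// checkAllShardsReady mirrors the "all shards must pass IsCutoverFn /
// IsCutoffFn" requirement: one unready shard on any instance fails the call.
func checkAllShardsReady(instances []Instance, now time.Time) error {
	for _, inst := range instances {
		for _, s := range inst.Shards {
			if s.State == "initializing" && now.Before(s.CutoverTime) {
				return fmt.Errorf("shard %d on %s has not cut over yet", s.ID, inst.ID)
			}
			if s.State == "leaving" && now.Before(s.CutoffTime) {
				return fmt.Errorf("shard %d on %s has not cut off yet", s.ID, inst.ID)
			}
		}
	}
	return nil
}
```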

Potential Fixes

We may be able to fix this by limiting the instances we mark available to those affected by the replace, i.e. the leaving (replaced) instances. This will allow replaces that operate on independent shardsets to proceed concurrently.
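
For illustration, here is a rough sketch of that scoping, reusing the hypothetical Instance type and checkAllShardsReady helper from the sketch above (again, not the actual m3 code): only instances whose shardset is touched by the replace are checked, so a replace on shardset 2 no longer waits on shards cutting over in shardset 1.

```go
// Sketch of the proposed scoping (hypothetical helper, building on the types
// in the previous sketch): only instances whose shardset is affected by the
// replace are required to have ready shards.
func checkReplaceScopeReady(all []Instance, leavingIDs []string, now time.Time) error {
	// Collect the shardset IDs owned by the leaving (replaced) instances.
	affected := make(map[uint32]bool)
	for _, inst := range all {
		for _, id := range leavingIDs {
			if inst.ID == id {
				affected[inst.ShardSetID] = true
			}
		}
	}

	// Check only instances in those shardsets; everything else is ignored,
	// so replaces on disjoint shardsets do not block each other.
	var scoped []Instance
	for _, inst := range all {
		if affected[inst.ShardSetID] {
			scoped = append(scoped, inst)
		}
	}
	return checkAllShardsReady(scoped, now)
}
```

A sanity check against the example above: with R1 in flight, shardset 1 still has moving shards, but a call scoped to i3 (shardset 2) only inspects i3 and i4, so R2 can proceed.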

@andrewmains12
Contributor Author

cc @ryanhall07 -- @robskillington mentioned you as a potential good POC on the Chronosphere side for this. I'm working on implementing a fix now, but would love any input on the approach.

@andrewmains12
Contributor Author

Also cc @prateek

@gibbscullen
Collaborator

@andrewmains12 -- checking in: is this still an issue after #2858?

@gibbscullen self-assigned this Nov 11, 2020
abliqo added a commit that referenced this issue Jan 23, 2021
Problem:

This fixes issue #2850. The mirrored placement requires all shards to be available before processing an instance replace. This makes it impossible to perform a second replace after an earlier one until *all* shards in the placement have cut over, even if the scopes of the replaces do not overlap, i.e. the replace pairs own disjoint sets of shards. In practice this significantly slows down consecutive replaces and increases the risk of data loss, because the longest supported aggregation tile is 1 hour.

Solution:

When processing a replace, require only the leaving instance and its peers to have their shards available. The peer instances are those that own the same shardset, which includes mirror peers (when the replication factor is >= 2) and any pending replaces in which the specified leaving node is either the replacement or the replaced instance.

Add new tests asserting this use case.
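
For reference, a sketch of how such a peer set could be derived, reusing the hypothetical Instance type from the earlier sketches (not the actual implementation in #3117): in a mirrored placement, sharing a shardset ID is what makes an instance a mirror peer or a pending replacement of the leaving node.

```go
// Sketch (hypothetical helper): peers of a leaving instance are the instances
// that own the same shardset. In a mirrored placement this covers the mirror
// replicas (replication factor >= 2) and any in-flight replacement instance
// assigned the same shardset.
func peersOf(all []Instance, leavingID string) []Instance {
	var shardSetID uint32
	found := false
	for _, inst := range all {
		if inst.ID == leavingID {
			shardSetID = inst.ShardSetID
			found = true
			break
		}
	}
	if !found {
		return nil
	}

	var peers []Instance
	for _, inst := range all {
		if inst.ShardSetID == shardSetID {
			peers = append(peers, inst) // includes the leaving instance itself
		}
	}
	return peers
}
```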
@gibbscullen added the area:aggregator (All issues pertaining to aggregator) label Jan 28, 2021
@gibbscullen
Collaborator

Closing as being worked on in PR #3117.

@abliqo
Collaborator

abliqo commented Jan 30, 2021

@andrewmains12 -- checking in: is this still an issue after #2858?

Yes it is. #2858 allows specifying a custom placement algorithm that could be implemented outside of this OSS code base, but it's better to fix the existing mirrored placement algorithm, as I'm doing in #3117.

abliqo added a commit that referenced this issue Feb 5, 2021
* Consecutive replaces in mirrored placement
