[BUG] Replication Failover Gap #1349

Open
zalseryani opened this issue Mar 12, 2024 · 1 comment
Labels
bug Something isn't working


zalseryani commented Mar 12, 2024

Replication Failover on Production Outage has Data Gap

  • When replication is configured between the Prod and DR OpenSearch sites and Production has an outage, some data will not have been synced to the DR OpenSearch cluster.

  • What is the proper solution for such a case?

  • Failing over will leave some documents or messages unavailable on the DR site. Would it be a solution to have the ETL start again from the point where replication stopped?
  • Is there a better way to handle or configure synchronous replication between the OpenSearch Prod and DR sites, something like a two-phase commit, meaning data is not written/committed on Prod unless it is also written on the DR site?
  • I ask because I do not see any setting under "Replication settings" to tune the replication speed or the polling interval for the data itself (as opposed to metadata/settings, or new matching indices when an auto-follow rule is configured between the Prod and DR sites).

Kindly advise, and thanks in advance for your time and support.

zalseryani added the bug (Something isn't working) and untriaged labels on Mar 12, 2024
@ankitkala
Member

We do not support synchronous replication.

During DR, the follower stats can give you the last tracked leaderCheckpoint and followerCheckpoint, but (1) they track changes at the shard level, whereas the user is concerned with the REST API level, and (2) a checkpoint does not tell you how much data has been replicated in terms of time; it is only a monotonically increasing integer value.
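
As a rough illustration of reading those checkpoints, here is a minimal Python sketch that computes the operation gap from a replication status response. The field names (`syncing_details`, `leader_checkpoint`, `follower_checkpoint`) are assumed to match the index-level replication status output, and the sample values are made up. It does not talk to a cluster; it only shows the arithmetic.

```python
def checkpoint_lag(status: dict) -> int:
    """Return how many operations the follower is behind the last leader
    checkpoint it has seen (an operation count, not a time span)."""
    details = status["syncing_details"]
    return details["leader_checkpoint"] - details["follower_checkpoint"]


# Hypothetical response body; the values are illustrative only.
sample_status = {
    "status": "SYNCING",
    "leader_index": "leader-01",
    "follower_index": "follower-01",
    "syncing_details": {"leader_checkpoint": 1200, "follower_checkpoint": 1185},
}

print(checkpoint_lag(sample_status))
```

The result is a count of operations, so it cannot be converted into "seconds of missing data" without extra information. During a real outage the leader checkpoint is presumably only the last value the follower managed to fetch, so writes accepted by Prod after that fetch would not be reflected in it.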

CCR provides a 1 min SLA for replication, and the lag is usually under 20 seconds. But it's hard to guarantee this, as a lot depends on the workload and overall resource consumption.
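
For the ETL-restart idea raised in the question, one possible approach is to replay from a point slightly earlier than the outage, using the 1 min figure above as the assumed worst-case lag. The sketch below is only an illustration: `replay_start`, `max_lag`, and `margin` are hypothetical names, and it assumes the ETL can re-read its source by timestamp and writes documents with deterministic IDs so that replayed documents overwrite existing ones instead of creating duplicates.

```python
from datetime import datetime, timedelta, timezone


def replay_start(
    last_healthy: datetime,
    max_lag: timedelta = timedelta(minutes=1),  # assumed worst-case replication lag
    margin: timedelta = timedelta(minutes=1),   # extra safety buffer (arbitrary choice)
) -> datetime:
    """Return the timestamp from which the ETL should re-send data to DR."""
    return last_healthy - max_lag - margin


# Example: Prod was last known healthy at 10:00 UTC on the day of the outage.
last_healthy = datetime(2024, 3, 12, 10, 0, tzinfo=timezone.utc)
print(replay_start(last_healthy).isoformat())
```

Because the lag is not strictly guaranteed, the margin is a judgment call; a larger window costs more re-indexing but lowers the chance of leaving a gap.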
