Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 2.x] Fix bug where retries within RemoteStoreRefreshListener cause infos/checkpoint mismatch #10760

Merged
merged 1 commit into from
Oct 19, 2023

Conversation

opensearch-trigger-bot[bot]
Copy link
Contributor

Backport e389a09 from #10655.

…heckpoint mismatch (#10655)

* Fix bug where retries within RemoteStoreRefreshListener cause mismatch between ReplicationCheckpoint and uploaded SegmentInfos.

Retries within RemoteStoreRefreshListener run outside of the refresh thread.  This means that concurrent refreshes
may occur during syncSegments execution updating the on-reader SegmentInfos.  A shard's latest ReplicationCheckpoint
is computed and set in a refresh listener, but it is not guaranteed the listener has run before the retry fetches the infos or checkpoint independently.
This fix ensures the listener recomputes the checkpoint while fetching the SegmentInfos. This change also
ensures that we only recompute the checkpoint when necessary because it comes with an IO cost to compute StoreFileMetadata.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update refresh listener to recompute checkpoint from latest infos snapshot.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix broken test case by comparing segments gen

Signed-off-by: Marc Handalian <handalm@amazon.com>

spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix RemoteStoreRefreshListener tests

Signed-off-by: Marc Handalian <handalm@amazon.com>

* add extra log

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit e389a09)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Copy link
Contributor

Compatibility status:

Checks if related components are compatible with change 23d691d

Incompatible components

Incompatible components: [https://github.com/opensearch-project/security-analytics.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member

mch2 commented Oct 19, 2023

Gradle Check (Jenkins) Run Completed with:

#9401 and #8932

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

Merging #10760 (23d691d) into 2.x (822b9d3) will decrease coverage by 0.21%.
Report is 3 commits behind head on 2.x.
The diff coverage is 68.96%.

@@             Coverage Diff              @@
##                2.x   #10760      +/-   ##
============================================
- Coverage     70.98%   70.77%   -0.21%     
+ Complexity    58685    58575     -110     
============================================
  Files          4839     4839              
  Lines        277012   277118     +106     
  Branches      40639    40655      +16     
============================================
- Hits         196629   196138     -491     
- Misses        63664    64213     +549     
- Partials      16719    16767      +48     
Files Coverage Δ
...pensearch/common/settings/FeatureFlagSettings.java 50.00% <ø> (ø)
.../java/org/opensearch/gateway/GatewayMetaState.java 68.51% <100.00%> (-0.96%) ⬇️
...arch/gateway/remote/RemoteClusterStateService.java 69.54% <100.00%> (-0.14%) ⬇️
.../org/opensearch/index/mapper/RangeFieldMapper.java 87.25% <100.00%> (ø)
.../org/opensearch/index/mapper/RootObjectMapper.java 82.74% <100.00%> (+0.07%) ⬆️
...search/index/shard/RemoteStoreRefreshListener.java 92.43% <100.00%> (+2.00%) ⬆️
...ces/replication/PrimaryShardReplicationSource.java 96.00% <ø> (ø)
...ices/replication/RemoteStoreReplicationSource.java 91.07% <100.00%> (+0.16%) ⬆️
...n/java/org/opensearch/script/ScoreScriptUtils.java 1.50% <100.00%> (ø)
...nsearch/search/aggregations/support/ValueType.java 79.59% <100.00%> (+0.42%) ⬆️
... and 12 more

... and 475 files with indirect coverage changes

@mch2 mch2 merged commit 79eb57a into 2.x Oct 19, 2023
41 of 67 checks passed
@mch2 mch2 deleted the backport/backport-10655-to-2.x branch October 19, 2023 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant