Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read from a checkpoint for RFS #1149

Conversation

mikaylathompson
Copy link
Collaborator

@mikaylathompson mikaylathompson commented Nov 19, 2024

Description

One of the key components of sub-shard RFS (by any approach) is to be able to pick up and read documents from a specific point in the shard. Each shard is composed of segments, and documents are indexed sequentially within segments. By specifying the index of the starting segment and the document within that segment, a specific spot within a shard can be pinpointed.

This PR sets the scene to pick up sub-shard work by allowing IndexAndShard work items to also specify a starting segment index and doc id. We don't expect this to actually be exercised until https://opensearch.atlassian.net/browse/MIGRATIONS-2128 is merged (which is create work items that specify these values). If they're not present, segment and doc 0 are assumed.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-2164

Testing

Unit tests are added, based on the snapshots in https://github.com/opensearch-project/opensearch-migrations/tree/main/RFS/test-resources/snapshots, which have a variety of segment configurations.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
@@ -107,7 +107,7 @@ protected RfsLuceneDocument getDocument(IndexReader reader, int docId, boolean i

// Start reindexing in a separate thread
Thread reindexThread = new Thread(() -> {
reindexer.reindex("test-index", reader.readDocuments(), mockContext).block();
reindexer.reindex("test-index", reader.readDocuments(0, 0), mockContext).block();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the override exist too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -114,35 +114,40 @@ protected DirectoryReader getReader() throws IOException {// Get the list of com
}
}

Publisher<RfsLuceneDocument> readDocsByLeavesInParallel(DirectoryReader reader) {
var segmentsToReadAtOnce = 5; // Arbitrary value
/* Start reading docs from a specific segment and document id.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - spacing

.addArgument(reader::maxDoc)
.addArgument(() -> reader.leaves().size())
.log();
.addArgument(reader::maxDoc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are spacing differences in this PR that are making it look bigger than it is

Comment on lines 14 to 15
int startingSegmentIndex;
int startingDocId;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend renaming the class now that these two fields are here too. Feels more like a shard cursor

Comment on lines 41 to 43
return new IndexAndShard(components[0], Integer.parseInt(components[1]),
components.length >= 3 ? Integer.parseInt(components[2]) : 0,
components.length >= 4 ? Integer.parseInt(components[3]) : 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate java - sorry that you don't have pattern matching

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Comment on lines 360 to 369
var verifier = StepVerifier.create(documents);
var expectedDocumentIds = documentIds.get(i);
expectedDocumentIds.forEach(id -> {
verifier.expectNextMatches(doc -> {
Assertions.assertEquals(id, doc.id);
return true;
});
});

verifier.expectComplete().verify();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice code - but you won't get full context back when it fails. Something to consider would be to just concatenate the expected data into a list or json & then do the same for the test data. JUnit runners are pretty good about showing diffs

Copy link

codecov bot commented Nov 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.75%. Comparing base (5514bc7) to head (56766f8).
Report is 16 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1149      +/-   ##
============================================
+ Coverage     80.72%   80.75%   +0.03%     
- Complexity     2947     2953       +6     
============================================
  Files           399      399              
  Lines         14965    15101     +136     
  Branches       1017     1021       +4     
============================================
+ Hits          12080    12195     +115     
- Misses         2274     2292      +18     
- Partials        611      614       +3     
Flag Coverage Δ
gradle-test ?
python-test ?
unittests 80.75% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
@mikaylathompson mikaylathompson merged commit 176eefd into opensearch-project:main Nov 19, 2024
17 checks passed
@mikaylathompson mikaylathompson deleted the rfs-read-starting-mid-shard branch November 19, 2024 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants