Read from a checkpoint for RFS #1149

mikaylathompson · 2024-11-19T18:48:14Z

Description

One of the key components of sub-shard RFS (by any approach) is the ability to resume reading documents from a specific point in a shard. Each shard is composed of segments, and documents are indexed sequentially within each segment. Specifying the index of the starting segment and the document within that segment pinpoints a specific spot within the shard.

This PR sets the stage for sub-shard work by allowing IndexAndShard work items to also specify a starting segment index and doc id. We don't expect this to actually be exercised until https://opensearch.atlassian.net/browse/MIGRATIONS-2128 is merged (which will create work items that specify these values). If they're not present, segment 0 and doc 0 are assumed.
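
As a rough illustration of the idea (not the code in this PR), the sketch below models a shard as a list of segments and resumes reading from a given segment index and doc id. The class name `CheckpointReadSketch`, the method `readFrom`, and the use of plain lists in place of Lucene readers are all assumptions made for the example; the doc id is treated as a position within the starting segment, as described above.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointReadSketch {
    /**
     * Returns every document at or after (startSegmentIndex, startDocId).
     * Each inner list stands in for one segment; a document's position in
     * that list stands in for its doc id within the segment (an assumption
     * of this sketch, not the Lucene API).
     */
    static List<String> readFrom(List<List<String>> segments, int startSegmentIndex, int startDocId) {
        List<String> result = new ArrayList<>();
        for (int seg = startSegmentIndex; seg < segments.size(); seg++) {
            List<String> docs = segments.get(seg);
            // Only the starting segment is entered part-way; later segments start at doc 0.
            int firstDoc = (seg == startSegmentIndex) ? startDocId : 0;
            for (int doc = firstDoc; doc < docs.size(); doc++) {
                result.add(docs.get(doc));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> shard = List.of(
            List.of("a0", "a1"),
            List.of("b0", "b1", "b2"),
            List.of("c0"));
        System.out.println(readFrom(shard, 0, 0)); // prints [a0, a1, b0, b1, b2, c0]
        System.out.println(readFrom(shard, 1, 2)); // prints [b2, c0]
    }
}
```

Passing `(0, 0)` reads the whole shard, which matches the default described above when no starting values are present.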

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-2164

Testing

Unit tests are added, based on the snapshots in https://github.com/opensearch-project/opensearch-migrations/tree/main/RFS/test-resources/snapshots, which have a variety of segment configurations.

Check List

- New functionality includes testing
- All tests pass, including unit test, integration test and doctest
- New functionality has been documented
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

gregschohn · 2024-11-19T19:44:11Z

...tMigration/src/test/java/org/opensearch/migrations/bulkload/PerformanceVerificationTest.java

@@ -107,7 +107,7 @@ protected RfsLuceneDocument getDocument(IndexReader reader, int docId, boolean i

        // Start reindexing in a separate thread
        Thread reindexThread = new Thread(() -> {
-            reindexer.reindex("test-index", reader.readDocuments(), mockContext).block();
+            reindexer.reindex("test-index", reader.readDocuments(0, 0), mockContext).block();


Should the no-argument overload exist too?

gregschohn · 2024-11-19T19:45:03Z

RFS/src/main/java/org/opensearch/migrations/bulkload/common/LuceneDocumentsReader.java

@@ -114,35 +114,40 @@ protected DirectoryReader getReader() throws IOException {// Get the list of com
        }
    }

-    Publisher<RfsLuceneDocument> readDocsByLeavesInParallel(DirectoryReader reader) {
-        var segmentsToReadAtOnce = 5; // Arbitrary value
+    /* Start reading docs from a specific segment and document id.


nit - spacing

gregschohn · 2024-11-19T19:45:54Z

RFS/src/main/java/org/opensearch/migrations/bulkload/common/LuceneDocumentsReader.java

-            .addArgument(reader::maxDoc)
-            .addArgument(() -> reader.leaves().size())
-            .log();
+                .addArgument(reader::maxDoc)


There are spacing differences in this PR that make it look bigger than it is.

gregschohn · 2024-11-19T19:51:22Z

RFS/src/main/java/org/opensearch/migrations/bulkload/worker/IndexAndShard.java

+    int startingSegmentIndex;
+    int startingDocId;


I'd recommend renaming the class now that these two fields are here too. Feels more like a shard cursor

gregschohn · 2024-11-19T19:53:52Z

RFS/src/main/java/org/opensearch/migrations/bulkload/worker/IndexAndShard.java

+        return new IndexAndShard(components[0], Integer.parseInt(components[1]),
+                components.length >= 3 ? Integer.parseInt(components[2]) : 0,
+                components.length >= 4 ? Integer.parseInt(components[3]) : 0);


I hate java - sorry that you don't have pattern matching
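
For readers following along, here is a minimal sketch of the optional-component parsing shown in the diff above. The `"__"` separator, the class name `WorkItemParseSketch`, and the `Cursor` holder are assumptions for illustration only; the diff does not show the real separator or the surrounding class.

```java
import java.util.regex.Pattern;

public class WorkItemParseSketch {
    // Assumed separator for illustration; the diff does not show the real one.
    static final String SEPARATOR = "__";

    /** Hypothetical holder for the four values carried by a work item. */
    static final class Cursor {
        final String indexName;
        final int shard;
        final int startingSegmentIndex;
        final int startingDocId;

        Cursor(String indexName, int shard, int startingSegmentIndex, int startingDocId) {
            this.indexName = indexName;
            this.shard = shard;
            this.startingSegmentIndex = startingSegmentIndex;
            this.startingDocId = startingDocId;
        }
    }

    /** Parses "index__shard[__segment[__docId]]"; missing trailing parts default to 0. */
    static Cursor parse(String workItem) {
        String[] components = workItem.split(Pattern.quote(SEPARATOR));
        if (components.length < 2) {
            throw new IllegalArgumentException("Expected at least index and shard: " + workItem);
        }
        return new Cursor(
            components[0],
            Integer.parseInt(components[1]),
            components.length >= 3 ? Integer.parseInt(components[2]) : 0,
            components.length >= 4 ? Integer.parseInt(components[3]) : 0);
    }

    public static void main(String[] args) {
        Cursor legacy = parse("logs__3");
        Cursor resumed = parse("logs__3__2__17");
        System.out.println(legacy.startingSegmentIndex + "," + legacy.startingDocId);   // prints 0,0
        System.out.println(resumed.startingSegmentIndex + "," + resumed.startingDocId); // prints 2,17
    }
}
```

Defaulting the missing components to 0 keeps work items written in the older two-part form readable.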

gregschohn · 2024-11-19T21:00:07Z

RFS/src/test/java/org/opensearch/migrations/bulkload/common/LuceneDocumentsReaderTest.java

+            var verifier = StepVerifier.create(documents);
+            var expectedDocumentIds = documentIds.get(i);
+            expectedDocumentIds.forEach(id -> {
+                verifier.expectNextMatches(doc -> {
+                    Assertions.assertEquals(id, doc.id);
+                    return true;
+                });
+            });
+
+            verifier.expectComplete().verify();


This is really nice code, but you won't get full context back when it fails. Something to consider would be to concatenate the expected data into a list or JSON and then do the same for the test data. JUnit runners are pretty good about showing diffs.
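
A minimal sketch of that suggestion, using only the standard library so it runs on its own. The `Doc` class, `idsOf`, and `assertSameIds` are hypothetical stand-ins; in the actual test the comparison would presumably be JUnit's `Assertions.assertEquals(expected, actual)`, and, assuming `documents` is a Reactor `Flux`, the actual ids could be gathered with `collectList().block()`.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ListComparisonSketch {
    /** Hypothetical stand-in for a document that exposes an id. */
    static final class Doc {
        final String id;

        Doc(String id) {
            this.id = id;
        }
    }

    /** Collects the ids of all documents, in order, into one list. */
    static List<String> idsOf(List<Doc> docs) {
        return docs.stream().map(d -> d.id).collect(Collectors.toList());
    }

    /**
     * Compares whole lists at once, so a failure message shows every expected
     * and actual id rather than only the first mismatching element.
     */
    static void assertSameIds(List<String> expected, List<Doc> actualDocs) {
        List<String> actual = idsOf(actualDocs);
        if (!expected.equals(actual)) {
            throw new AssertionError("expected: " + expected + " but was: " + actual);
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("a"), new Doc("b"));
        assertSameIds(List.of("a", "b"), docs); // passes silently
    }
}
```

The trade-off is that the whole stream is buffered before comparing, which is usually fine for small test snapshots.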

codecov · 2024-11-19T21:03:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.75%. Comparing base (5514bc7) to head (56766f8).
Report is 16 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1149      +/-   ##
============================================
+ Coverage     80.72%   80.75%   +0.03%     
- Complexity     2947     2953       +6     
============================================
  Files           399      399              
  Lines         14965    15101     +136     
  Branches       1017     1021       +4     
============================================
+ Hits          12080    12195     +115     
- Misses         2274     2292      +18     
- Partials        611      614       +3

Flag	Coverage Δ
gradle-test	`?`
python-test	`?`
unittests	`80.75% <ø> (?)`

Flags with carried forward coverage won't be shown.

mikaylathompson added 5 commits November 7, 2024 14:52

begin to remove parallelism

b20861d

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Merge branch 'main' into rfs-read-starting-mid-shard

1939b5c

Working commit with startSegmentIndex and startDocId

0d49a9d

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Add tests

a5f0f35

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Apply spotless changes

cdc70d0

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

mikaylathompson marked this pull request as ready for review November 19, 2024 19:51

mikaylathompson requested review from AndreKurait, chelma, gregschohn, lewijacn, peternied and sumobrian as code owners November 19, 2024 19:51

gregschohn reviewed Nov 19, 2024

Address review comments

b51a08e

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

gregschohn approved these changes Nov 19, 2024

Refactor assertions to compare lists

56766f8

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

mikaylathompson merged commit 176eefd into opensearch-project:main Nov 19, 2024
17 checks passed

mikaylathompson deleted the rfs-read-starting-mid-shard branch November 19, 2024 21:52
