Do not expose hard-deleted docs in Lucene history #32333

dnhatn · 2018-07-24T16:22:47Z

Today when reading operation history in Lucene, we read all documents.
However, if indexing a document is aborted, IndexWriter will hard-delete
it; we, therefore, need to exclude that document from Lucene history.

This commit makes sure that we exclude aborted documents by using the
hard liveDocs of a SegmentReader if there are deletes.

Closes #32269
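To make the distinction in the description concrete, here is a minimal, self-contained sketch of the two views of a segment. It is only a model: `HistorySegmentSketch` and its fields are hypothetical names, and `java.util.BitSet` stands in for Lucene's live-docs structures. It assumes a soft-deleted document remains part of the operation history, while a hard-deleted (aborted) document does not.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical model of one segment; these names are not Lucene or Elasticsearch APIs.
final class HistorySegmentSketch {
    final int maxDoc;          // number of doc IDs allocated in the segment
    final BitSet softDeleted;  // docs marked as soft-deleted, still part of history
    final BitSet hardDeleted;  // docs hard-deleted, e.g. because indexing was aborted

    HistorySegmentSketch(int maxDoc, BitSet softDeleted, BitSet hardDeleted) {
        this.maxDoc = maxDoc;
        this.softDeleted = softDeleted;
        this.hardDeleted = hardDeleted;
    }

    // Regular search view: hides both soft-deleted and hard-deleted docs.
    List<Integer> searchDocs() {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!softDeleted.get(doc) && !hardDeleted.get(doc)) {
                docs.add(doc);
            }
        }
        return docs;
    }

    // History view: keeps soft-deleted docs but still hides hard-deleted ones.
    List<Integer> historyDocs() {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!hardDeleted.get(doc)) {
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

In this model, reading "all documents" corresponds to ignoring `hardDeleted` entirely, which is how an aborted document could leak into the history.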

Note that this wrapper does not work well with IndexWriter#tryDeleteDocument. We need to revisit the wrapper after LUCENE-8425 gets in.

This reverts commit 8e66a93.

elasticmachine · 2018-07-24T16:22:48Z

Pinging @elastic/es-distributed

s1monw · 2018-07-25T07:52:12Z

server/src/main/java/org/elasticsearch/common/lucene/Lucene.java

+                        if (si.getDelCount() == 0) {
+                            return new LeafReaderWithLiveDocs(segmentReader, null, segmentReader.maxDoc());
+                        } else {
+                            Bits hardLiveBits = si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READ);


I tried to think this through and we might be subject to concurrent deletes if we do it this way. Can't we instead do something like this:

    DocIdSetIterator soft_deletes = DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator("soft_deletes", sr);
    Bits liveDocs = sr.getLiveDocs();
    FixedBitSet hardLiveDocs = new FixedBitSet(numDeletes);
    hardLiveDocs.set(0, numDeletes);
    for (int i = 0; i < liveDocs.length(); i++) {
        if (liveDocs.get(i) == false) {
            if (soft_deletes.docID() < i) {
                int doc = soft_deletes.docID() == DocIdSetIterator.NO_MORE_DOCS ? DocIdSetIterator.NO_MORE_DOCS : soft_deletes.advance(i);
                if (doc != i) {
                    hardLiveDocs.clear(i);
                }
            }
        }
    }

note I didn't try this out.. just to provide an idea
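The idea in the comment above can be modeled without Lucene. The sketch below is an illustration under stated assumptions, not the code that was merged: `HardLiveDocsSketch` is a hypothetical name, `java.util.BitSet` replaces `Bits`, `FixedBitSet` and the doc-values iterator, and the result is sized by `maxDoc`, which is a choice made for this sketch. A document is treated as hard-deleted when it is not live and also carries no soft-delete marker.

```java
import java.util.BitSet;

// Hypothetical helper; java.util.BitSet stands in for Lucene's Bits and FixedBitSet.
final class HardLiveDocsSketch {
    /**
     * liveDocs:    bit set means the doc is visible to a regular search.
     * softDeleted: bit set means the doc has the soft-deletes field.
     * Returns a bit set where a set bit means the doc was not hard-deleted.
     */
    static BitSet hardLiveDocs(BitSet liveDocs, BitSet softDeleted, int maxDoc) {
        BitSet hardLive = new BitSet(maxDoc);
        hardLive.set(0, maxDoc); // start from "nothing is hard-deleted"
        for (int doc = 0; doc < maxDoc; doc++) {
            // Not live and not soft-deleted: the only remaining explanation
            // in this model is a hard delete.
            if (!liveDocs.get(doc) && !softDeleted.get(doc)) {
                hardLive.clear(doc);
            }
        }
        return hardLive;
    }
}
```

Because both inputs come from the same reader in this approach, the result reflects a single point-in-time view, which is the property the comment is after.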

dnhatn · 2018-07-30T05:02:53Z

@s1monw I've updated the PR using the hardLiveDocs exposed in Lucene. I wonder if we should expose "isNRT" property of a SegmentReader so that we know if we can use SegmentInfos to calculate numDocs instead of getting the cardinality of the liveDocs. Can you please have another look? Thank you.

jpountz

I left some comments.

jpountz · 2018-07-30T07:28:54Z

server/src/main/java/org/elasticsearch/common/lucene/Lucene.java

+                        return new LeafReaderWithLiveDocs(leaf, null, leaf.maxDoc());
+                    }
+                    final int numDocs;
+                    if (hardLiveDocs instanceof FixedBitSet) {


does it ever kick in? I would assume you get a FixedBits instance for live docs instead of a mutable FixedBitSet.

dnhatn

Yes, we convert and then return read-only bits. I removed this 💯.

jpountz · 2018-07-30T07:38:19Z

server/src/test/java/org/elasticsearch/common/lucene/LuceneTests.java

+            assertThat(actualDocs, equalTo(liveDocs));
+        }
+        IOUtils.close(writer, dir);
+    }


Can we make it two tests: one that never inserts a document that fails indexing, and another that always inserts one?

dnhatn · 2018-07-30T13:59:00Z

@jpountz I have addressed your feedback. Can you please have another look? Thank you.

jpountz

LGTM. We should probably leave a TODO about avoiding counting live docs. Unfortunately we can't do it easily with a FilterDirectoryReader since composite readers count documents eagerly, but maybe we could find a different way to consume hard deletes that doesn't require counting them.
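For context on why counting is a concern, here is a rough sketch of what counting live docs through a read-only view involves. `LiveDocCountSketch` and `ReadOnlyBits` are hypothetical names; the interface only mirrors the `get`/`length` shape of a read-only bits view and is not Lucene's type. With no cardinality shortcut available, the count is a linear scan over every doc ID.

```java
// Hypothetical sketch; ReadOnlyBits is defined here and is not a Lucene class.
final class LiveDocCountSketch {
    interface ReadOnlyBits {
        boolean get(int index);
        int length();
    }

    // A null liveDocs means the segment has no deletes, so every doc counts.
    static int numDocs(ReadOnlyBits liveDocs, int maxDoc) {
        if (liveDocs == null) {
            return maxDoc;
        }
        int count = 0;
        // Linear in the number of doc IDs: this is the cost the TODO wants to avoid.
        for (int i = 0; i < liveDocs.length(); i++) {
            if (liveDocs.get(i)) {
                count++;
            }
        }
        return count;
    }
}
```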

dnhatn · 2018-07-30T14:43:01Z

@jpountz Thanks for reviewing. I added a TODO.

dnhatn · 2018-07-30T18:30:33Z

Thanks @jpountz and @s1monw.

* elastic/ccr: (57 commits)
  ShardFollowNodeTask should fetch operation once (elastic#32455)
  Do not expose hard-deleted docs in Lucene history (elastic#32333)
  Tests: Fix convert error tests to use fixed value (elastic#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374)
  REST high-level client: parse back _ignored meta field (elastic#32362)
  [CI] Mute DocumentSubsetReaderTests testSearch
  Reject follow request if following setting not enabled on follower (elastic#32448)
  TEST: testDocStats should always use forceMerge (elastic#32450)
  TEST: avoid merge in testSegmentMemoryTrackedInBreaker
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (elastic#32441)
  Remove reference to non-existent store type (elastic#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (elastic#32417)
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  Validate source of an index in LuceneChangesSnapshot (elastic#32288)
  [TEST] Mute failing testConvertLongHexError
  bump lucene version after backport
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390)
  ...


dnhatn added 2 commits July 24, 2018 12:15

Revert "AwaitsFix: forbid completion without contexts test"

6cb9044

This reverts commit 8e66a93.

dnhatn added the >bug and :Distributed Indexing/Engine labels Jul 24, 2018

dnhatn requested review from jpountz, s1monw and bleskes July 24, 2018 16:22

dnhatn mentioned this pull request Jul 24, 2018

CCR: Aborted document is exposed in Lucene changes #32269

Closed

remove unused imports

5405c6e

s1monw requested changes Jul 25, 2018


dnhatn added 2 commits July 29, 2018 10:05

Merge branch 'ccr' into exclude-hard-deletes

1af3e9c

Use hardLiveDocs

5a7251b

dnhatn requested a review from s1monw July 30, 2018 05:03

jpountz requested changes Jul 30, 2018


Adrien’s feedback

ee8524a

jpountz approved these changes Jul 30, 2018


Add todo

996ad36

dnhatn merged commit 1fdc3f0 into elastic:ccr Jul 30, 2018

dnhatn deleted the exclude-hard-deletes branch July 30, 2018 18:30

dnhatn added the backport pending label Jul 30, 2018

dnhatn mentioned this pull request Jul 30, 2018

Use soft-deletes to maintain document history #29530

Closed

14 tasks

dnhatn removed the backport pending label Aug 2, 2018
