
Do not expose hard-deleted docs in Lucene history #32333

Merged: 7 commits into elastic:ccr on Jul 30, 2018

Conversation

@dnhatn (Member) commented on Jul 24, 2018

Today when reading operation history in Lucene, we read all documents.
However, if indexing a document is aborted, IndexWriter will hard-delete
it; we, therefore, need to exclude that document from Lucene history.

This commit makes sure that we exclude aborted documents by using the
hard liveDocs of a SegmentReader if there are deletes.

Closes #32269
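The distinction the fix relies on can be sketched in plain Java. This is an illustration, not Elasticsearch code: java.util.BitSet stands in for Lucene's Bits, and the doc ids are made up. Search must hide both soft and hard deletes, while the history snapshot must use only the hard liveDocs, so soft-deleted (historical) documents stay readable but aborted (hard-deleted) ones do not.

```java
import java.util.BitSet;

public class HardLiveDocsDemo {
    public static void main(String[] args) {
        int maxDoc = 5;

        // hard liveDocs: cleared bits are hard deletes (e.g. aborted indexing)
        BitSet hardLive = new BitSet(maxDoc);
        hardLive.set(0, maxDoc);
        hardLive.clear(2); // doc 2 was aborted, so IndexWriter hard-deleted it

        // soft deletes: documents deleted logically but retained for history
        BitSet softLive = new BitSet(maxDoc);
        softLive.set(0, maxDoc);
        softLive.clear(4); // doc 4 was soft-deleted

        // a regular search sees the intersection of both
        BitSet searchLive = (BitSet) hardLive.clone();
        searchLive.and(softLive);

        // the history snapshot must use hardLive only:
        // doc 4 (soft-deleted) stays readable, doc 2 (aborted) does not
        System.out.println(hardLive.get(4));   // true  -> in history
        System.out.println(hardLive.get(2));   // false -> excluded from history
        System.out.println(searchLive.get(4)); // false -> hidden from search
    }
}
```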

dnhatn added 2 commits July 24, 2018 12:15
The commit message matches the description above and adds: Note that this wrapper does not work well with IndexWriter#tryDeleteDocument. We need to revisit the wrapper after LUCENE-8425 gets in.
@dnhatn added the >bug and :Distributed Indexing/Engine (anything around managing Lucene and the Translog in an open shard) labels on Jul 24, 2018
@dnhatn requested review from jpountz, s1monw and bleskes on Jul 24, 2018 16:22
@elasticmachine (Collaborator):

Pinging @elastic/es-distributed

if (si.getDelCount() == 0) {
return new LeafReaderWithLiveDocs(segmentReader, null, segmentReader.maxDoc());
} else {
Bits hardLiveBits = si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READ);
@s1monw (Contributor) commented on this hunk:
I tried to think this through and we might be subject to concurrent deletes if we do it this way. Can't we instead do something like this:

    DocIdSetIterator softDeletes = DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator("soft_deletes", sr);
    Bits liveDocs = sr.getLiveDocs();
    FixedBitSet hardLiveDocs = new FixedBitSet(sr.maxDoc());
    hardLiveDocs.set(0, sr.maxDoc()); // assume all docs are live, then clear hard deletes
    for (int i = 0; i < liveDocs.length(); i++) {
      if (liveDocs.get(i) == false) {
        // deleted overall; if it is not soft-deleted, it must be a hard delete
        int doc = softDeletes.docID() < i ? softDeletes.advance(i) : softDeletes.docID();
        if (doc != i) {
          hardLiveDocs.clear(i);
        }
      }
    }

note I didn't try this out, just to provide an idea.
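The reconciliation the sketch above aims at can be simulated in plain Java. This is an illustration with made-up names: java.util.BitSet replaces Lucene's FixedBitSet and a sorted int array replaces the soft-deletes DocIdSetIterator. The invariant is that a document missing from the combined liveDocs that is not soft-deleted must be a hard delete.

```java
import java.util.Arrays;
import java.util.BitSet;

public class ReconcileHardDeletes {
    // Rebuild hard liveDocs from the combined liveDocs and the soft-deleted doc ids.
    static BitSet hardLiveDocs(BitSet liveDocs, int maxDoc, int[] softDeletedDocs) {
        BitSet hard = new BitSet(maxDoc);
        hard.set(0, maxDoc); // start with everything live
        for (int i = 0; i < maxDoc; i++) {
            // deleted overall but not soft-deleted => must be a hard delete
            if (!liveDocs.get(i) && Arrays.binarySearch(softDeletedDocs, i) < 0) {
                hard.clear(i);
            }
        }
        return hard;
    }

    public static void main(String[] args) {
        int maxDoc = 6;
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc);
        liveDocs.clear(1); // soft-deleted
        liveDocs.clear(3); // hard-deleted (aborted)

        BitSet hard = hardLiveDocs(liveDocs, maxDoc, new int[] {1});
        System.out.println(hard.get(1)); // true: soft-deleted docs stay in history
        System.out.println(hard.get(3)); // false: aborted docs are excluded
    }
}
```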

@dnhatn (Member, Author) commented on Jul 30, 2018

@s1monw I've updated the PR to use the hardLiveDocs exposed in Lucene. I wonder if we should expose the "isNRT" property of a SegmentReader so that we know whether we can use SegmentInfos to calculate numDocs instead of computing the cardinality of the liveDocs. Can you please have another look? Thank you.

@dnhatn requested a review from s1monw on Jul 30, 2018 05:03
@jpountz (Contributor) left a comment:

I left some comments.

return new LeafReaderWithLiveDocs(leaf, null, leaf.maxDoc());
}
final int numDocs;
if (hardLiveDocs instanceof FixedBitSet) {
jpountz (Contributor):

does it ever kick in? I would assume you get a FixedBits instance for the live docs instead of a mutable FixedBitSet.

dnhatn (Member, Author):

Yes, we convert and then return a read-only Bits instance. I removed this check 💯.

assertThat(actualDocs, equalTo(liveDocs));
}
IOUtils.close(writer, dir);
}
jpountz (Contributor):

can we make it two tests? one that never inserts a document that fails indexing, and another one that always inserts one?

dnhatn (Member, Author):

done

@dnhatn (Member, Author) commented on Jul 30, 2018

@jpountz I have addressed your feedback. Can you please have another look? Thank you.

@jpountz (Contributor) left a comment:
LGTM. We should probably leave a TODO about avoiding counting live docs. Unfortunately we can't do it easily with a FilterDirectoryReader since composite readers count documents eagerly, but maybe we could find a different way to consume hard deletes that doesn't require counting them.
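The cost behind that TODO can be shown with a plain-Java sketch (java.util.BitSet stands in for Lucene's live-docs Bits, and CountLiveDocs is a made-up name, not an Elasticsearch class): a wrapper that must report numDocs up front ends up counting set bits eagerly, a linear scan over maxDoc.

```java
import java.util.BitSet;

public class CountLiveDocs {
    // Counting live docs from a bit set is O(maxDoc); this is the eager
    // counting that a wrapper has to do when only the bits are available.
    static int numDocs(BitSet hardLiveDocs, int maxDoc) {
        return hardLiveDocs.get(0, maxDoc).cardinality();
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(8);
        live.set(0, 8);
        live.clear(3); // one hard-deleted (aborted) document
        System.out.println(numDocs(live, 8)); // prints 7
    }
}
```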

@dnhatn (Member, Author) commented on Jul 30, 2018

@jpountz Thanks for reviewing. I added a TODO.

@dnhatn (Member, Author) commented on Jul 30, 2018

Thanks @jpountz and @s1monw.

@dnhatn merged commit 1fdc3f0 into elastic:ccr on Jul 30, 2018
@dnhatn deleted the exclude-hard-deletes branch on Jul 30, 2018 18:30
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 31, 2018
* elastic/ccr: (57 commits)
  ShardFollowNodeTask should fetch operation once (elastic#32455)
  Do not expose hard-deleted docs in Lucene history (elastic#32333)
  Tests: Fix convert error tests to use fixed value (elastic#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374)
  REST high-level client: parse back _ignored meta field (elastic#32362)
  [CI] Mute DocumentSubsetReaderTests testSearch
  Reject follow request if following setting not enabled on follower (elastic#32448)
  TEST: testDocStats should always use forceMerge (elastic#32450)
  TEST: avoid merge in testSegmentMemoryTrackedInBreaker
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (elastic#32441)
  Remove reference to non-existent store type (elastic#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (elastic#32417)
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  Validate source of an index in LuceneChangesSnapshot (elastic#32288)
  [TEST] Mute failing testConvertLongHexError
  bump lucene version after backport
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390)
  ...
dnhatn added a commit that referenced this pull request Aug 2, 2018
4 participants