
CCR: Aborted document is exposed in Lucene changes #32269

Closed
dnhatn opened this issue Jul 22, 2018 · 6 comments
Labels
>bug :Distributed/CCR Issues around the Cross Cluster State Replication features

Comments

@dnhatn
Member

dnhatn commented Jul 22, 2018

The CCR branch started failing frequently on CI after merging #31007.

These failures can be explained as follows:

  1. A user issues an indexing request that throws an exception during the analysis phase. Because the IndexingChain fails to process the document, the DocumentsWriterPerThread hard-deletes that document internally in Lucene on the primary:
[2018-07-20T18:04:15,095][DEBUG][o.e.a.b.TransportShardBulkAction] [test][0] failed to execute bulk item (index) BulkShardRequest [[test][0]] containing [index {[test][test][2], source[{"suggest_context":{"input":"foo"}}]}]
java.lang.IllegalArgumentException: Contexts are mandatory in context enabled completion field [suggest_context]
  2. When reading the Lucene changes history, Elasticsearch treats all docs as live because it cannot distinguish hard deletes from soft deletes. If a recovering replica reads the aborted document, it will fail to index it, and the replica will never be able to complete its recovery (a sketch of this scan follows the stack trace below):
[2018-07-20T18:04:15,148][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#4]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry: java.lang.IllegalArgumentException: Contexts are mandatory in context enabled completion field [suggest_context]
    at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:401) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:458) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:448)
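
For context, the changes reader effectively scans every doc id and deliberately ignores liveDocs, because soft-deleted documents must stay visible to the history. A minimal sketch of that scan (illustrative only; the class and method names are placeholders, not the actual LuceneChangesSnapshot code):

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;

final class ChangesScanSketch {
    // liveDocs is not consulted because soft-deleted docs carry the history
    // we need; the side effect is that hard-deleted (aborted) docs are
    // indistinguishable and get replayed too.
    static void replayHistory(DirectoryReader reader) throws IOException {
        for (LeafReaderContext leaf : reader.leaves()) {
            LeafReader segment = leaf.reader();
            for (int docId = 0; docId < segment.maxDoc(); docId++) {
                // segment.getLiveDocs() would merge hard and soft deletes,
                // so skipping it resurrects the aborted doc along with the
                // soft-deleted ones.
                // ... fetch _source/seq_no for docId and replay it on the replica
            }
        }
    }
}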

The problem is that we read aborted documents, which should never be exposed. This could be a critical problem for both CCR and Lucene rollbacks.

/cc @s1monw and @bleskes

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn added >bug :Distributed/CCR Issues around the Cross Cluster State Replication features labels Jul 22, 2018
dnhatn added a commit that referenced this issue Jul 22, 2018
@dnhatn
Member Author

dnhatn commented Jul 23, 2018

I see a solution that brings a deleted doc back to life iff it was soft-deleted. However, this approach may be fragile for stale documents (soft-deleted before indexing) because it relies on the order in which fields are processed: if the soft-deletes field of a doc has already been processed when that doc is aborted, we can't exclude that document from liveDocs. Another option is to expose hard liveDocs in Lucene.
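
In code, the first option would check the soft-deletes doc-values field for each deleted doc, roughly like this (a sketch only; wasSoftDeleted is a hypothetical helper, and "__soft_deletes" is the soft-deletes field name ES configures on the IndexWriter):

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

final class SoftDeleteCheckSketch {
    static boolean wasSoftDeleted(LeafReader segment, int docId) throws IOException {
        NumericDocValues softDeletes = segment.getNumericDocValues("__soft_deletes");
        // advanceExact returns true iff the doc has a value for this field.
        // The fragility above: if this field was already processed before the
        // doc got aborted, the aborted doc still looks soft-deleted here.
        return softDeletes != null && softDeletes.advanceExact(docId);
    }
}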

@s1monw
Contributor

s1monw commented Jul 23, 2018

I think we can expose the actual hard deletes at the SegmentReader level. That is the best solution IMO.

@s1monw
Contributor

s1monw commented Jul 24, 2018

As a workaround, we can fix this specific issue by loading the original liveDocs if there are any deletions. This is safe in this case: we write deletes to disk on flush, so if there is an aborted doc there will be at least one hard-deleted doc. We can then do this for the segment reader in question:

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentReader;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.util.Bits;

SegmentReader reader = ...;
SegmentCommitInfo si = reader.getSegmentInfo();
// An aborted doc guarantees getDelCount() > 0 (deletes are flushed to disk),
// so read the on-disk (hard) liveDocs straight from the codec in that case.
Bits hardLiveDocs = si.getDelCount() != 0
        ? si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE)
        : null;
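
A reader of the changes history could then skip anything absent from these hard live docs; a minimal illustrative loop (not the eventual fix, just the idea):

for (int docId = 0; docId < reader.maxDoc(); docId++) {
    if (hardLiveDocs != null && hardLiveDocs.get(docId) == false) {
        continue; // hard-deleted, i.e. aborted: never expose it in the history
    }
    // ... read and replay the operation for docId
}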

@dnhatn WDYT?

@dnhatn
Member Author

dnhatn commented Jul 24, 2018

@s1monw Thanks for the hint. I opened #32333.

dnhatn added a commit that referenced this issue Jul 30, 2018
Today when reading operation history in Lucene, we read all documents.
However, if indexing a document is aborted, IndexWriter will hard-delete
it; we, therefore, need to exclude that document from Lucene history.

This commit makes sure that we exclude aborted documents by using the
hard liveDocs of a SegmentReader if there are deletes.

Closes #32269
@dnhatn
Member Author

dnhatn commented Jul 30, 2018

Fixed by #32333.

@dnhatn dnhatn closed this as completed Jul 30, 2018