Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a _recovery_source if source is omitted or modified #31106

Merged
merged 11 commits into from
Jun 7, 2018

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Jun 5, 2018

Today if a user omits the _source entirely or modifies the source
on indexing we have no chance to re-create the document after it has
been added. This is an issue for CCR and recovery based on soft deletes
which we are going to make the default. This change adds an additional
recovery source if the source is disabled or modified that is only kept
around until the document leaves the retention policy window.

This change adds a merge policy that efficiently removes this extra source
on merge for all document that are live and not in the retention policy window
anymore.

it's not fully tested and needs some cleanups but I wanted to put it out here for discussion.

Today if a user omits the `_source` entirely or modifies the source
on indexing we have no chance to re-create the document after it has
been added. This is an issue for CCR and recovery based on soft deletes
which we are going to make the default. This change adds an additional
recovery source if the source is disabled or modified that is only kept
around until the document leaves the retention policy window.

This change adds a merge policy that efficiently removes this extra source
on merge for all document that are live and not in the retention policy window
anymore.
@s1monw s1monw added >enhancement :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Jun 5, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach.

Since I'd expect most documents to have dropped their recovery source already I'm wondering whether it would be a bit more efficient to compute the bitset of documents for which we need to retain the recovery source rather than drop.

public NumericDocValues getNumeric(FieldInfo field) throws IOException {
NumericDocValues numeric = super.getNumeric(field);
if (recoverySourceField.equals(field.name)) {
return new FilterNumericDocValues(numeric) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe leave comments about why we don't need to check whether numeric and docValuesReader are null

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really neat :)

if (recoverySource == null) {
return false;
}
if (tombstoneDV.docID() > segmentDocId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tombstoneDV -> recoverySource

if (tombstoneDV.docID() > segmentDocId) {
recoverySource = leafReader.getNumericDocValues(SourceFieldMapper.RECOVERY_SOURCE_NAME);
}
return tombstoneDV.advanceExact(segmentDocId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here tombstoneDV -> recoverySource


if (originalSource != null && source != originalSource && context.indexSettings().isSoftDeleteEnabled()) {
// if we omitted source or modified it we add the _recovery_source to ensure we have it for ops based recovery
BytesRef ref = source.toBytesRef();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we store the original source here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL yeah we should :D - I will fix

import java.util.function.Supplier;

final class RecoverySourcePruneMergePolicy extends OneMergeWrappingMergePolicy {
RecoverySourcePruneMergePolicy(String recoverySourceField, Supplier<Query> retentionPolicySupplier, MergePolicy in) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retentionPolicySupplier is confusing. It's a prune query supplier.

}

// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retentionPolicySupplier)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: retentionPolicySupplier

@s1monw
Copy link
Contributor Author

s1monw commented Jun 6, 2018

@dnhatn @jpountz I addressed your comments and added more tests. Can you take another look?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left a suggestion for an improvement.

if (scorer != null) {
return new SourcePruningFilterCodecReader(recoverySourceField, reader, BitSet.of(scorer.iterator(), reader.maxDoc()));
} else {
return new SourcePruningFilterCodecReader(recoverySourceField, reader, new BitSet.MatchNoBits(reader.maxDoc()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/BitSet.MatchNoBits/Bits.MatchNoBits/

NumericDocValues numeric = super.getNumeric(field);
if (recoverySourceField.equals(field.name)) {
assert numeric != null : recoverySourceField + " must have numeric DV but was null";
return new FilterNumericDocValues(numeric) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If recoverySourceToKeep was a bitset, we could do a leap frog, which would be faster if recoverySourceToKeep is sparse.

final ConjunctionDISI intersection = ConjunctionDISI.intersectIterators(Arrays.asList(numeric, new BitSetIterator(recoverySourceToKeep)));
return new FilterNumericDocValues(numeric) {
  @Override
  public int nextDoc() throws IOException {
    return intersection.nextDoc();
  }
};

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that we now use a single retention query for both MPs.

// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retainSourceQuerySupplier)
throws IOException {
NumericDocValues recovery_source = reader.getNumericDocValues(recoverySourceField);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: snake_case.

@@ -2013,7 +2014,8 @@ private IndexWriterConfig getIndexWriterConfig() {
MergePolicy mergePolicy = config().getMergePolicy();
if (softDeleteEnabled) {
iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD);
mergePolicy = new SoftDeletesRetentionMergePolicy(Lucene.SOFT_DELETE_FIELD, this::softDeletesRetentionQuery, mergePolicy);
mergePolicy = new RecoverySourcePruneMergePolicy(SourceFieldMapper.RECOVERY_SOURCE_NAME, this::softDeletesRetentionQuery,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, a single retention query for both 💯

@s1monw s1monw merged commit 5c6711b into elastic:ccr Jun 7, 2018
s1monw added a commit that referenced this pull request Jun 7, 2018
Today if a user omits the `_source` entirely or modifies the source
on indexing we have no chance to re-create the document after it has
been added. This is an issue for CCR and recovery based on soft deletes
which we are going to make the default. This change adds an additional
recovery source if the source is disabled or modified that is only kept
around until the document leaves the retention policy window.

This change adds a merge policy that efficiently removes this extra source
on merge for all document that are live and not in the retention policy window
anymore.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Aug 29, 2018
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch.
Highlight works in this PR include:

1. Replace hard-deletes by soft-deletes in InternalEngine
2. Use _recovery_source if _source is disabled or modified (elastic#31106)
3. Soft-deletes retention policy based on the global checkpoint (elastic#30335)
4. Read operation history from Lucene instead of translog (elastic#30120)
5. Use Lucene history in peer-recovery (elastic#30522)

These works have been done by the whole team; however, these individuals
(lexical order) have significant contribution in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
dnhatn added a commit that referenced this pull request Aug 31, 2018
This PR integrates Lucene soft-deletes(LUCENE-8200) into Elasticsearch.
Highlight works in this PR include:

- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of translog (#30120)
- Use Lucene history in peer-recovery (#30522)

Relates #30086
Closes #29530

---
These works have been done by the whole team; however, these individuals
(lexical order) have significant contribution in coding and reviewing:

Co-authored-by: Adrien Grand jpountz@gmail.com
Co-authored-by: Boaz Leskes b.leskes@gmail.com
Co-authored-by: Jason Tedor jason@tedor.me
Co-authored-by: Martijn van Groningen martijn.v.groningen@gmail.com
Co-authored-by: Nhat Nguyen nhat.nguyen@elastic.co
Co-authored-by: Simon Willnauer simonw@apache.org
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Aug 31, 2018
This PR integrates Lucene soft-deletes(LUCENE-8200) into Elasticsearch.
Highlight works in this PR include:

- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (elastic#31106)
- Soft-deletes retention policy based on the global checkpoint (elastic#30335)
- Read operation history from Lucene instead of translog (elastic#30120)
- Use Lucene history in peer-recovery (elastic#30522)

Relates elastic#30086
Closes elastic#29530

---
These works have been done by the whole team; however, these individuals
(lexical order) have significant contribution in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
dnhatn added a commit that referenced this pull request Aug 31, 2018
This PR integrates Lucene soft-deletes(LUCENE-8200) into Elasticsearch.
Highlight works in this PR include:

- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of translog (#30120)
- Use Lucene history in peer-recovery (#30522)

Relates #30086
Closes #29530

---
These works have been done by the whole team; however, these individuals
(lexical order) have significant contribution in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants