Use a _recovery_source if source is omitted or modified #31106

Conversation
Today if a user omits the `_source` entirely or modifies the source on indexing, we have no way to re-create the document after it has been added. This is an issue for CCR and for recovery based on soft deletes, which we are going to make the default. This change adds an additional recovery source, written only if the source is disabled or modified, and kept only until the document leaves the retention policy window. It also adds a merge policy that efficiently removes this extra source on merge for all documents that are live and no longer inside the retention policy window.
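For readers unfamiliar with the mechanism, here is a minimal sketch in plain Lucene (my own illustration, not the PR's mapper code; the class and helper names are assumptions, only the field name comes from the PR): a stored field carries the original source bytes, and a numeric doc-values field marks the document so a prune merge policy can find it without loading stored fields.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

final class RecoverySourceSketch {
    // field name used by the PR; everything else here is illustrative
    static final String RECOVERY_SOURCE_NAME = "_recovery_source";

    static void addRecoverySource(Document doc, BytesRef originalSource) {
        // stored field keeps the full original source bytes for ops-based recovery
        doc.add(new StoredField(RECOVERY_SOURCE_NAME, originalSource));
        // numeric doc values act as a cheap per-document marker that the prune
        // merge policy can iterate over to find documents still carrying the field
        doc.add(new NumericDocValuesField(RECOVERY_SOURCE_NAME, 1));
    }
}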
Pinging @elastic/es-distributed
I like the approach.
Since I'd expect most documents to have dropped their recovery source already, I'm wondering whether it would be a bit more efficient to compute the bitset of documents for which we need to retain the recovery source rather than the ones for which we need to drop it.
public NumericDocValues getNumeric(FieldInfo field) throws IOException {
    NumericDocValues numeric = super.getNumeric(field);
    if (recoverySourceField.equals(field.name)) {
        return new FilterNumericDocValues(numeric) {
Maybe leave comments about why we don't need to check whether `numeric` and `docValuesReader` are null.
It's really neat :)
if (recoverySource == null) {
    return false;
}
if (tombstoneDV.docID() > segmentDocId) {
`tombstoneDV` -> `recoverySource`
if (tombstoneDV.docID() > segmentDocId) {
    recoverySource = leafReader.getNumericDocValues(SourceFieldMapper.RECOVERY_SOURCE_NAME);
}
return tombstoneDV.advanceExact(segmentDocId);
Same here: `tombstoneDV` -> `recoverySource`.
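Putting the two hunks together with the suggested rename applied, the lookup could read roughly as follows (a sketch assembled from the quoted fragments; the method name, `leafReader`, `segmentDocId`, and the cached iterator field are assumed to come from the enclosing class). The re-open is needed because Lucene doc-values iterators only move forward.

private NumericDocValues recoverySource; // cached per-segment doc values iterator

boolean hasRecoverySource(LeafReader leafReader, int segmentDocId) throws IOException {
    if (recoverySource == null) {
        return false; // this segment has no _recovery_source doc values at all
    }
    if (recoverySource.docID() > segmentDocId) {
        // the iterator already moved past the requested doc; re-open a fresh one
        recoverySource = leafReader.getNumericDocValues(SourceFieldMapper.RECOVERY_SOURCE_NAME);
    }
    return recoverySource.advanceExact(segmentDocId);
}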
if (originalSource != null && source != originalSource && context.indexSettings().isSoftDeleteEnabled()) {
    // if we omitted source or modified it we add the _recovery_source to ensure we have it for ops based recovery
    BytesRef ref = source.toBytesRef();
Should we store the original source here?
LOL yeah we should :D - I will fix
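A sketch of the agreed fix, as I read the exchange (the `fields` list and the exact `StoredField` constructor used are assumptions, not necessarily the PR's final code): the recovery source must capture the original bytes rather than the filtered ones.

if (originalSource != null && source != originalSource && context.indexSettings().isSoftDeleteEnabled()) {
    // _source was omitted or rewritten: keep the *original* bytes for ops-based recovery
    BytesRef ref = originalSource.toBytesRef();
    fields.add(new StoredField(SourceFieldMapper.RECOVERY_SOURCE_NAME, ref.bytes, ref.offset, ref.length));
    fields.add(new NumericDocValuesField(SourceFieldMapper.RECOVERY_SOURCE_NAME, 1));
}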
import java.util.function.Supplier;

final class RecoverySourcePruneMergePolicy extends OneMergeWrappingMergePolicy {
    RecoverySourcePruneMergePolicy(String recoverySourceField, Supplier<Query> retentionPolicySupplier, MergePolicy in) {
`retentionPolicySupplier` is confusing. It's a prune query supplier.
}

// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retentionPolicySupplier)
Same here: `retentionPolicySupplier`.
LGTM. I left a suggestion for an improvement.
if (scorer != null) {
    return new SourcePruningFilterCodecReader(recoverySourceField, reader, BitSet.of(scorer.iterator(), reader.maxDoc()));
} else {
    return new SourcePruningFilterCodecReader(recoverySourceField, reader, new BitSet.MatchNoBits(reader.maxDoc()));
s/BitSet.MatchNoBits/Bits.MatchNoBits/
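For reference, a small self-contained sketch of the corrected fallback (my illustration; the helper class and its name are not from the PR): `Bits.MatchNoBits` is Lucene's all-false `Bits` implementation, while `BitSet.of` materializes the retained documents when the scorer matches anything.

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.Bits;

final class RetainedDocsSketch {
    /** Documents whose recovery source must be kept; empty when nothing matches the retention query. */
    static Bits retainedDocs(DocIdSetIterator retained, int maxDoc) throws IOException {
        if (retained == null) {
            return new Bits.MatchNoBits(maxDoc); // nothing to keep
        }
        return BitSet.of(retained, maxDoc); // random-access view over the matching docs
    }
}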
NumericDocValues numeric = super.getNumeric(field);
if (recoverySourceField.equals(field.name)) {
    assert numeric != null : recoverySourceField + " must have numeric DV but was null";
    return new FilterNumericDocValues(numeric) {
If `recoverySourceToKeep` was a bitset, we could do a leap frog, which would be faster if `recoverySourceToKeep` is sparse:
final DocIdSetIterator intersection = ConjunctionDISI.intersectIterators(
        Arrays.asList(numeric, new BitSetIterator(recoverySourceToKeep, recoverySourceToKeep.approximateCardinality())));
return new FilterNumericDocValues(numeric) {
    @Override
    public int nextDoc() throws IOException {
        // leap-frog: only visit docs that both have the recovery source DV and must keep it
        return intersection.nextDoc();
    }
};
It's nice that we now use a single retention query for both MPs.
// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retainSourceQuerySupplier)
        throws IOException {
    NumericDocValues recovery_source = reader.getNumericDocValues(recoverySourceField);
nit: snake_case.
@@ -2013,7 +2014,8 @@ private IndexWriterConfig getIndexWriterConfig() {
    MergePolicy mergePolicy = config().getMergePolicy();
    if (softDeleteEnabled) {
        iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD);
        mergePolicy = new RecoverySourcePruneMergePolicy(SourceFieldMapper.RECOVERY_SOURCE_NAME, this::softDeletesRetentionQuery,
Nice, a single retention query for both 💯
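The quoted hunk is truncated, so as a hedged reading only (the exact nesting and argument order are assumptions on my part): the prune policy and the soft-deletes retention policy would both be driven by the same retention query and chained around the configured merge policy, roughly like this.

MergePolicy mergePolicy = config().getMergePolicy();
if (softDeleteEnabled) {
    iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD);
    // both wrappers share one retention query supplier: the prune policy drops
    // _recovery_source for docs outside the retention window, while the retention
    // policy keeps soft-deleted docs that are still inside it
    mergePolicy = new RecoverySourcePruneMergePolicy(SourceFieldMapper.RECOVERY_SOURCE_NAME, this::softDeletesRetentionQuery,
        new SoftDeletesRetentionMergePolicy(Lucene.SOFT_DELETE_FIELD, this::softDeletesRetentionQuery, mergePolicy));
}
iwc.setMergePolicy(mergePolicy);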
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlighted work in this PR includes:
- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of translog (#30120)
- Use Lucene history in peer-recovery (#30522)

Relates #30086
Closes #29530

This work has been done by the whole team; however, these individuals (lexical order) made significant contributions in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
It's not fully tested and needs some cleanups, but I wanted to put it out here for discussion.