Add fingerprint ingest processor #13724
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
❌ Gradle check result for 50cdb84: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #13724 +/- ##
============================================
+ Coverage 71.42% 71.71% +0.29%
- Complexity 59978 62103 +2125
============================================
Files 4985 5118 +133
Lines 282275 291805 +9530
Branches 40946 42186 +1240
============================================
+ Hits 201603 209277 +7674
- Misses 63999 65283 +1284
- Partials 16673 17245 +572
☔ View full report in Codecov by Sentry.
❌ Gradle check result for 4a37513: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
modules/ingest-common/src/main/java/org/opensearch/ingest/common/FingerprintProcessor.java
* Add fingerprint ingest processor Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Ignore metadata fields Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Add sha3-256 hash method Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Remove unused code Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Add exclude_fields and remove include_all_fields Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Modify processor description Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Make FingerprintProcessor being final Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Optimize error message and check if field name is empty string Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Fix yaml test failure Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Prepend string length to the field value Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Append hash method with version number Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Update supported version in yml test file Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Add more comment Signed-off-by: Gao Binlong <gbinlong@amazon.com> * Prepend hash method to the hash value and add more test cases Signed-off-by: Gao Binlong <gbinlong@amazon.com> --------- Signed-off-by: Gao Binlong <gbinlong@amazon.com> (cherry picked from commit 16c8806) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add fingerprint ingest processor * Ignore metadata fields * Add sha3-256 hash method * Remove unused code * Add exclude_fields and remove include_all_fields * Modify processor description * Make FingerprintProcessor being final * Optimize error message and check if field name is empty string * Fix yaml test failure * Prepend string length to the field value * Append hash method with version number * Update supported version in yml test file * Add more comment * Prepend hash method to the hash value and add more test cases --------- (cherry picked from commit 16c8806) Signed-off-by: Gao Binlong <gbinlong@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@gaobinlong The issue linked to this failure is incorrect. The test failing the build is
but the issue linked is for - org.opensearch.remotestore.RemoteStoreRestoreIT
Sorry about that, I just saw that they have the same error message:
…ch-project#14366) * Add fingerprint ingest processor * Ignore metadata fields * Add sha3-256 hash method * Remove unused code * Add exclude_fields and remove include_all_fields * Modify processor description * Make FingerprintProcessor being final * Optimize error message and check if field name is empty string * Fix yaml test failure * Prepend string length to the field value * Append hash method with version number * Update supported version in yml test file * Add more comment * Prepend hash method to the hash value and add more test cases --------- (cherry picked from commit 16c8806) Signed-off-by: Gao Binlong <gbinlong@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: kkewwei <kkewwei@163.com>
Description
Add a new ingest processor named `fingerprint`, which generates a hash value from either a set of specified fields or all fields not in a specified exclusion list, and writes the hash value to the `target_field`. The hash value can be used to deduplicate documents within an index and to collapse search results.

The usage of this processor is:
or
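As a rough sketch of what the two usage variants might look like, the pipeline bodies below are written as Python dictionaries and serialized to JSON. The parameter names (`fields`, `exclude_fields`, `hash_method`, `target_field`, `ignore_missing`) come from this description. The document field names, the `target_field` value, and the helper `to_request_body` are hypothetical and only serve the illustration.

```python
import json

# Variant 1: hash only the listed fields.
# "user.name" and "message" are made-up document fields.
pipeline_with_fields = {
    "description": "fingerprint over selected fields",
    "processors": [
        {
            "fingerprint": {
                "fields": ["user.name", "message"],
                "hash_method": "SHA-256@2.16.0",
                "target_field": "fingerprint",
                "ignore_missing": True,
            }
        }
    ],
}

# Variant 2: hash every field except the listed ones.
# "timestamp" is a made-up document field.
pipeline_with_exclusions = {
    "description": "fingerprint over all fields except the excluded ones",
    "processors": [
        {
            "fingerprint": {
                "exclude_fields": ["timestamp"],
                "target_field": "fingerprint",
            }
        }
    ],
}


def to_request_body(pipeline):
    """Serialize a pipeline definition to a JSON request body (hypothetical helper)."""
    return json.dumps(pipeline, indent=2)


if __name__ == "__main__":
    print(to_request_body(pipeline_with_fields))
    print(to_request_body(pipeline_with_exclusions))
```

Either body could then be sent to the ingest pipeline API. Only one of `fields` and `exclude_fields` is set in each variant, matching the constraint described in the parameter list.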
The main parameters in this processor are:
1. `fields`: the fields in the document used to generate the hash value. Each field name and its `${value length}:value` are concatenated and separated by `|`, like `|field1|3:value1|field2|10:value2|`. For nested fields, the field name is flattened, like `|root_field.sub_field1|1:value1|root_field.sub_field2|100:value2|`. (An earlier revision of this PR concatenated the field name and the bare value, like `|field1|value1|field2|value2|`, before the value length was prepended.)
2. `exclude_fields`: fields not in this list are used to generate the hash value. Only one of `fields` and `exclude_fields` can be non-empty. (This replaces `include_all_fields` from an earlier revision, which controlled whether all fields were included and could not be set together with `fields`.)
3. `hash_method`: `MD5@2.16.0`, `SHA-1@2.16.0`, `SHA-256@2.16.0` or `SHA3-256@2.16.0`; `SHA-1@2.16.0` is the default hash method. This processor is introduced in 2.16.0, and the OpenSearch version is appended to the hash method name to ensure that the processor always generates the same hash value for a specific hash method. If the processing logic of this processor changes in a future version, this parameter will support a new hash method name with the new version.
4. `target_field`: the field in which to store the hash value.
5. `ignore_missing`: if one of the specified fields is missing, the processor exits quietly and does nothing.

In addition, if `fields` and `exclude_fields` are both empty or null, all fields are included and used to generate the hash value.

Related Issues
#13612
Check List
* New functionality includes testing.
* All tests pass

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
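
The hash input format described for `fields` in the Description can be sketched in Python as follows. This is not the code in `FingerprintProcessor.java`; it is a minimal illustration under several assumptions: field names are sorted for determinism, a parent field name selects all of its flattened sub-fields, values are converted with `str`, the digest is hex-encoded, and the hash method is prepended with a `:` separator. The function names `flatten`, `hash_input`, and `fingerprint` are hypothetical.

```python
import hashlib

# Versioned method names from the description, mapped to hashlib algorithm names.
HASH_METHODS = {
    "MD5@2.16.0": "md5",
    "SHA-1@2.16.0": "sha1",
    "SHA-256@2.16.0": "sha256",
    "SHA3-256@2.16.0": "sha3_256",
}


def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted names, e.g. root_field.sub_field1."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def hash_input(doc, fields=None, exclude_fields=None):
    """Build '|name|<len>:<value>|...|' from the selected fields."""
    if fields and exclude_fields:
        raise ValueError("only one of fields and exclude_fields can be non-empty")

    def matches(name, candidates):
        # Assumption: naming a parent field also selects its sub-fields.
        return any(name == c or name.startswith(c + ".") for c in candidates)

    flat = flatten(doc)
    if fields:
        names = [n for n in flat if matches(n, fields)]
    elif exclude_fields:
        names = [n for n in flat if not matches(n, exclude_fields)]
    else:
        names = list(flat)  # both empty or null: include all fields

    parts = []
    for name in sorted(names):  # Assumption: sorted for a stable order.
        value = str(flat[name])
        parts.append(f"{name}|{len(value)}:{value}")
    return "|" + "|".join(parts) + "|"


def fingerprint(doc, hash_method="SHA-1@2.16.0", fields=None, exclude_fields=None):
    """Hash the concatenated input and prepend the hash method name."""
    algorithm = HASH_METHODS[hash_method]
    data = hash_input(doc, fields, exclude_fields).encode("utf-8")
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{hash_method}:{digest}"


if __name__ == "__main__":
    print(hash_input({"field1": "abc", "field2": "0123456789"}))
    print(fingerprint({"field1": "abc", "field2": "0123456789"}))
```

One plausible reason for prepending the value length is that it keeps the input unambiguous when a value itself contains `|`: in this sketch, `{"a": "x|b|y"}` and `{"a": "x", "b": "y"}` produce different inputs, whereas a plain `|name|value|` concatenation would produce the same one.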