Add fingerprint ingest processor #13724

Merged 18 commits on Jun 14, 2024


gaobinlong
Collaborator

@gaobinlong gaobinlong commented May 17, 2024

Description

Add a new ingest processor named fingerprint. It generates a hash value from a set of specified fields, or from all fields not in a specified exclusion list, and writes the hash value to target_field. The hash value can be used to deduplicate documents within an index and to collapse search results.

The usage of this processor is:

"processors": [
  {
    "fingerprint": {
      "fields": ["foo", "bar"],
      "target_field": "fingerprint",
      "hash_method": "SHA-256@2.16.0",
      "ignore_missing": true
    }
  }
]

or

"processors": [
  {
    "fingerprint": {
      "exclude_fields": ["zoo"],
      "target_field": "fingerprint"
    }
  }
]
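
To try either configuration without indexing anything, the processor definition can be wrapped in a request body for the ingest simulate endpoint (POST _ingest/pipeline/_simulate). The following is a minimal sketch that only builds that body. The helper names fingerprint_pipeline and simulate_body are hypothetical and not part of this PR, and the check that fields and exclude_fields are not both set is an assumption based on the parameter description below.

```python
import json


def fingerprint_pipeline(fields=None, exclude_fields=None,
                         target_field="fingerprint",
                         hash_method=None, ignore_missing=None):
    """Build a pipeline body holding a single fingerprint processor.

    Hypothetical helper: it mirrors the JSON fragments shown in this PR
    description and omits any option that was not given.
    """
    if fields and exclude_fields:
        # Assumption: the two lists are mutually exclusive.
        raise ValueError("only one of fields and exclude_fields may be non-empty")
    config = {"target_field": target_field}
    if fields:
        config["fields"] = list(fields)
    if exclude_fields:
        config["exclude_fields"] = list(exclude_fields)
    if hash_method is not None:
        config["hash_method"] = hash_method
    if ignore_missing is not None:
        config["ignore_missing"] = ignore_missing
    return {"processors": [{"fingerprint": config}]}


def simulate_body(pipeline, sample_sources):
    """Body for POST _ingest/pipeline/_simulate with an inline pipeline."""
    return {"pipeline": pipeline,
            "docs": [{"_source": source} for source in sample_sources]}


body = simulate_body(
    fingerprint_pipeline(fields=["foo", "bar"],
                         hash_method="SHA-256@2.16.0",
                         ignore_missing=True),
    [{"foo": "a", "bar": "b"}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the simulate endpoint with any HTTP client; the sketch deliberately makes no network call.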

The main parameters in this processor are:

  1. fields: the fields in the document used to generate the hash value. Each field name and its ${value length}:value are concatenated and separated by |, like |field1|3:value1|field2|10:value2|. For nested fields, the field name is flattened, like |root_field.sub_field1|1:value1|root_field.sub_field2|100:value2|.
  2. exclude_fields: fields not in this list are used to generate the hash value. Only one of fields and exclude_fields can be non-empty.
  3. hash_method: one of MD5@2.16.0, SHA-1@2.16.0, SHA-256@2.16.0, or SHA3-256@2.16.0; SHA-1@2.16.0 is the default. This processor is introduced in 2.16.0, and the OpenSearch version is appended to the hash method name so that a given hash method always generates the same hash value. If the processing logic of this processor changes in a future version, this parameter will support a new hash method carrying the new version.
  4. target_field: the field in which to store the hash value.
  5. ignore_missing: if true and one of the specified fields is missing, the processor exits quietly and does nothing.
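
The concatenation format described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in FingerprintProcessor.java: sorting the field names, converting values with str(), Base64-encoding the digest, and joining the hash method and digest with ":" are all assumptions (the commit list only says that the hash method is prepended to the hash value). The function names are hypothetical.

```python
import base64
import hashlib

# Versioned hash method names from this PR mapped to hashlib algorithm names.
HASHERS = {
    "MD5@2.16.0": "md5",
    "SHA-1@2.16.0": "sha1",
    "SHA-256@2.16.0": "sha256",
    "SHA3-256@2.16.0": "sha3_256",
}


def flatten(source, prefix=""):
    """Flatten nested objects into dotted names, e.g. root_field.sub_field1."""
    flat = {}
    for key, value in source.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def canonical_string(flat_fields):
    """Build |name|<length>:value|name|<length>:value| from flat fields."""
    parts = []
    for name in sorted(flat_fields):      # assumption: deterministic order
        value = str(flat_fields[name])    # assumption: plain string form
        parts.append(f"{name}|{len(value)}:{value}")
    return "|" + "|".join(parts) + "|"


def fingerprint(source, hash_method="SHA-1@2.16.0"):
    """Hash the canonical string and prepend the versioned method name."""
    data = canonical_string(flatten(source)).encode("utf-8")
    digest = hashlib.new(HASHERS[hash_method], data).digest()
    # Assumption: "<method>:<base64 digest>" as the stored value.
    return f"{hash_method}:{base64.b64encode(digest).decode('ascii')}"


print(canonical_string(flatten({"field1": "value1", "field2": "value2"})))
# prints |field1|6:value1|field2|6:value2|
```

One plausible reason for the length prefix is that it keeps the concatenation unambiguous when a value itself contains the separator: in this sketch, {"a": "x|b|y"} and {"a": "x", "b": "y"} produce different canonical strings, whereas a format without lengths would map both to |a|x|b|y|.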

In addition, if fields and exclude_fields are both empty or null, all fields are included and used to generate the hash value.
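
The selection rules for fields, exclude_fields, and ignore_missing can be summarized in a short sketch. The function select_fields is hypothetical; returning None to represent "exit quietly" and raising an error when ignore_missing is false are assumptions about behavior that this description does not spell out.

```python
def select_fields(flat_source, fields=None, exclude_fields=None,
                  ignore_missing=False):
    """Pick the fields that feed the hash, following the rules above.

    flat_source maps flattened field names to values. Returns a dict of
    selected fields, or None when the processor should do nothing.
    """
    if fields and exclude_fields:
        raise ValueError("only one of fields and exclude_fields may be non-empty")

    if fields:
        missing = [name for name in fields if name not in flat_source]
        if missing:
            if ignore_missing:
                return None  # assumption: stands for "exit quietly"
            # Assumption: a missing field is an error without ignore_missing.
            raise KeyError(f"missing fields: {missing}")
        return {name: flat_source[name] for name in fields}

    # exclude_fields given: everything except the excluded names.
    # Both empty or null: the exclusion set is empty, so all fields are used.
    excluded = set(exclude_fields or [])
    return {name: value for name, value in flat_source.items()
            if name not in excluded}


doc = {"foo": 1, "bar": 2, "zoo": 3}
print(select_fields(doc, exclude_fields=["zoo"]))  # foo and bar only
print(select_fields(doc))                          # all three fields
```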

Related Issues

#13612

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Contributor

❌ Gradle check result for 50cdb84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for e6b8851: SUCCESS


codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 80.17241% with 23 lines in your changes missing coverage. Please review.

Project coverage is 71.71%. Comparing base (b15cb0c) to head (1f2d5d0).
Report is 424 commits behind head on main.

Files | Patch % | Lines
...opensearch/ingest/common/FingerprintProcessor.java | 80.53% | 11 Missing and 11 partials ⚠️
...search/ingest/common/IngestCommonModulePlugin.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13724      +/-   ##
============================================
+ Coverage     71.42%   71.71%   +0.29%     
- Complexity    59978    62103    +2125     
============================================
  Files          4985     5118     +133     
  Lines        282275   291805    +9530     
  Branches      40946    42186    +1240     
============================================
+ Hits         201603   209277    +7674     
- Misses        63999    65283    +1284     
- Partials      16673    17245     +572     


Contributor

❌ Gradle check result for 4a37513: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Contributor

✅ Gradle check result for 1f2d5d0: SUCCESS

@andrross andrross merged commit 16c8806 into opensearch-project:main Jun 14, 2024
38 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 14, 2024
* Add fingerprint ingest processor

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Ignore metadata fields

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Add sha3-256 hash method

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Remove unused code

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Add exclude_fields and remove include_all_fields

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Modify processor description

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Make FingerprintProcessor being final

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Optimize error message and check if field name is empty string

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Fix yaml test failure

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Prepend string length to the field value

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Append hash method with version number

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Update supported version in yml test file

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Add more comment

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

* Prepend hash method to the hash value and add more test cases

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

---------

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
(cherry picked from commit 16c8806)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Jun 15, 2024
@linuxpi
Collaborator

linuxpi commented Jun 17, 2024

❌ Gradle check result for 62703cb: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Known issue:#11085

@gaobinlong This issue linked to this failure is incorrect. The test failing the build is

[org.opensearch.indices.store.IndicesStoreIntegrationIT.testShardActiveElsewhereDoesNotDeleteAnother {p0={"cluster.remote_store.state.enabled":"true"}}](https://build.ci.opensearch.org/job/gradle-check/40909/testReport/junit/org.opensearch.indices.store/IndicesStoreIntegrationIT/testShardActiveElsewhereDoesNotDeleteAnother__p0___cluster_remote_store_state_enabled___true___/)

but the issue linked is for - org.opensearch.remotestore.RemoteStoreRestoreIT

@gaobinlong
Collaborator Author

❌ Gradle check result for 62703cb: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Known issue:#11085

@gaobinlong This issue linked to this failure is incorrect. The test failing the build is

[org.opensearch.indices.store.IndicesStoreIntegrationIT.testShardActiveElsewhereDoesNotDeleteAnother {p0={"cluster.remote_store.state.enabled":"true"}}](https://build.ci.opensearch.org/job/gradle-check/40909/testReport/junit/org.opensearch.indices.store/IndicesStoreIntegrationIT/testShardActiveElsewhereDoesNotDeleteAnother__p0___cluster_remote_store_state_enabled___true___/)

but the issue linked is for - org.opensearch.remotestore.RemoteStoreRestoreIT

Sorry about that. I saw that they have the same error message, java.lang.AssertionError: Missing cluster-manager..., so I thought they were the same issue. It should actually be linked to #12788.

harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this pull request Jul 12, 2024
@reta reta mentioned this pull request Jul 17, 2024
3 tasks
kkewwei pushed a commit to kkewwei/OpenSearch that referenced this pull request Jul 24, 2024
…ch-project#14366)
Signed-off-by: kkewwei <kkewwei@163.com>
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
Labels
backport 2.x (Backport to 2.x branch), ingest-pipeline, v2.16.0 (Issues and PRs related to version 2.16.0), v3.0.0 (Issues and PRs related to version 3.0.0)
4 participants