[ingest] Re-introduce the Hash Processor with consistent keys across all nodes #34085

jakelandis · 2018-09-26T15:16:34Z

A Hash processor was introduced (#31087) and subsequently reverted once it was realized that if the keys (local to each node) ever got out of sync, it could cause silent functional issues (the same value hashed to different results depending on which node processed it).
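A minimal sketch of that failure mode follows. It assumes HmacSHA256 as the keyed hash (the default mentioned later in this thread). The keys and field value are made up, and `hash_field` is a hypothetical helper, not the processor's actual code.

```python
import hashlib
import hmac


def hash_field(key: bytes, value: str) -> str:
    """Return the hex HMAC-SHA256 of a field value under the given key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys: node B's keystore was never updated after a key change.
node_a_key = b"rotated-key"
node_b_key = b"stale-key"

digest_a = hash_field(node_a_key, "Jane Doe")
digest_b = hash_field(node_b_key, "Jane Doe")

# Neither node raises an error, yet the same name maps to two different values.
print("digests match:", digest_a == digest_b)
```

Nothing fails at ingest time here, which is why the mismatch would only surface later, for example when searching or aggregating on the hashed field.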

Since the key is sensitive data, we cannot simply store it in cluster state. After some internal discussions we have identified two possible strategies to mitigate the out of sync issue.

a) Introduce the concept of a consistent setting. Keep the key stored in the keystore and keep an id of sorts (a hash of the key?) either in cluster state, or require that id to be included in the configuration for the hash processor. What to do when the setting is inconsistent will require some more discussion.

b) Introduce the concept of encrypted settings stored in the cluster state. More discussion here: #32727
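Strategy (a) could be sketched roughly as below. This is only an illustration under stated assumptions: the id is taken to be a plain SHA-256 of the key (the issue only floats "a hash of the key?" as a possibility), and `key_id` and `check_consistent` are hypothetical names. A plain unsalted hash is used for brevity and would be a poor choice for a low-entropy key.

```python
import hashlib
import hmac


def key_id(key: bytes) -> str:
    """Derive a non-secret id from the key (assumption: plain SHA-256)."""
    return hashlib.sha256(key).hexdigest()


def check_consistent(local_key: bytes, published_id: str) -> bool:
    """Compare the id of the local keystore key with the published id."""
    return hmac.compare_digest(key_id(local_key), published_id)


# Hypothetical: the id would be published in cluster state or in the
# hash processor's configuration; the key itself never leaves the keystore.
published_id = key_id(b"shared-secret")

in_sync = check_consistent(b"shared-secret", published_id)
out_of_sync = check_consistent(b"stale-secret", published_id)
print("node with current key consistent:", in_sync)
print("node with stale key consistent:", out_of_sync)
```

What a node should do when the check fails (reject the pipeline, refuse to start, log a warning) is exactly the open question noted in (a).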

elasticmachine · 2018-09-26T15:23:16Z

Pinging @elastic/es-core-infra

jakelandis · 2019-09-03T17:05:11Z

Unblocked by #40416

elasticmachine · 2019-09-03T22:09:42Z

Pinging @elastic/es-core-features

jakelandis · 2019-09-09T15:22:49Z

Once #46241 is merged, we may want to consider this processor to be async.

jasontedor · 2019-09-09T15:32:03Z

@jakelandis Can you clarify why we would want to go async here? I would expect calculating the hash to be fast, and not worth context switching off the processing thread?

jakelandis · 2019-09-09T16:21:16Z

At job-1 we did something similar on high volume ingestion (not to ES) and calculating the hash was by far the most expensive of the transformations applied to the data. Once this is implemented, running this with Rally would be helpful to make that decision.

jasontedor · 2019-09-09T16:30:22Z

It might be the most expensive, but that doesn't mean it's so expensive that it needs to be done after a couple of context switches. Recall, the use-case for the hash processor is to anonymize fields like names. Those are not huge fields.

jakelandis · 2019-09-09T16:38:52Z

I'm not sure whether context switches or cryptographic hashes (for small data) are more expensive. However, I do think we should measure it and let that guide the decision.

EDIT: for clarity, I am not saying we should choose async if it is marginally faster in the tests... IIRC at job-1 the overhead was substantial: 5-10ms per doc on enterprise-class bare metal, which adds up when multiplied by millions of docs. I would just want to confirm that we are not introducing a large performance difference.

jasontedor · 2019-09-10T01:11:19Z

I am concerned with spending time on details that are intuitively not expected to be a problem, and that maybe we should only consider if it proves to be a problem in the wild.

jakelandis · 2019-09-11T14:31:53Z

I ran a quick JMH benchmark to see if my concerns were justified. During that process, I remembered that at job-1 we used SHA256withRSA, not HmacSHA256 (the default).

HmacSHA256 is plenty fast and we don't need to explore going async here. (SHA256withRSA is indeed slow, but it is not relevant here since it is not supported.)
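For anyone who wants a rough feel for the cost, the kind of measurement described above can be approximated with a small timing loop. This is a Python sketch, not the JMH benchmark from this comment, so the absolute numbers are not comparable to JVM results. The key and value are hypothetical.

```python
import hashlib
import hmac
import timeit

key = b"example-key"   # hypothetical key
value = b"Jane Doe"    # a short field, such as a name


def hmac_sha256() -> bytes:
    """Compute HMAC-SHA256 over the short value with the example key."""
    return hmac.new(key, value, hashlib.sha256).digest()


calls = 20_000
seconds = timeit.timeit(hmac_sha256, number=calls)
per_call_us = seconds / calls * 1e6

# Reports the average cost of one keyed hash on this machine.
print(f"HMAC-SHA256 on {len(value)} bytes: {per_call_us:.2f} us per call")
```

Comparing that per-call figure against the cost of handing work to another thread is the decision being discussed above.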

justinfiore · 2020-07-31T12:35:15Z

Is this issue still being worked on?
If so, do we have any idea when we might see it?

geekpete · 2020-08-03T00:10:16Z

Team, wondering if someone can verify if this example is workable/safe purely for the purpose of generating unique document ids for de-duplication purposes (ie, not safe for cryptography purposes due to the original statement of this issue)?

dakrone · 2024-05-08T21:59:16Z

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

jakelandis mentioned this issue Sep 26, 2018

[Ingest] Hash processor - require keyed hash ? #31692

Closed

jakelandis added the :Data Management/Ingest Node label Sep 26, 2018

talevy mentioned this issue Oct 25, 2018

add documentation for Security's Hash Processor #32112

Closed

jakelandis mentioned this issue Nov 15, 2018

Fingerprinting Ingest Processor #16938

Closed

talevy mentioned this issue Dec 11, 2018

Write documentation for Security's HashProcessor #31694

Closed

colings86 added the 7x label Apr 12, 2019

jakelandis added :Data Management/Ingest Node and removed :Data Management/Ingest Node labels Sep 3, 2019

danhermann self-assigned this Sep 4, 2019

martijnvg mentioned this issue Sep 30, 2019

Provide Thread Safe, Fast Hashing Method #30790

Closed

ycombinator mentioned this issue Oct 23, 2019

Fingerprint processor elastic/beats#14205

Merged

martijnvg mentioned this issue Nov 12, 2019

New processors to expand ingest node's capabilities #48986

Open

9 tasks

polyfractal removed the 7x label Dec 12, 2019

rjernst added the Team:Data Management label May 4, 2020

danhermann mentioned this issue Feb 9, 2021

Add "fingerprint" ingest processor #53578

Closed

dakrone closed this as not planned May 8, 2024
