Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ingest] Re-introduce the Hash Processor with consistent keys across all nodes #34085

Closed
jakelandis opened this issue Sep 26, 2018 · 13 comments
Closed
Assignees
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@jakelandis
Copy link
Contributor

A Hash processor was introduced (#31087), and subsequently reverted once it was realized that if the keys (local to the node) ever got out of sync that could cause silent functional issues (same values hashed to different values based on the node which processed them).

Since the key is sensitive data, we can not simply store it cluster state. After some internal discussions we have identified two possible strategies to mitigate the out of sync issue.

a) Introduce the concept of a consistent setting. Keep the key stored in the keystore and keep a id of sorts (a hash of the key ?) either in cluster state, or require that id to be included in the configuration for the hash processor. What to do when the setting is inconsistent will require some more discussion.

b) Introduce the concept of encrypted settings stored in the cluster state. More discussion here: #32727

@jakelandis jakelandis added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Sep 26, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@jakelandis
Copy link
Contributor Author

Unblocked by #40416

@jakelandis jakelandis added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Sep 3, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features

@danhermann danhermann self-assigned this Sep 4, 2019
@jakelandis
Copy link
Contributor Author

Once #46241 is merged, we may want to consider this processor to be async.

@jasontedor
Copy link
Member

jasontedor commented Sep 9, 2019

@jakelandis Can you clarify why we would want to go async here? I would expect calculating the hash to be fast, and not worth context switching off the processing thread?

@jakelandis
Copy link
Contributor Author

At job-1 we did something similar on high volume ingestion (not to ES) and calculating the hash was by far the most expensive of the transformations applied to the data. Once this is implemented, running this with Rally would be helpful to make that decision.

@jasontedor
Copy link
Member

It might be the most expensive, but that doesn't mean it's so expensive that it needs to be done after a couple of context switches. Recall, the use-case for the hash processor is to anonymize fields like names. Those are not huge fields.

@jakelandis
Copy link
Contributor Author

jakelandis commented Sep 9, 2019

I'm not sure if context switches or cryptographic hashes (for small data) is more expensive. However, I do think we should measure it and let that guide the decision.

EDIT: for clarity, i am not saying we should choose async if it is marginally faster in the tests... IIRC at job-1 the overhead was substantial 5-10ms per doc on enterprise class bare metal when multiplied by millions adds up. I would just want to confirm that we are not introducing a large performance difference

@jasontedor
Copy link
Member

I am concerned with spending time on details that are intuitively not expected to be a problem, and that maybe we should only consider if it proves to be a problem in the wild.

@jakelandis
Copy link
Contributor Author

jakelandis commented Sep 11, 2019

I ran a quick JMH benchmark to see if my concerns were justified. During that process, remembered that that at job-1 we used SHA256withRSA, not HmacSHA256 (the default).

HmacSHA256 is plenty fast enough and we don't need to explore going aysnc here. (SHA256withRSA is indeed slow but a not relevant here since it is not supported)

@justinfiore
Copy link

Is this issue still being worked on?
If so, do we have any idea when we might see it?

@geekpete
Copy link
Member

geekpete commented Aug 3, 2020

Team, wondering if someone can verify if this example is workable/safe purely for the purpose of generating unique document ids for de-duplication purposes (ie, not safe for cryptography purposes due to the original statement of this issue)?

@dakrone
Copy link
Member

dakrone commented May 8, 2024

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

@dakrone dakrone closed this as not planned Won't fix, can't repro, duplicate, stale May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team
Projects
None yet
Development

No branches or pull requests

10 participants