[ML] Improve uniqueness of result document IDs #50644
Conversation
Switch from a 32-bit Java hash to a 128-bit Murmur hash for creating document IDs from by/over/partition field values. The 32-bit Java hash was not sufficiently unique and could produce identical numbers for relatively common combinations of by/partition field values, such as L018/128 and L017/228. Fixes elastic#50613
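For illustration, here is a minimal sketch of the difference in hash width (my own example, not the PR's actual code; Guava's murmur3_128 stands in for whatever 128-bit Murmur implementation the change uses):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class ResultIdHashSketch {
    public static void main(String[] args) {
        // Hypothetical key built from by/partition field values
        String fields = "L018/128";

        // Old approach: 32-bit Java string hash. With only 2^32 possible
        // values, birthday-bound collisions are expected after roughly
        // 2^16 distinct keys, so clashes like L018/128 vs L017/228 occur
        // in practice.
        int javaHash = fields.hashCode();

        // New approach: a 128-bit MurmurHash3 of the same key, which makes
        // accidental collisions between field-value combinations
        // vanishingly unlikely.
        byte[] murmur = Hashing.murmur3_128()
                .hashString(fields, StandardCharsets.UTF_8)
                .asBytes();

        System.out.println("32-bit Java hash: " + javaHash);
        System.out.println("Murmur hash width: " + murmur.length * 8 + " bits");
    }
}
```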
Pinging @elastic/ml-core (:ml)
In the case of a job re-running from a snapshot, we delete results directly, correct? And we do not rely on the IDs being the same between the two runs?
We do if it's an explicit revert (see x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedJob.java, line 95 at 197d5e7), but not when a job simply resumes from its last model snapshot.
This latter scenario will always be the case during a failover from one node to another. It is actually possible for a model plot document to exist that's more recent than the restart time, because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629

So in the worst case we'd persist some model plot documents that get indexed because a bulk request fills up, then the node is killed before the corresponding bucket or data counts documents can be indexed, then the job restarts on a different node, redoes the same bucket, and we end up with duplicate model plot documents for that bucket. This would only be a problem when the old node was running a version before this fix and the new node a version after it. I think it's worth tolerating this unlikely, single-bucket problem to fix the problem of entire partitions never having any model plot.

We do explicit deletes in the case of interim results, so those won't be a problem - see lines 204 to 208 at 5c3dd57.
I think it would be nice to have an ID-length worst-case test where every possible value is at its configurable max. Looking at the maximum value possible for the Murmur hash bytes, "170141183460469231722463931679029329919", which is a 128-bit value whose two 64-bit halves are both Long.MAX_VALUE, its length as UTF-8 bytes is 39 (sketched below). I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.
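A quick sketch of the arithmetic behind that number (my own illustration, not the test added to the PR): a 128-bit hash whose two 64-bit halves are both Long.MAX_VALUE renders as a 39-character decimal string, i.e. 39 UTF-8 bytes:

```java
import java.math.BigInteger;

public class MaxHashLengthSketch {
    public static void main(String[] args) {
        // Worst case: both 64-bit halves of the 128-bit hash are Long.MAX_VALUE.
        BigInteger worstCase = BigInteger.valueOf(Long.MAX_VALUE)
                .shiftLeft(64)
                .or(BigInteger.valueOf(Long.MAX_VALUE));

        String decimal = worstCase.toString();
        // Prints 170141183460469231722463931679029329919 -> 39 bytes
        System.out.println(decimal + " -> " + decimal.length() + " bytes");
    }
}
```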
I added a test and we're under 200 bytes in total in the worst case.