Fix WorkItem.__hash__ #617
Do you happen to know why our tests did not already cover this issue?
There is also no test dedicated to WorkItem, and it seems WorkItems are currently kept inside lists in Coffea, so their hash method isn't used.
Is there any way we could turn the dicts into nested tuples? Or do we think it is not relevant to hash all the metadata?
And, indeed, please add a test!
Even if the
From the perspective of describing the data to be read in and processed, I agree; that makes sense. I was thinking of cases where someone treats corrections or something else as metadata, rather than packing it with the processor. Perhaps we should have something that yells at people if they have some enormous piece of metadata for the dataset?
By "enormous", do you mean something non-hashable?
Not necessarily; string blob data is perfectly well hashable.
Later on, the metadata is attached as a parameter to an awkward array object, so it must be JSON-serializable. One could check at the beginning if that is the case by trying to run
I guess then the question goes back to @jrueb... For what reason are you hashing the WorkItem? Maybe I missed it on the last PR. I'm OK with omitting the user meta if it's not something that's normally done.
Initially I thought it was for the dask worker affinity, but it looks like I hash specific fields there instead (coffea/coffea/processor/executor.py, lines 671 to 677 in 183cbe4).
Personally I would like to use the hash, or rather the ability to store the items in a set to search for them in a reasonable amount of time, in an Executor class I made. |
OK, if it's only those specific fields anyway then that sorta settles it; we can skip the usermeta.
Currently, getting the hash of a `WorkItem` will often result in a `TypeError`, even though we force the generation of a `__hash__` method inside `WorkItem` using `unsafe_hash=True`. This is because `WorkItem.usermeta` is a dict when it's not None, and a dict is not hashable.

The solution is to exclude `usermeta` from comparison, which in turn also excludes it from the hashing.
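
To illustrate the behavior described above, here is a minimal sketch. It does not use the real `WorkItem`: the classes `BrokenItem` and `FixedItem`, their `dataset` and `filename` fields, and the helper `is_hashable` are hypothetical stand-ins, assuming only that `WorkItem` is a dataclass declared with `unsafe_hash=True` and an optional dict field named `usermeta`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(unsafe_hash=True)
class BrokenItem:
    # Hypothetical stand-in for WorkItem before the fix:
    # usermeta takes part in comparison, and therefore in hashing.
    dataset: str
    filename: str
    usermeta: Optional[Dict] = None


@dataclass(unsafe_hash=True)
class FixedItem:
    # Hypothetical stand-in for WorkItem after the fix:
    # compare=False removes usermeta from __eq__ and from __hash__.
    dataset: str
    filename: str
    usermeta: Optional[Dict] = field(default=None, compare=False)


def is_hashable(obj) -> bool:
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


broken = BrokenItem("ds", "f.root", {"a": 1})
fixed = FixedItem("ds", "f.root", {"a": 1})

print(is_hashable(broken))                       # dict field is hashed
print(is_hashable(BrokenItem("ds", "f.root")))   # usermeta is None
print(is_hashable(fixed))                        # usermeta is skipped
print(fixed in {FixedItem("ds", "f.root")})      # set lookup
```

One consequence of this design: two items that differ only in `usermeta` compare equal and hash identically, so a set keeps only one of them.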