Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WorkItem.__hash__ #617

Merged
merged 2 commits into from
Nov 22, 2021
Merged

Fix WorkItem.__hash__ #617

merged 2 commits into from
Nov 22, 2021

Conversation

jrueb
Copy link
Contributor

@jrueb jrueb commented Nov 15, 2021

Currently getting the hash of a WorkItem will often result in a TypeError, even though we force the generation of a __hash__ method inside WorkItem using unsafe_hash=True. This is because WorkItem.usermeta is a dict when it's not None, and a dict is not hashable.

The solution is to exclude the usermeta from comparison, which in turn will also exclude it from the hashing.

@nsmith-
Copy link
Member

nsmith- commented Nov 15, 2021

Do you happen to know why our tests did not already cover this issue?

@jrueb
Copy link
Contributor Author

jrueb commented Nov 15, 2021

There is also no test dedicated to test WorkItem and it seems currently WorkItems are normally kept inside lists in Coffea, so their hash method isn't used.

@lgray
Copy link
Collaborator

lgray commented Nov 15, 2021

Is there any way we could turn the dicts into nested tuples? or do we think it is not relevant to hash all the metadata?

@lgray
Copy link
Collaborator

lgray commented Nov 15, 2021

And, indeed, please add a test!

@jrueb
Copy link
Contributor Author

jrueb commented Nov 17, 2021

Is there any way we could turn the dicts into nested tuples? or do we think it is not relevant to hash all the metadata?

Even if the usermeta is saved as tuples, probably the issue would arise again once somebody decides to put a value that is some non-hashable object inside usermeta.
Does a work item really become different just by having a different user meta data? If everything else is the same, it still defines the same task to be done, so to me excluding usermeta from hash and eq makes sense.

@lgray
Copy link
Collaborator

lgray commented Nov 17, 2021

From the perspective of describing the data to be read in and processed I agree, that makes sense.

I was thinking to cases where some one is considering corrections or something else as metadata, rather than packing it with the processor. Perhaps we should have something that yells at people if they have some enormous piece of metadata for the dataset?

@jrueb
Copy link
Contributor Author

jrueb commented Nov 17, 2021

With "enormous" do you mean something non-hashable?

@lgray
Copy link
Collaborator

lgray commented Nov 17, 2021

Not necessarily, string blob data is perfectly well hashable.

@nsmith-
Copy link
Member

nsmith- commented Nov 17, 2021

Later on, the metadata is attached as a parameter to an awkward array object, so it must be json-serializable. One could check at the beginning if that is the case by trying to run json.dumps(metadata) and if it fails emit an exception. Nevertheless, I think at least in the current design the metadata should not be hashed to define the work item.

@lgray
Copy link
Collaborator

lgray commented Nov 17, 2021

I guess then the question goes back to @jrueb... For what reason are you hashing the WorkItem? Maybe I missed it on the last PR.

I'm OK with omitting the user meta if it's not something that's normally done.

@nsmith-
Copy link
Member

nsmith- commented Nov 17, 2021

Initially I thought it was for the dask worker affinity, but looks like there I hash specific fields instead:

def belongsto(heavy_input, workerindex, item):
if heavy_input is not None:
item = item[0]
hashed = _hash(
(item.fileuuid, item.treename, item.entrystart, item.entrystop)
)
return hashed % len(workers) == workerindex

@jrueb
Copy link
Contributor Author

jrueb commented Nov 18, 2021

I guess then the question goes back to @jrueb... For what reason are you hashing the WorkItem? Maybe I missed it on the last PR.

Personally I would like to use the hash, or rather the ability to store the items in a set to search for them in a reasonable amount of time, in an Executor class I made.

@lgray
Copy link
Collaborator

lgray commented Nov 22, 2021

OK if it's only those specific fields any way then that sorta settles it, we can skip the usermeta.
I suppose folks can complain later if it throws a wrench in their designs.

@lgray lgray merged commit dde73df into CoffeaTeam:master Nov 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants