fingerprinting base case handles empty __dict__ #1243
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #1242; copying reply from the issues thread
Problem
Using the cache with these two functions would create a collision and both would return
pd.Timestamp("2021-01-01")
on the 2nd execution (i.e., when retrieving values from cache)Solution
The source of the bug is an oddity of
pandas
. Some objects have an empty__dict__
attached. A one line condition check now properly displays the intended warning messages.When encountering this case, Hamilton gives a random UUID instead of hashing the value. Caching can still work because the cache key is
NODE_NAME-CODE_VERSION-DATA_VERSION
where data version is now a random UUIDInvestigation
dr.cache.code_versions
show thatfirst()
andsecond()
are produce differentcode_version
hashdr.cache.data_versions
produce the samedata_version
hash (source of the bug)hamilton.caching.fingerprinting
hash_pandas_obj()
expectspd.Series
orpd.DataFrame
and doesn't handlepd.Timestamp
(it would handle a series ofpd.Timestamp
values without collisions though)data_versions
match the result of the default implementationhash_value()
and falls underhash_value(pd.Timestamp(...).__dict__)
pd.Timestamp(...).__dict__ == {}
, which is an odd behavior by pandas (it shouldn't be defined instead of empty)hash_value()
handles objects without a__dict__
(i.e., uses slots), but doesn't account for empty.__dict__
attributeside notes