Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache key collision for simple functions in the same module #1242

Closed
poldpold opened this issue Nov 25, 2024 · 4 comments · Fixed by #1243
Closed

Cache key collision for simple functions in the same module #1242

poldpold opened this issue Nov 25, 2024 · 4 comments · Fixed by #1243
Labels
triage label for issues that need to be triaged.

Comments

@poldpold
Copy link

Current behavior

It appears that when activating caching for two functions with the same signature, in the same file, when those functions are similar enough, they are mapped onto the same cache key. This can be seen in the cache directory, where only one cache file is created, and on the rerun of the DAG where both nodes receive the same cached value.

Stack Traces

There is no crash.

Steps to replicate behavior

Create and run a jupyter notebook with the following cells. (the issue is also present in actual modules, outside jupyter)

%load_ext hamilton.plugins.jupyter_magic
%%cell_to_module -m test_module --display --rebuild-drivers

from hamilton.function_modifiers import cache
import pandas as pd


def first() -> pd.Timestamp:
    return pd.Timestamp("2021-01-01")

def second() -> pd.Timestamp:
    return pd.Timestamp("2021-01-02")
from hamilton import driver
import test_module

dr = (
    driver
    .Builder()
    .with_config({})
    .with_modules(test_module)
    .with_cache(path=".")
    .build()
)
dr.execute(final_vars=["first", "second"])
>> {'first': Timestamp('2021-01-01 00:00:00'),
>> 'second': Timestamp('2021-01-02 00:00:00')}
dr.execute(final_vars=["first", "second"])
>> {'first': Timestamp('2021-01-01 00:00:00'),
>>  'second': Timestamp('2021-01-01 00:00:00')}

As one can see, one the rerun the second timestamp gets the cache value of the first variable, as if function name was not part of the cache key.

Library & System Information

python=3.11.8, sf-hamilton=1.83.2

Expected behavior

I would expect the result

>> {'first': Timestamp('2021-01-01 00:00:00'),
>>  'second': Timestamp('2021-01-02 00:00:00')}
@poldpold poldpold added the triage label for issues that need to be triaged. label Nov 25, 2024
@zilto
Copy link
Collaborator

zilto commented Nov 25, 2024

Will investigate today! The issue is more likely related to the handling of pd.Timestamp objects than the functions first() and second() per se.

@zilto
Copy link
Collaborator

zilto commented Nov 25, 2024

Investigation

  • the reproduction worked on my machine
  • dr.cache.code_versions show that first() and second() are produce different code_version hash
  • dr.cache.data_versions produce the same data_version hash (source of the bug)
  • Lets look at the per-type hashing functions in hamilton.caching.fingerprinting
  • hash_pandas_obj() expects pd.Series or pd.DataFrame and doesn't handle pd.Timestamp (it would handle a series of pd.Timestamp values without collisions though)
  • The data_versions match the result of the default implementation hash_value() and falls under hash_value(pd.Timestamp(...).__dict__)
  • It seems that pd.Timestamp(...).__dict__ == {}, which is an odd behavior by pandas (it shouldn't be defined instead of empty)
  • Currently, hash_value() handles objects without a __dict__ (i.e., uses slots), but doesn't account for empty .__dict__ attribute

Solution

Change the check for the base case in hash_value() to handle .__dict__ that are empty. A one line condition check now properly display the intended warning messages. Instead of hashing the value, a random UUID is provided. Caching can still work because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION where data version is now a random UUID

image

side notes

  • Warnings are only shown when hashing the value, meaning it's typically only displayed on first execution (value is retrieved on subsequent executions). This is a desirable behavior.
  • it's important to clear the on-disk cache when debugging this; can use in-memory caching for simplicity

@poldpold
Copy link
Author

Thank you @zilto and @skrawcz for this fix and also #1240. Would you have any visibility on when the new version will make it to conda-forge?

@skrawcz
Copy link
Collaborator

skrawcz commented Nov 26, 2024

Thank you @zilto and @skrawcz for this fix and also #1240. Would you have any visibility on when the new version will make it to conda-forge?

We don't trigger anything. But judging by timestamps it seems like it might take 48 hours for it to show up there. I assume you're referring to https://anaconda.org/conda-forge/sf-hamilton/files ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage label for issues that need to be triaged.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants