Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fingerprinting base case handles empty __dict__ #1243

Merged
merged 1 commit into from
Nov 26, 2024

Conversation

zilto
Copy link
Collaborator

@zilto zilto commented Nov 25, 2024

Fixes #1242; copying reply from the issues thread

Problem

Using the cache with these two functions would create a collision and both would return pd.Timestamp("2021-01-01") on the 2nd execution (i.e., when retrieving values from cache)

import pandas as pd

def first() -> pd.Timestamp:
  return pd.Timestamp("2021-01-01")
  
def second() -> pd.Timestamp:
  return pd.Timestamp("2021-01-02")

Solution

The source of the bug is an oddity of pandas. Some objects have an empty __dict__ attached. A one line condition check now properly displays the intended warning messages.

When encountering this case, Hamilton gives a random UUID instead of hashing the value. Caching can still work because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION where data version is now a random UUID

image

Investigation

  • the reproduction worked on my machine
  • dr.cache.code_versions show that first() and second() are produce different code_version hash
  • dr.cache.data_versions produce the same data_version hash (source of the bug)
  • Lets look at the per-type hashing functions in hamilton.caching.fingerprinting
  • hash_pandas_obj() expects pd.Series or pd.DataFrame and doesn't handle pd.Timestamp (it would handle a series of pd.Timestamp values without collisions though)
  • The data_versions match the result of the default implementation hash_value() and falls under hash_value(pd.Timestamp(...).__dict__)
  • It seems that pd.Timestamp(...).__dict__ == {}, which is an odd behavior by pandas (it shouldn't be defined instead of empty)
  • Currently, hash_value() handles objects without a __dict__ (i.e., uses slots), but doesn't account for empty .__dict__ attribute

side notes

  • Warnings are only shown when hashing the value, meaning it's typically only displayed on first execution (value is retrieved on subsequent executions). This is a desirable behavior.
  • it's important to clear the on-disk cache when debugging this; can use in-memory caching for simplicity

@zilto zilto added bug Something isn't working caching labels Nov 25, 2024
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to c701383 in 5 seconds

More details
  • Looked at 50 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. tests/caching/test_fingerprinting.py:34
  • Draft comment:
    The test test_empty_dict_attr_is_unhashable correctly verifies that objects with an empty __dict__ are considered unhashable, aligning with the PR's intent to handle such cases. This ensures that the cache key collision issue is resolved.
  • Reason this comment was not posted:
    Confidence changes required: 0%
    The PR introduces a change in the hash_value function to handle objects with an empty __dict__. This change is reflected in the test test_empty_dict_attr_is_unhashable. The test ensures that objects with an empty __dict__ are considered unhashable, which aligns with the PR's intent to fix the cache key collision issue.
2. hamilton/caching/fingerprinting.py:78
  • Draft comment:
    Consider adding documentation in docs/ to explain the behavior of hash_value when encountering objects with an empty __dict__. This will help users understand why certain objects might be skipped during hashing.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The function hash_value is responsible for hashing objects, and it uses the __dict__ attribute to do so. However, the function does not handle cases where the __dict__ is empty, which can lead to issues with certain types like pd.Timestamp. The PR addresses this by checking if __dict__ is empty and skipping hashing in such cases. This change should be documented in the sphinx documentation under docs/ to inform users about this behavior.

Workflow ID: wflow_mJ8rUtPcjyQZ8hGb


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@skrawcz skrawcz merged commit 47a5146 into main Nov 26, 2024
24 checks passed
@skrawcz skrawcz deleted the fix/fingerprint-pd-timestamp branch November 26, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working caching
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache key collision for simple functions in the same module
3 participants