
Add caching dependency #479

Merged
merged 2 commits into main from add-component-dependency-caching on Oct 4, 2023
Conversation

PhilippeMoussalli
Contributor

The currently implemented caching does not take the previous component's cache key into account when estimating a component's cache key. This can be an issue in some scenarios, for example:

1st run: load_from_hub (input_dataset=x) -> download_images
2nd run: load_from_hub (input_dataset=y) -> download_images

In the second run, the download_images component won't execute and will be resolved from the cache, since it may have the same input arguments as in the first run. However, the two runs start from different input datasets, so the second run would actually end up loading the dataset cached from the 1st run.

This PR addresses this by including the hash key of the component's dependency in the component cache key estimation.
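The idea can be sketched as follows (a minimal illustration with hypothetical names and arguments, not Fondant's actual implementation): each component's cache key is a hash over its own arguments plus the cache key of the component it depends on, so a change anywhere upstream propagates down the chain.

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(arguments: dict, previous_cache_key: Optional[str] = None) -> str:
    """Hash a component's arguments together with its dependency's cache key.

    Including `previous_cache_key` means that changing an upstream component
    (e.g. a different input_dataset for load_from_hub) also invalidates the
    cache of every downstream component.
    """
    payload = {"arguments": arguments, "previous_cache_key": previous_cache_key}
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.md5(serialized.encode()).hexdigest()


# 1st run: load_from_hub (input_dataset=x) -> download_images
key_load_x = get_component_cache_key({"input_dataset": "x"})
key_download_x = get_component_cache_key({"n_connections": 10}, key_load_x)

# 2nd run: same download_images arguments, but a different input dataset
key_load_y = get_component_cache_key({"input_dataset": "y"})
key_download_y = get_component_cache_key({"n_connections": 10}, key_load_y)

# download_images now gets a different cache key, so it re-executes
assert key_download_x != key_download_y
```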

@RobbeSneyders
Member

Are we using the input_manifest_path to calculate the hash key as well? I assumed we were, and I would expect it to change whenever the hash key of the previous component changes.

@PhilippeMoussalli
Contributor Author

> Are we using the input_manifest_path to calculate the hash key as well? I assumed we were, and I would expect it to change whenever the hash key of the previous component changes.

We're not using it. The path of the manifest does not change depending on the hash key; it is always resolved to a path that contains the run id of the currently running pipeline.

The save path looks like this at the moment:

save_path_base_path = (
    f"{manifest.base_path}/{manifest.pipeline_name}/{manifest.run_id}/"
    f"{manifest.component_id}/manifest.json"
)

@RobbeSneyders RobbeSneyders left a comment


Ok, thanks for the explanation. I had some incorrect assumptions about how the caching works, but this seems logical. Small comment below.

for component_name, component in pipeline._graph.items():
    component_op = component["fondant_component_op"]

    if component_cache_key is None:

I think we can remove the if/else check here, since we check it inside get_component_cache_key()
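Moving the None handling inside the hash function, as suggested, could look like this (a hypothetical sketch with made-up component arguments, not Fondant's actual code): serializing `None` like any other value means the first component in the graph needs no special-casing at the call site.

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(arguments: dict, previous_component_cache_key: Optional[str] = None) -> str:
    # None is serialized like any other value, so the caller does not need
    # an if/else branch for the first component in the graph.
    payload = json.dumps(
        {"arguments": arguments, "previous": previous_component_cache_key},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode()).hexdigest()


# Hypothetical pipeline graph: the caller can chain keys without a None check.
graph = {
    "load_from_hub": {"input_dataset": "x"},
    "download_images": {"n_connections": 10},
}

component_cache_key = None
for component_name, arguments in graph.items():
    component_cache_key = get_component_cache_key(arguments, component_cache_key)
```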

Contributor Author

ahh yes, good catch

@PhilippeMoussalli PhilippeMoussalli force-pushed the add-component-dependency-caching branch 2 times, most recently from 0307787 to 8358787 on October 3, 2023 13:05
@RobbeSneyders RobbeSneyders merged commit d340b6a into main Oct 4, 2023
10 checks passed
@RobbeSneyders RobbeSneyders deleted the add-component-dependency-caching branch October 4, 2023 06:36
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023