Add caching dependency #479
Conversation
Are we using the …
We're not using it. The path of the manifest does not change depending on the hash key and is always resolved to a path that contains the run id of the current running pipeline. The save path looks like this at the moment: …
Ok, thanks for the explanation. I had some incorrect assumptions about how the caching works, but this seems logical. Small comment below.
src/fondant/compiler.py
Outdated
for component_name, component in pipeline._graph.items():
    component_op = component["fondant_component_op"]
    ...
    if component_cache_key is None:
I think we can remove the if/else check here, since we check it inside get_component_cache_key().
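The suggestion amounts to handling the None case inside the helper itself, so callers don't need their own if/else around it. A minimal sketch of that pattern (hypothetical simplified signature, not Fondant's actual code):

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(
    arguments: dict,
    previous_key: Optional[str] = None,
) -> str:
    """Compute a cache key from the component arguments. The None case for
    previous_key is handled here, so callers never branch on it."""
    payload = {
        "arguments": arguments,
        "previous_key": previous_key,  # json serializes None as null
    }
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


# Caller side: no if/else needed, the helper treats None uniformly.
first_key = get_component_cache_key({"input_dataset": "x"})
second_key = get_component_cache_key({"input_dataset": "x"}, previous_key=first_key)
```

Centralizing the None handling keeps the compiler loop free of duplicated checks.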
ahh yes, good catch
Commits updated: 0307787 → 8358787
The currently implemented caching does not take the previous component's cache key into account when computing a component's cache key. This can be an issue in some scenarios, for example:
1st run: load_from_hub (input_dataset=x) -> download_images
2nd run: load_from_hub (input_dataset=y) -> download_images
In the second run, the download_images component won't be executed and its cached result will be reused, since it may have the same input arguments as in the first run. However, the two runs start from different input datasets, so the second run would actually end up loading the dataset cached from the 1st run.
This PR addresses this by including the hash key of the dependent component in the component cache key estimation.
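The chained cache key described above can be sketched as follows. This is a hypothetical simplification (function name, payload shape, and md5 choice are assumptions, not Fondant's actual implementation), showing why the download_images key now differs between the two example runs:

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(
    component_spec: dict,
    arguments: dict,
    previous_cache_key: Optional[str] = None,
) -> str:
    """Hash the component spec, its arguments, and the cache key of the
    component it depends on, so a change anywhere upstream invalidates
    every downstream component's cache entry."""
    payload = {
        "component_spec": component_spec,
        "arguments": arguments,
        # Chaining the upstream key is the fix described in this PR.
        "previous_cache_key": previous_cache_key,
    }
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


# 1st run: load_from_hub(input_dataset=x) -> download_images
load_key_x = get_component_cache_key({"name": "load_from_hub"}, {"input_dataset": "x"})
download_key_x = get_component_cache_key({"name": "download_images"}, {}, load_key_x)

# 2nd run: load_from_hub(input_dataset=y) -> download_images
load_key_y = get_component_cache_key({"name": "load_from_hub"}, {"input_dataset": "y"})
download_key_y = get_component_cache_key({"name": "download_images"}, {}, load_key_y)

# download_images has identical arguments in both runs, but its cache key
# now differs because the upstream cache key differs, so the stale cached
# dataset from run 1 is not reused in run 2.
assert download_key_x != download_key_y
```

Without the `previous_cache_key` field in the payload, both download_images keys would be identical, reproducing the stale-cache bug from the description.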