
Add caching dependency #479

Merged
merged 2 commits into main from add-component-dependency-caching on Oct 4, 2023
Conversation

PhilippeMoussalli
Contributor

The currently implemented caching does not take the previous component's cache key into account when estimating a component's cache key. This can be an issue in some scenarios, for example:

1st run: load_from_hub (input_dataset=x) -> download_images
2nd run: load_from_hub (input_dataset=y) -> download_images

In the second run, the download_images component won't execute and will be resolved from the cache, since it may have the same input arguments as in the first run. However, the two runs start from different input datasets, so the second run would actually end up loading the dataset cached from the 1st run.

This PR addresses this by including the hash key of the component's dependency in the component cache key estimation.
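The idea can be sketched as follows (a minimal illustration with hypothetical names and arguments, not Fondant's actual implementation): each component's cache key is a hash over its own arguments plus the cache key of the component it depends on, so a change anywhere upstream propagates down the chain.

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(arguments: dict, previous_cache_key: Optional[str] = None) -> str:
    """Hash a component's arguments together with its dependency's cache key.

    Including `previous_cache_key` means that changing an upstream component
    (e.g. a different input_dataset for load_from_hub) also invalidates the
    cache of every downstream component.
    """
    payload = {"arguments": arguments, "previous_cache_key": previous_cache_key}
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.md5(serialized.encode()).hexdigest()


# 1st run: load_from_hub (input_dataset=x) -> download_images
key_load_x = get_component_cache_key({"input_dataset": "x"})
key_download_x = get_component_cache_key({"n_connections": 10}, key_load_x)

# 2nd run: same download_images arguments, but a different input dataset
key_load_y = get_component_cache_key({"input_dataset": "y"})
key_download_y = get_component_cache_key({"n_connections": 10}, key_load_y)

# download_images now gets a different cache key, so it re-executes
assert key_download_x != key_download_y
```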

@RobbeSneyders
Member

Are we using the input_manifest_path to calculate the hash key as well? I assumed we were, and I would expect it to change whenever the hash key of the previous component changes.

@PhilippeMoussalli
Contributor Author

> Are we using the input_manifest_path to calculate the hash key as well? I assumed we were, and I would expect it to change whenever the hash key of the previous component changes.

We're not using it. The path of the manifest does not change depending on the hash key; it is always resolved to a path that contains the run id of the currently running pipeline.

The save path looks like this at the moment:

save_path_base_path = (
    f"{manifest.base_path}/{manifest.pipeline_name}/{manifest.run_id}/"
    f"{manifest.component_id}/manifest.json"
)

@RobbeSneyders RobbeSneyders left a comment


Ok, thanks for the explanation. I had some incorrect assumptions about how the caching works, but this seems logical. Small comment below.

for component_name, component in pipeline._graph.items():
    component_op = component["fondant_component_op"]

    if component_cache_key is None:

I think we can remove the if/else check here, since we check it inside get_component_cache_key()
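Moving the None handling inside the hash function, as suggested, could look like this (a hypothetical sketch with made-up component arguments, not Fondant's actual code): serializing `None` like any other value means the first component in the graph needs no special-casing at the call site.

```python
import hashlib
import json
from typing import Optional


def get_component_cache_key(arguments: dict, previous_component_cache_key: Optional[str] = None) -> str:
    # None is serialized like any other value, so the caller does not need
    # an if/else branch for the first component in the graph.
    payload = json.dumps(
        {"arguments": arguments, "previous": previous_component_cache_key},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode()).hexdigest()


# Hypothetical pipeline graph: the caller can chain keys without a None check.
graph = {
    "load_from_hub": {"input_dataset": "x"},
    "download_images": {"n_connections": 10},
}

component_cache_key = None
for component_name, arguments in graph.items():
    component_cache_key = get_component_cache_key(arguments, component_cache_key)
```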

Contributor Author

ahh yes, good catch

@PhilippeMoussalli PhilippeMoussalli force-pushed the add-component-dependency-caching branch 2 times, most recently from 0307787 to 8358787 on October 3, 2023 13:05
@RobbeSneyders RobbeSneyders merged commit d340b6a into main Oct 4, 2023
10 checks passed
@RobbeSneyders RobbeSneyders deleted the add-component-dependency-caching branch October 4, 2023 06:36
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023