
Add method to estimate caching key #318

Merged
merged 8 commits into main on Aug 21, 2023
Conversation

PhilippeMoussalli
Contributor

Related to #313 #292

The cache key is a unique identifier for the component that will be used to decide whether a component should be executed or not.

This is now estimated at compile time, since it's easier to estimate the Docker image digest outside of the components and pipelines (it requires Docker to be installed and authenticated). The only caveat is that the image digest might change between the time the pipeline is compiled and the time it's run (e.g. changes to a component's image are pushed under the same tag after the run has started). However, I don't see a straightforward way to extract the image digest from within the component itself:

  • Connecting to the Docker API with curl -> requires passing an authentication key
  • Passing the digest as part of an environment variable during the build process -> requires modifying the Docker image
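For reference, a minimal sketch of what this compile-time estimation could look like, assuming the Docker CLI is installed and authenticated against the registry. The function names here are illustrative, not necessarily the PR's actual API:

```python
import hashlib
import json
import subprocess


def get_image_manifest(image_ref: str) -> dict:
    """Fetch an image's manifest from its registry via the Docker CLI.

    Requires Docker to be installed locally and authenticated against
    the registry hosting the image.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def manifest_digest(manifest: dict) -> str:
    """Derive a stable digest by hashing the manifest's sorted JSON form.

    Sorting the keys makes the hash independent of key ordering in the
    manifest returned by the registry.
    """
    sorted_json = json.dumps(manifest, sort_keys=True)
    return hashlib.md5(sorted_json.encode()).hexdigest()
```

The caveat above still applies: this digest is frozen at compile time and can go stale if the tag is re-pushed before the run starts.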

@PhilippeMoussalli PhilippeMoussalli added the Core Core framework label Jul 26, 2023
@PhilippeMoussalli PhilippeMoussalli self-assigned this Jul 26, 2023
@PhilippeMoussalli PhilippeMoussalli linked an issue Jul 26, 2023 that may be closed by this pull request
@GeorgesLorre
Collaborator

In the local runner we often build custom components on the fly. How will we handle this?

@PhilippeMoussalli
Contributor Author

> In the local runner we often build custom components on the fly. How will we handle this?

Missed that one, it's a bit tricky because of two reasons:

  • The digest actually differs from registry to registry; images built locally and remotely will have different digests (https://blog.aquasec.com/docker-image-tags). This means that if you want to resume a pipeline that was run remotely on the local runner (e.g. by specifying a remote base path), caching will not work.
  • Even if we had the same digest, we're estimating the digest before running, while the build only happens at run time with docker-compose up

Given those constraints, and the fact that it's probably more reliable to estimate the digest during runtime for the reasons mentioned in the description, I think an ideal solution would be to have a unique digest that can be retrieved during runtime, either by:

  • Including it as an env variable during build time (the downside is that we have to change the structure of the Dockerfile to add an environment variable)

  • Estimating it internally within the component (might require installing additional packages)

There is still the issue of having a unique digest. Do you have any ideas on how best to approach this?

    Returns:
        dict: The parsed JSON manifest.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
Collaborator

This will make Docker a hard requirement; up until now it was more of a soft requirement for local running. If we do this, we might want to use the Docker Python SDK.

Contributor Author

yeah indeed, I'm also not really happy with this, but I don't think it's possible otherwise (https://stackoverflow.com/questions/56178911/how-to-obtain-docker-image-digest-from-tag).
For now the SDK seems to only support inspecting local images (link). docker manifest inspect seems to be a new command, so it might not have been integrated with the SDK yet.

Perhaps we can use curl, but then it requires knowing the URL of the different registries (link)

    Returns:
        str: The hash value (digest) of the Docker image.
    """
    manifest = self.get_image_manifest(image_ref)
Collaborator

I guess getting the digest of local images is easy, it is the remote part that makes this difficult

Contributor Author

Yeah, it should be easy, but we still have to somehow build the image -> get the digest -> compile.
Or we do it at runtime, in which case we have to figure out how to pass the digest to the image itself. I checked a bit, and it's not available inside the image unless it is passed explicitly as an environment variable. It's a bit of a chicken-and-egg problem.
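For context, the "pass it in at build time" option would amount to something like this hypothetical Dockerfile fragment (the ARG/ENV names are made up for illustration):

```dockerfile
# Hypothetical sketch: make a build-time digest readable inside the
# running container. The digest must already be known before the build,
# which is exactly the chicken-and-egg problem being discussed.
ARG IMAGE_DIGEST=unknown
ENV IMAGE_DIGEST=${IMAGE_DIGEST}
```

It would be built with something like `docker build --build-arg IMAGE_DIGEST=<digest> .`, which only works if the digest is computed before the build.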

    Returns:
        dict: The parsed JSON manifest.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
Collaborator

I guess you need access to the registry to run this code ?

Contributor Author

I think it should be fine for public images

    """
    if isinstance(input_dict, dict):
        return json.dumps(
            {k: sorted_dict_to_json(v) for k, v in sorted(input_dict.items())},
Collaborator

Some recursion, cool!
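Completed for reference, the recursive helper under review could look roughly like this (a sketch; the excerpt above only shows the first branch):

```python
import json


def sorted_dict_to_json(input_dict):
    """Recursively convert a dictionary into a deterministically sorted JSON string.

    Note the quirk of this approach: nested dicts are serialized first, so
    they end up embedded as JSON *strings* inside the parent document. This
    is one reason the reviewers later replaced the helper with a single
    json.dumps(..., sort_keys=True) call.
    """
    if isinstance(input_dict, dict):
        return json.dumps(
            {k: sorted_dict_to_json(v) for k, v in sorted(input_dict.items())},
        )
    return input_dict
```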

@GeorgesLorre
Collaborator

I think overall the code looks fine. I'm just not a big fan of how we have to compare the Docker images, but I have no better proposal.

@RobbeSneyders
Member

Thanks @PhilippeMoussalli!

I see a couple of issues with this:

  • This adds docker as a local dependency
  • This requires local authentication to the registry, which might not be available
  • The image digest might change after compilation

I'm wondering if we shouldn't simplify this further and calculate the hash key based on the image name and tag.

This has the downside that the underlying image might change, but that shouldn't really happen anyway. And as long as we provide a flag to disable the caching, the user can always work around it. We could even automatically disable it for tags that are known to change like 'latest'.

This downside is also already partially present in the current approach (see the image digest might change after compilation above).

But instead, we lose all the other disadvantages.
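The simplification suggested here could be sketched roughly as follows; the function and field names are assumptions for illustration, not the actual implementation:

```python
import hashlib
import json


def get_component_cache_key(image: str, component_spec: dict) -> str:
    """Cache key based on the image name and tag plus the component spec.

    Unlike a registry digest, this needs no local Docker installation or
    registry authentication, at the cost of missing in-place changes to
    an existing tag.
    """
    key_material = {
        "image": image,  # e.g. "ghcr.io/org/component:1.2.3" (illustrative)
        "spec": component_spec,
    }
    # sort_keys makes the key independent of dict ordering in the spec.
    return hashlib.md5(
        json.dumps(key_material, sort_keys=True).encode(),
    ).hexdigest()
```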

@PhilippeMoussalli
Contributor Author

PhilippeMoussalli commented Aug 21, 2023

> Thanks @PhilippeMoussalli!
>
> I see a couple of issues with this:
>
>   • This adds docker as a local dependency
>   • This requires local authentication to the registry, which might not be available
>   • The image digest might change after compilation
>
> I'm wondering if we shouldn't simplify this further and calculate the hash key based on the image name and tag.
>
> This has the downside that the underlying image might change, but that shouldn't really happen anyway. And as long as we provide a flag to disable the caching, the user can always work around it. We could even automatically disable it for tags that are known to change like 'latest'.
>
> This downside is also already partially present in the current approach (see the image digest might change after compilation above).
>
> But instead, we lose all the other disadvantages.

I updated the PR accordingly; the image hash is now estimated as part of the component_spec_hash

Regarding the logic for latest, I would prefer to move this solution to the pipeline_level/component_op level:

At the ComponentOp level we will have a parameter disable_caching that defaults to False. It can be set by the user, and at compile time we check whether the tag is latest and set it to True if so. This seems like a cleaner and more explicit solution than just generating a random hash. It will be tackled in a separate PR
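A sketch of how that compile-time check could look; the function name and tag-parsing details are assumptions, not the final API:

```python
def resolve_disable_caching(image_ref: str, disable_caching: bool = False) -> bool:
    """Disable caching when requested by the user or when the tag is mutable.

    An image reference without an explicit tag implicitly means "latest",
    so it is treated as mutable too.
    """
    # A user-requested opt-out always wins.
    if disable_caching:
        return True
    # The tag is whatever follows the last ":" in the final path segment;
    # a ":" earlier in the reference may belong to a registry port instead.
    name = image_ref.rsplit("/", 1)[-1]
    tag = name.rsplit(":", 1)[-1] if ":" in name else "latest"
    return tag == "latest"
```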

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @PhilippeMoussalli!

Looks good to me, 2 minor comments.

    numbers, booleans, or None).

    Args:
        input_dict: The dictionary to be converted.
Member

Nit: indentation in this docstring.

        A sorted JSON string representing the dictionary.
    """
    if isinstance(input_dict, dict):
        return json.dumps(
Member

FYI, json.dumps offers a sort_keys argument which I think does the same thing :)

Contributor Author

oh nice! the more you know :D
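For the record, a quick check that sort_keys already sorts nested dictionaries recursively, so the hand-rolled recursive helper isn't needed:

```python
import json

nested = {"b": {"z": 1, "a": 2}, "a": [3, {"y": 4, "x": 5}]}

# sort_keys sorts keys at every nesting level, including dicts inside lists.
assert json.dumps(nested, sort_keys=True) == \
    '{"a": [3, {"x": 5, "y": 4}], "b": {"a": 2, "z": 1}}'
```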

Comment on lines 155 to 171
def sorted_dict_to_json(input_dict):
    """Convert a dictionary to a sorted JSON string.

    This function recursively converts nested dictionaries to ensure all dictionaries
    are sorted and their values are JSON-compatible (e.g., lists, dictionaries, strings,
    numbers, booleans, or None).

    Args:
        input_dict: The dictionary to be converted.

    Returns:
        A sorted JSON string representing the dictionary.
    """
    if isinstance(input_dict, dict):
        return json.dumps(input_dict, sort_keys=True)

    return input_dict
Member

I think we can now remove this whole function and replace it with the json.dumps call.

    Returns:
        The hash value (MD5 digest) of the nested dictionary.
    """
    sorted_json_string = sorted_dict_to_json(input_dict)
Member

Suggested change
-    sorted_json_string = sorted_dict_to_json(input_dict)
+    sorted_json_string = json.dumps(input_dict, sort_keys=True)

Contributor Author

done

@RobbeSneyders RobbeSneyders merged commit 9840649 into main Aug 21, 2023
@RobbeSneyders RobbeSneyders deleted the estimate-cache-key branch August 21, 2023 13:42
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Successfully merging this pull request may close these issues.

Estimate cache key per component
3 participants