-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add method to estimate caching key #318
Conversation
In the local runner we often build custom components on the fly. How will we handle this ? |
Missed that one, it's a bit tricky because of two reasons:
Given those constrains and the fact that it's probably more efficient to estimate the digest during runtime based on the reasons mentioned in the description, I think an ideal solution would be to have a unique digest that can be retrieved during runtime. Either by
There is still the issue of having a unique digest, do you have any ideas on how to best approach this? |
9bae1ed
to
e68fc27
Compare
src/fondant/pipeline.py
Outdated
Returns: | ||
dict: The parsed JSON manifest. | ||
""" | ||
cmd = ["docker", "manifest", "inspect", image_ref] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make docker a hard requirement up until now it was more of a soft requirement for local running. If we do this we might want to use the docker python sdk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah indeed, I'm also not really happy with this but I don't think it's possible [link]. (https://stackoverflow.com/questions/56178911/how-to-obtain-docker-image-digest-from-tag).
For now the SDK seems to only support inspecting local images link. The docker manifest inspect
seems to a new command so it might not have been integrated yet with the SDK.
Perhaps we can use curl
but then it requires to know the url of different registries link
src/fondant/pipeline.py
Outdated
Returns: | ||
str: The hash value (digest) of the Docker image. | ||
""" | ||
manifest = self.get_image_manifest(image_ref) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess getting the digest of local images is easy, it is the remote part that makes this difficult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah It should be easy but we still have to somehow build the image -> get the digest -> compile
or we do it at runtime in which case we have to figure out how to pass the digest to the image itself. I checked a bit and it's not available in the image itself unless passed explicitly as an environmental variable. It's a bit of a chicken and an egg problem.
src/fondant/pipeline.py
Outdated
Returns: | ||
dict: The parsed JSON manifest. | ||
""" | ||
cmd = ["docker", "manifest", "inspect", image_ref] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you need access to the registry to run this code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be fine for public images
src/fondant/pipeline.py
Outdated
""" | ||
if isinstance(input_dict, dict): | ||
return json.dumps( | ||
{k: sorted_dict_to_json(v) for k, v in sorted(input_dict.items())}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some recursion, cool!
I think overall the code looks fine. I'm just not a big fan of how we have to compare the docker images but I have no better proposition. |
Thanks @PhilippeMoussalli! I see a couple of issues with this:
I'm wondering if we shouldn't simplify this further and calculate the hash key based on the image name and tag. This has the downside that the underlying image might change, but that shouldn't really happen anyway. And as long as we provide a flag to disable the caching, the user can always work around it. We could even automatically disable it for tags that are known to change like 'latest'. This downside is also already partially present in the current approach (see But instead, we lose all the other disadvantages. |
I updated the PR accordingly, now the image hash is estimated in the Regarding the logic for At the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @PhilippeMoussalli!
Looks good to me, 2 minor comments.
src/fondant/pipeline.py
Outdated
numbers, booleans, or None). | ||
|
||
Args: | ||
input_dict: The dictionary to be converted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indentation in this docstring.
src/fondant/pipeline.py
Outdated
A sorted JSON string representing the dictionary. | ||
""" | ||
if isinstance(input_dict, dict): | ||
return json.dumps( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, json.dumps
offers a sort_keys
argument which I think does the same thing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nice! the more you know :D
src/fondant/pipeline.py
Outdated
def sorted_dict_to_json(input_dict): | ||
"""Convert a dictionary to a sorted JSON string. | ||
|
||
This function recursively converts nested dictionaries to ensure all dictionaries | ||
are sorted and their values are JSON-compatible (e.g., lists, dictionaries, strings, | ||
numbers, booleans, or None). | ||
|
||
Args: | ||
input_dict: The dictionary to be converted. | ||
|
||
Returns: | ||
A sorted JSON string representing the dictionary. | ||
""" | ||
if isinstance(input_dict, dict): | ||
return json.dumps(input_dict, sort_keys=True) | ||
|
||
return input_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can now remove this whole function and replace it with the json.dumps
call.
src/fondant/pipeline.py
Outdated
Returns: | ||
The hash value (MD5 digest) of the nested dictionary. | ||
""" | ||
sorted_json_string = sorted_dict_to_json(input_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorted_json_string = sorted_dict_to_json(input_dict) | |
sorted_json_string = json.dumps(input_dict, sort_keys=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Related to #313 #292
The cache key is a unique identifier for the component that will be used to decide whether a component should be executed or not.
This is now estimated at compile time since it's easier to estimate the docker image digest outside of the component and pipelines (requires
docker
to be installed and authenticated). Only caveat there is that the image digest might change between the time the pipeline is compiled and the time it's run (e.g. image changes to a component is made after starting the run under a similar tag). However, I don't a straightforward way to extract the image digest from within the component itself:curl
-> requires passing an authentication key