
Add method to estimate caching key #318

Merged
merged 8 commits into main on Aug 21, 2023
Conversation

PhilippeMoussalli
Contributor

Related to #313 #292

The cache key is a unique identifier for the component that will be used to decide whether a component should be executed or not.

This is now estimated at compile time, since it's easier to estimate the Docker image digest outside of the components and pipelines (it requires Docker to be installed and authenticated). The only caveat is that the image digest might change between the time the pipeline is compiled and the time it's run (e.g. changes to a component's image are pushed under the same tag after the run has started). However, I don't see a straightforward way to extract the image digest from within the component itself:

  • Connecting to the Docker API with curl -> requires passing an authentication key
  • Passing the digest as part of an environment variable during the build process -> requires modifying the Docker image
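For reference, a minimal sketch of what this compile-time estimation could look like, assuming the Docker CLI is installed and authenticated against the registry. The function names here are illustrative, not necessarily the PR's actual API:

```python
import hashlib
import json
import subprocess


def get_image_manifest(image_ref: str) -> dict:
    """Fetch an image's manifest from its registry via the Docker CLI.

    Requires Docker to be installed locally and authenticated against
    the registry hosting the image.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def manifest_digest(manifest: dict) -> str:
    """Derive a stable digest by hashing the manifest's sorted JSON form.

    Sorting the keys makes the hash independent of key ordering in the
    manifest returned by the registry.
    """
    sorted_json = json.dumps(manifest, sort_keys=True)
    return hashlib.md5(sorted_json.encode()).hexdigest()
```

The caveat above still applies: this digest is frozen at compile time and can go stale if the tag is re-pushed before the run starts.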

@PhilippeMoussalli PhilippeMoussalli added the Core Core framework label Jul 26, 2023
@PhilippeMoussalli PhilippeMoussalli self-assigned this Jul 26, 2023
@PhilippeMoussalli PhilippeMoussalli linked an issue Jul 26, 2023 that may be closed by this pull request
@GeorgesLorre
Collaborator

In the local runner we often build custom components on the fly. How will we handle this?

@PhilippeMoussalli
Contributor Author

> In the local runner we often build custom components on the fly. How will we handle this?

Missed that one, it's a bit tricky because of two reasons:

  • The digest actually differs from registry to registry; images built locally and remotely will have different digests (https://blog.aquasec.com/docker-image-tags). This means that if you want to resume a pipeline that was run remotely on the local runner (e.g. by specifying a remote base path), caching will not work.
  • Even if we had the same digest, we're estimating the digest before running, while the build only happens at run time with docker-compose up

Given those constraints, and the fact that it's probably more reliable to estimate the digest during runtime for the reasons mentioned in the description, I think an ideal solution would be to have a unique digest that can be retrieved during runtime, either by:

  • Including it as an env variable during build time (the downside is that we have to change the structure of the Dockerfile to add an environment variable)

  • Estimating it internally within the component (might require installing additional packages)

There is still the issue of having a unique digest. Do you have any ideas on how best to approach this?

    Returns:
        dict: The parsed JSON manifest.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
Collaborator

This will make Docker a hard requirement; up until now it was more of a soft requirement for local running. If we do this, we might want to use the Docker Python SDK.

Contributor Author

yeah indeed, I'm also not really happy with this, but I don't think it's possible otherwise (https://stackoverflow.com/questions/56178911/how-to-obtain-docker-image-digest-from-tag).
For now the SDK seems to only support inspecting local images (link). docker manifest inspect seems to be a new command, so it might not have been integrated with the SDK yet.

Perhaps we can use curl, but then it requires knowing the URL of the different registries (link)

    Returns:
        str: The hash value (digest) of the Docker image.
    """
    manifest = self.get_image_manifest(image_ref)
Collaborator

I guess getting the digest of local images is easy, it is the remote part that makes this difficult

Contributor Author

Yeah, it should be easy, but we still have to somehow build the image -> get the digest -> compile.
Or we do it at runtime, in which case we have to figure out how to pass the digest to the image itself. I checked a bit, and it's not available inside the image unless it is passed explicitly as an environment variable. It's a bit of a chicken-and-egg problem.
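For context, the "pass it in at build time" option would amount to something like this hypothetical Dockerfile fragment (the ARG/ENV names are made up for illustration):

```dockerfile
# Hypothetical sketch: make a build-time digest readable inside the
# running container. The digest must already be known before the build,
# which is exactly the chicken-and-egg problem being discussed.
ARG IMAGE_DIGEST=unknown
ENV IMAGE_DIGEST=${IMAGE_DIGEST}
```

It would be built with something like `docker build --build-arg IMAGE_DIGEST=<digest> .`, which only works if the digest is computed before the build.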

    Returns:
        dict: The parsed JSON manifest.
    """
    cmd = ["docker", "manifest", "inspect", image_ref]
Collaborator

I guess you need access to the registry to run this code ?

Contributor Author

I think it should be fine for public images

    """
    if isinstance(input_dict, dict):
        return json.dumps(
            {k: sorted_dict_to_json(v) for k, v in sorted(input_dict.items())},
Collaborator

Some recursion, cool!
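Completed for reference, the recursive helper under review could look roughly like this (a sketch; the excerpt above only shows the first branch):

```python
import json


def sorted_dict_to_json(input_dict):
    """Recursively convert a dictionary into a deterministically sorted JSON string.

    Note the quirk of this approach: nested dicts are serialized first, so
    they end up embedded as JSON *strings* inside the parent document. This
    is one reason the reviewers later replaced the helper with a single
    json.dumps(..., sort_keys=True) call.
    """
    if isinstance(input_dict, dict):
        return json.dumps(
            {k: sorted_dict_to_json(v) for k, v in sorted(input_dict.items())},
        )
    return input_dict
```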

@GeorgesLorre
Collaborator

I think overall the code looks fine. I'm just not a big fan of how we have to compare the Docker images, but I have no better proposal.

@RobbeSneyders
Member

Thanks @PhilippeMoussalli!

I see a couple of issues with this:

  • This adds docker as a local dependency
  • This requires local authentication to the registry, which might not be available
  • The image digest might change after compilation

I'm wondering if we shouldn't simplify this further and calculate the hash key based on the image name and tag.

This has the downside that the underlying image might change, but that shouldn't really happen anyway. And as long as we provide a flag to disable the caching, the user can always work around it. We could even automatically disable it for tags that are known to change like 'latest'.

This downside is also already partially present in the current approach (see the image digest might change after compilation above).

But instead, we lose all the other disadvantages.
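The simplification suggested here could be sketched roughly as follows; the function and field names are assumptions for illustration, not the actual implementation:

```python
import hashlib
import json


def get_component_cache_key(image: str, component_spec: dict) -> str:
    """Cache key based on the image name and tag plus the component spec.

    Unlike a registry digest, this needs no local Docker installation or
    registry authentication, at the cost of missing in-place changes to
    an existing tag.
    """
    key_material = {
        "image": image,  # e.g. "ghcr.io/org/component:1.2.3" (illustrative)
        "spec": component_spec,
    }
    # sort_keys makes the key independent of dict ordering in the spec.
    return hashlib.md5(
        json.dumps(key_material, sort_keys=True).encode(),
    ).hexdigest()
```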

@PhilippeMoussalli
Contributor Author

PhilippeMoussalli commented Aug 21, 2023

> Thanks @PhilippeMoussalli!
>
> I see a couple of issues with this:
>
>   • This adds docker as a local dependency
>   • This requires local authentication to the registry, which might not be available
>   • The image digest might change after compilation
>
> I'm wondering if we shouldn't simplify this further and calculate the hash key based on the image name and tag.
>
> This has the downside that the underlying image might change, but that shouldn't really happen anyway. And as long as we provide a flag to disable the caching, the user can always work around it. We could even automatically disable it for tags that are known to change like 'latest'.
>
> This downside is also already partially present in the current approach (see the image digest might change after compilation above).
>
> But instead, we lose all the other disadvantages.

I updated the PR accordingly; the image hash is now estimated as part of the component_spec_hash

Regarding the logic for latest, I would prefer to move this solution to the pipeline_level/component_op level:

At the ComponentOp level we will have a parameter disable_caching that defaults to False. It can be set by the user, and at compile time we check whether the tag is latest and set it to True if so. This seems like a cleaner and more explicit solution than just generating a random hash. It will be tackled in a separate PR
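A sketch of how that compile-time check could look; the function name and tag-parsing details are assumptions, not the final API:

```python
def resolve_disable_caching(image_ref: str, disable_caching: bool = False) -> bool:
    """Disable caching when requested by the user or when the tag is mutable.

    An image reference without an explicit tag implicitly means "latest",
    so it is treated as mutable too.
    """
    # A user-requested opt-out always wins.
    if disable_caching:
        return True
    # The tag is whatever follows the last ":" in the final path segment;
    # a ":" earlier in the reference may belong to a registry port instead.
    name = image_ref.rsplit("/", 1)[-1]
    tag = name.rsplit(":", 1)[-1] if ":" in name else "latest"
    return tag == "latest"
```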

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @PhilippeMoussalli!

Looks good to me, 2 minor comments.

    numbers, booleans, or None).

    Args:
        input_dict: The dictionary to be converted.
Member

Nit: indentation in this docstring.

        A sorted JSON string representing the dictionary.
    """
    if isinstance(input_dict, dict):
        return json.dumps(
Member

FYI, json.dumps offers a sort_keys argument which I think does the same thing :)

Contributor Author

oh nice! the more you know :D
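For the record, a quick check that sort_keys already sorts nested dictionaries recursively, so the hand-rolled recursive helper isn't needed:

```python
import json

nested = {"b": {"z": 1, "a": 2}, "a": [3, {"y": 4, "x": 5}]}

# sort_keys sorts keys at every nesting level, including dicts inside lists.
assert json.dumps(nested, sort_keys=True) == \
    '{"a": [3, {"x": 5, "y": 4}], "b": {"a": 2, "z": 1}}'
```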

Comment on lines 155 to 171
def sorted_dict_to_json(input_dict):
    """Convert a dictionary to a sorted JSON string.

    This function recursively converts nested dictionaries to ensure all dictionaries
    are sorted and their values are JSON-compatible (e.g., lists, dictionaries, strings,
    numbers, booleans, or None).

    Args:
        input_dict: The dictionary to be converted.

    Returns:
        A sorted JSON string representing the dictionary.
    """
    if isinstance(input_dict, dict):
        return json.dumps(input_dict, sort_keys=True)

    return input_dict
Member

I think we can now remove this whole function and replace it with the json.dumps call.

    Returns:
        The hash value (MD5 digest) of the nested dictionary.
    """
    sorted_json_string = sorted_dict_to_json(input_dict)
Member

Suggested change
-    sorted_json_string = sorted_dict_to_json(input_dict)
+    sorted_json_string = json.dumps(input_dict, sort_keys=True)

Contributor Author

done

@RobbeSneyders RobbeSneyders merged commit 9840649 into main Aug 21, 2023
@RobbeSneyders RobbeSneyders deleted the estimate-cache-key branch August 21, 2023 13:42
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Successfully merging this pull request may close these issues.

Estimate cache key per component
3 participants