Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source: imageblob source implementation #4286

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tonistiigi
Copy link
Member

Image blob source in LLB allows addressing a single blob from a container image registry. The difference from the image source is that image source needs to point to a manifest that internally points to an array of layer blobs that are all extracted on top of each other to form a root FS. Contrary, image blob points to a single blob that is not extracted but downloaded as a single file into an empty snapshot, similarily how the HTTP source works.

The main use case for this source is to pin snapshots of HTTP URLs, upload the downloaded blob into container registry, and then use a source policy to map a HTTP URL (whose content might be changed) to the copy of the source as image blob to ensure immutability.

@tonistiigi
Copy link
Member Author

Relates to #4259

@AkihiroSuda Maybe we should call this artifact, although it is already overused term and may lead to even more confusion. Lmk if you have any opinions on the pURL etc.

Image blob source in LLB allows addressing a single blob
from a container image registry. The difference from the image
source is that image source needs to point to a manifest that
internally points to an array of layer blobs that are all extracted
on top of each other to form a root FS. Contrary, image blob
points to a single blob that is not extracted but downloaded
as a single file into an empty snapshot, similarily how
the HTTP source works.

The main use case for this source is to pin snapshots of
HTTP URLs, upload the downloaded blob into container registry,
and then use a source policy to map a HTTP URL (whose content
might be changed) to the copy of the source as image blob
to ensure immutability.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@@ -83,6 +83,22 @@ func slsaMaterials(srcs Sources) ([]slsa.ProvenanceMaterial, error) {
out = append(out, material)
}

for _, s := range srcs.ImageBlobs {
uri, err := purl.RefToPURL("docker-blob", s.Ref, nil)
Copy link
Member

@neersighted neersighted Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surfacing from Slack:

Am I correct in that you’re inventing your own docker-blob PURL scheme?
I wouldn't invent a new PURL ecosystem; they're well-defined in https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst; pkg:docker/ with a specifier (e.g. ?buildkit_type=blob) makes more sense to me.

@tonistiigi

I don’t know what else to put there. I’m extending the docker scheme because it is the same protocol. But I already have a problem that I want to be able to point to a local OCI layout this way as well that would need another name.


I'd like to suggest some alternatives as I don't think inventing our own scheme outside the PURL spec is a good idea:

# Embedding the original source as a URL + checksum:
pkg:generic/openssl@1.1.10g?download_url=https://openssl.org/source/openssl-1.1.0g.tar.gz&checksum=sha256:de4d501267da&buildkit_blob=docker.io/mynamespace/myrepo@sha256:...
# Extending the existing docker scheme:
pkg:docker/mynamespace/myrepo@sha256%3A244fd47e07d10?ref_type=blob

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tonis and I discussed this further; what likely makes more sense is passing the provenance all the way through (so, use the existing URL and hash even when doing a "replay" from "registry mirrored inputs"; and have some BuildKit-specific sourcepolicy-map file that records what verifiable mirrored input was used for the specific build).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the PURL, we should be good to just reference the blob as in the second example:

pkg:docker/mynamespace/myrepo@sha256%3A244fd47e07d10?ref_type=blob

I don't even think we would need to add a ref_type, I can't see a good reason for why we'd need that. It doesn't seem strictly necessary to distinguish between the docker-blob and docker types, though maybe there's something I'm missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's mostly because "try a digest as a manifest and fall back to trying it as a blob" isn't an existing behavior and doesn't really make sense in the general case; there are separate APIs to get this content (for better and for worse) and having a hint makes it more clear what is actually expected/prevents fragmentation among tools that return a 404 because they didn't fall back to blob.

This is all mostly hypothetical as this is intended to be an internal implementation detail of "replayable" builds, but nonetheless PURLs are meant to be interoperable and commonly understood...

If we use the latter type, we should probably PR it back to the PURL spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/package-url/purl-spec/blob/5b15089c476d35b7facccaee47357df4f2b47fcb/PURL-TYPES.rst?plain=1#L211-L224

docker
------
``docker`` for Docker images:

- The default repository is ``https://hub.docker.com``.
- The ``namespace`` is the registry/user/organization if present.
- The version should be the image id sha256 or a tag. Since tags can be moved,
  a sha256 image id is preferred.
- Examples::

      pkg:docker/cassandra@latest
      pkg:docker/smartentry/debian@dc437cc87d10
      pkg:docker/customer/dockerimage@sha256%3A244fd47e07d10?repository_url=gcr.io

https://github.com/package-url/purl-spec/blob/5b15089c476d35b7facccaee47357df4f2b47fcb/PURL-TYPES.rst?plain=1#L416-L444

oci
------------
``oci`` for all artifacts stored in registries that conform to the
`OCI Distribution Specification <https://github.com/opencontainers/distribution-spec>`_,
including container images built by Docker and others:

- There is no canonical package repository for OCI artifacts. Therefore
  ``oci`` purls must be registry agnostic by default. To specify the repository,
  provide a ``repository_url`` value.
- OCI purls do not contain a ``namespace``, although, ``repository_url`` may
  contain a namespace as part of the physical location of the package.
- The ``name`` is not case sensitive and must be lowercased. The name is the
  last fragment of the repository name. For example if the repository
  name is ``library/debian`` then the ``name`` is ``debian``.
- The ``version`` is the ``sha256:hex_encoded_lowercase_digest`` of the
  artifact and is required to uniquely identify the artifact.
- Optional qualifiers may include:

  - ``arch``: key for a package architecture, when relevant.
  - ``repository_url``: A repository URL where the artifact may be found, but not
    intended as the only location. This value is encouraged to identify a
    location the content may be fetched.
  - ``tag``: artifact tag that may have been associated with the digest at the time.
- Examples::

      pkg:oci/debian@sha256%3A244fd47e07d10?repository_url=docker.io/library/debian&arch=amd64&tag=latest
      pkg:oci/debian@sha256%3A244fd47e07d10?repository_url=ghcr.io/debian&tag=bullseye
      pkg:oci/static@sha256%3A244fd47e07d10?repository_url=gcr.io/distroless/static&tag=latest
      pkg:oci/hello-wasm@sha256%3A244fd47e07d10?tag=v1

Since a blob is not an "image", using oci type may make more sense.

?ref_type=blob doesn't seem needed.

HTTPSScheme = "https"
OCIScheme = "oci-layout"
DockerImageScheme = "docker-image"
DockerImageBlobScheme = "docker-image-blob"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surfacing from Slack:

I think that docker-image-blob:// feels too generic; it seems like something more specific like buildkit-pined-artifact:// or something might be more straightforward, even if this is an implementation detail/not user facing, as you said.

@tonistiigi

This doesn’t say anything about the transport/storage. All the current source names define that.
But I already have a problem that I want to be able to point to a local OCI layout this way as well that would need another name.


Based on the above, what about docker-image+blob and oci-layout+blob for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oci-layout+blob

I'd definitely be interested in having this as well - making sure we can load them from disk makes this super useful for me. I don't think we should have both oci-layout and docker-image functionality merged into a new docker-image-blob, so the + sounds like a neat solution to me 🎉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like it's worth bikeshedding this too much: either docker-image+blob/oci-layout+blob or docker-image-blob/oci-layout-blob would work for me.

The + implies to me shared code / similar functionality to the base, which isn't quite true though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is though, from the perspective of a user: both come from a registry/an OCI bundle.

@tonistiigi tonistiigi marked this pull request as draft October 1, 2023 02:06
type ImageBlobIdentifier struct {
Reference reference.Spec
RecordType client.UsageRecordType
Filename string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity - might we want the ability to unpack a directory from an image blob?

For example, this works great for replacing sources that result in a single file, like HTTP, but what about for Git or Local? I guess we care less about Git, but there are some interesting use cases for capturing the Local directory, since it may not be reproducible between users (depending on the build process).

I suppose you would just be able to do this by having a single layered image with an empty config (though this would require an extra level of indirection).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better would be for the policy to provide multi-vertex LLB instead of single op in this case. In that case, you could pair it with a FileOp that does the extraction.

I don't really care about local much because there is no checksum that tracks locals anyway. Git is interesting but it would really be only valuable if we can validate the original checksum based on the commit sha. If we have Git source with sha1:abc and then some targz blob that we claim is equivalent with targz checksum sha256:def there is no way to validate it either visually or by buildkit. So it looks like if we want to store snapshot of git source outside of the real git repository we need some special way to import it that still validates the original commit sha (and makes sure there are no extra files etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants