Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support WithProgress for remote Writes #967

Merged
merged 1 commit into from
Mar 22, 2021
Merged

Conversation

imjasonh
Copy link
Collaborator

Fixes #952

Shoutout to @DennisDenuto who paired with me on this 🎉

  • supported in remote.WriteLayer, remote.Write, remote.WriteIndex, remote.MultiWrite
  • given update chan is closed when write completes
  • an error is sent along the chan if there are any non-temporary errors uploading
  • if a layer already exists or is mounted, progress updates immediately to account for that
  • if layer upload fails and is retried, progress goes backward and goes back up

d7fe938 demonstrates how easy it is to consume this channel, by hacking in an upload progress bar for crane cp (not part of this PR)

Fixes #852 as well as a bonus ⏩

@codecov-io
Copy link

codecov-io commented Mar 16, 2021

Codecov Report

Merging #967 (7429ced) into main (4a92f6c) will decrease coverage by 0.50%.
The diff coverage is 58.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #967      +/-   ##
==========================================
- Coverage   75.19%   74.68%   -0.51%     
==========================================
  Files         107      107              
  Lines        4837     4994     +157     
==========================================
+ Hits         3637     3730      +93     
- Misses        669      710      +41     
- Partials      531      554      +23     
Impacted Files Coverage Δ
pkg/v1/remote/write.go 62.87% <57.85%> (-1.44%) ⬇️
pkg/v1/remote/multi_write.go 55.97% <58.33%> (-1.05%) ⬇️
pkg/v1/remote/options.go 59.18% <100.00%> (+2.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a92f6c...7429ced. Read the comment docs.

- supported in WriteLayer, Write, WriteIndex, MultiWrite
- chan is closed when write completes
- an error is sent along the chan if there are any non-temporary errors uploading
- if a layer already exists or is mounted, progress updates immediately to account for that
- if layer upload fails and is retried, progress goes backward and goes back up
@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

I was just hunting for something like this (especially after having put together the tarball one at #675 :-) . I was stumped, as I did not see it in godoc or pkg.go.dev, and was going to start trying to think about doing it, when I found this PR. Apparently godoc only takes the latest semver release (if one exists).

Does this send v1.Update on the channel:

type Update struct {
	Total    int64
	Complete int64
	Error    error
}

so it has just those fields? Does it calculate all of the bytes before sending, across all of the manifests and layers (e.g. remote.Write() or remote.WriteIndex() (it looks like it from skimming the source), so Total really is the total target, and Complete is the amount sent?

I also saw discussion about how docker does its push, where it has a series of updates per blob sent (layers, config, etc.). Are we interested in going down that path?

I came across this with a remote.WriteIndex() that errors out once in a while, and it isn't clear how far it got or why. Having progress reports would go a long way.

@imjasonh
Copy link
Collaborator Author

I was just hunting for something like this (especially after having put together the tarball one at #675 :-) . I was stumped, as I did not see it in godoc or pkg.go.dev, and was going to start trying to think about doing it, when I found this PR. Apparently godoc only takes the latest semver release (if one exists).

I'm actually planning on doing a ggcr release today! In the meantime, https://pkg.go.dev/github.com/google/go-containerregistry@main/pkg/v1/remote should work.

Does this send v1.Update on the channel:

type Update struct {
	Total    int64
	Complete int64
	Error    error
}

so it has just those fields? Does it calculate all of the bytes before sending, across all of the manifests and layers (e.g. remote.Write() or remote.WriteIndex() (it looks like it from skimming the source), so Total really is the total target, and Complete is the amount sent?

That's correct. It sends v1.Update, having pre-calculated the Total, taking into account duplicate blobs and manifests.

I also saw discussion about how docker does its push, where it has a series of updates per blob sent (layers, config, etc.). Are we interested in going down that path?

I think that's something we could investigate and implement, yes. We could accomplish this by adding fields to the v1.Update type, but we haven't really looked into it much yet.

@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

OK, I will start with that. I have been very bogged down (in a positive sense) with various deliverables the last two weeks, otherwise I would take a stab at it. If I can at some point, I will.

@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

This question is slightly off topic, but triggered here. How do I get a v1.Layer from a file on disk? The reason I ask here is, I would normally do:

                        f, err := os.Open(somefile)
                        defer f.Close()
                        layer = stream.NewLayer(f)

But that then gives you a *stream.Layer, which cannot be used with WithProgress(). Is there a way to get a layer off an arbitrary local file that is compatible with this?

@imjasonh
Copy link
Collaborator Author

https://pkg.go.dev/github.com/google/go-containerregistry@main/pkg/v1/tarball#LayerFromFile is the best we have right now.

Ideally updates would be able to handle streaming layers by increasing the total as bytes are streamed, but this hasn't been done yet.

@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

True, but in this case, I actually know the total size (it is just a file), so no reason not to have proper updates.

Would that actually work? Or only with a tarball?

@imjasonh
Copy link
Collaborator Author

Would it make sense to have some stream.NewLayerWithSize that asserted the size? 🤔

@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

That certainly would make it easier to get WithProgress() working. I would think that the ability to just get a layer from a file - i.e. something that has a known size - would make sense. Then again, that is precisely what NewLayerWithSize does.

At some point, if/when we decide to have per-layer progress reports, we will need the size for each layer as we push it up, won't we? This problem becomes the same there.

Although, as I think about it, when we push an Image or Index, we have the sizes of each blob from the manifest parent, so WithSize actually should cover that use case.

@imjasonh
Copy link
Collaborator Author

A stream.Layer doesn't know its size or digest by default today, and so a v1.Image that contains a stream.Layer can't calculate its Manifest or Digest or basically anything until that stream has been consumed and its digest and size calculated (these are streamily calculated while the stream is being consumed).

Care is taken so that mutations are lazily applied, so you can mutate.Append a stream.Layer and remote.Write it, which will read and push the blob before committing the just-recently-calculatable manifest. It's complex and brittle and honestly kind of a headache for not much gain in a lot of cases.

The use case for stream.Layer is in being able to generate layer contents and stream them to a registry (etc.), not in being able to stream bytes from disk, where size and digest are pretty easily calculatable. tarball.LayerFromFile assumes the file contents are already tarred, but if that's not the case you can either tar it and write it to back disk temporarily, or tar+stream it through tarball.LayerFromOpener, or just implement v1.Layer however else you want. stream.Layer is probably not going to be a good fit if you already know the size and digest, at least not in its current form.

@deitch
Copy link
Collaborator

deitch commented Apr 23, 2021

The use case for stream.Layer is in being able to generate layer contents and stream them to a registry (etc.), not in being able to stream bytes from disk, where size and digest are pretty easily calculatable. tarball.LayerFromFile assumes the file contents are already tarred, but if that's not the case you can either tar it and write it to back disk temporarily, or tar+stream it through tarball.LayerFromOpener, or just implement v1.Layer however else you want. stream.Layer is probably not going to be a good fit if you already know the size and digest, at least not in its current form.

So thinking about the bigger picture, we either need a concept of presetting size and digest on stream.Layer, or having a different kind of layer that represents one whose digest and size are known.

What do layout.Image and remote.Image and tarball.Image and such do, when returning a v1.Layer?

@imjasonh
Copy link
Collaborator Author

remote.Image never returns a stream.Layer, it only returns a remote.remoteLayer which calls fetchBlob lazily when you request the blob contents. layout.Image and tarball.Image already assume they has the full manifest (as bytes or struct), so it also doesn't have to deal with stream.Layers directly.

Really the only time you'd have to deal with them is when you mutate.AppendLayer a stream.Layer on top of them, in which case you're actually dealing with the mutate.image, which handles all the lazy calculation for you, and fails if an appended stream.Layer hasn't been computed yet.

stream.Layer is intended for cases where the contents are generated on-the-fly, so size and digest are unknowable. If you already know those things, you probably want something less annoying to deal with, such as a tarball.Layer or some other v1.Layer implementation built for your purpose.

@jonjohnsonjr
Copy link
Collaborator

remote.Image never returns a stream.Layer, it only returns a remote.remoteLayer which calls fetchBlob lazily when you request the blob contents.

This is a slight oversimplification. If you've accessed a layer via remote.Image, we return a remoteImageLayer, which cheaply computes the digest, size, diffid, and mediaType by looking at the image manifest or config file.

If you access a layer via remote.Layer, we return a remoteLayer, which can get the size via a HEAD request, knows the digest because it was an input, and has the mediaType hardcoded (best we can do, really).

@deitch do you have this layer inside a layout? Or is it just a file? Where do you have this metadata stored? We'd probably want to implement something that knows how to read/parse this metadata and can return a layer with all of this pre-populated, rather than you having to supply each piece individually.

@imjasonh
Copy link
Collaborator Author

This is a slight oversimplification. If you've accessed a layer via remote.Image, we return a remoteImageLayer, which cheaply computes the digest, size, diffid, and mediaType by looking at the image manifest or config file.

Thanks for this factcheck, you're right. In either case, stream.Layer is not involved when you fetch an image using pkg/remote.

@jonjohnsonjr
Copy link
Collaborator

Some maybe useful example of similar situations...

cosign uploading signatures:

https://github.com/sigstore/cosign/blob/7fee630550dc8939dcbdf4f68898944844794569/pkg/cosign/remote.go#L86-L120

It's a very basic static v1.Layer implementation that just returns some bytes and computes its metadata.

It does this itself because in this case we don't want to compress the layer, which stream and tarball assume you do.

rules_docker translating from bazel's representation to v1.Layer:

https://github.com/bazelbuild/rules_docker/blob/8c3a8110a0c519929a7e79c39ac345a0f8c74d04/container/go/pkg/compat/image.go#L35-L115

loadHashes reads the digest and diffid from rules_docker's internal representation of these images, and Size/UncompressedSize just Stat the relevant file.

crane uploading a layer from a file:

func getLayer(path string) (v1.Layer, error) {
f, err := streamFile(path)
if err != nil {
return nil, err
}
if f != nil {
return stream.NewLayer(f), nil
}
return tarball.LayerFromFile(path)
}

If it's a streaming thing, we use stream.Layer, otherwise tarball.Layer. This doesn't have any metadata associated with it, so we just fall back to v1.Layer implementations that will compute all of this for you.

@deitch
Copy link
Collaborator

deitch commented Apr 25, 2021

Ah all of that makes sense and clarifies a lot. I didn't think about the implications when I had a file, so I just wrapped it all with a stream.Layer. Clearly, that wasn't the best option.

do you have this layer inside a layout? Or is it just a file?

Both use cases, actually. I am guessing that you are hinting that in a layout, I can just use the v1/layout properties to get a layer? Or even a whole image?

What about for just a file?

@deitch
Copy link
Collaborator

deitch commented Apr 25, 2021

As for the progress, I really would like to figure out a way to get WithProgress to report writing by blob and manifest (similar to docker push). I have had a number of cases where the write hangs, or sometimes returns an error, when I am writing an index with 3 manifests underneath it, each of which may have several decent-sized layers. I have no idea which one is causing the error. I imagine reporting by blob would help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add WithProgress option for remote.Write TestMultiWrite_Deep is slow
4 participants