
Packages upload to ghcr #208

Draft: Hind-M wants to merge 1 commit into main
Conversation

@Hind-M (Member) commented Oct 25, 2022

Checklist

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@hmaarrfk (Contributor)

I'm mostly passing by, but can you share some context on what this is? Is there another issue open about this?

@Hind-M (Member, Author) commented Oct 26, 2022

I'm mostly passing by, but can you share some context on what this is? Is there another issue open about this?

Hey! So this is related to the requested feature of uploading packages to the GitHub Container Registry (ghcr.io) in addition to anaconda.org.
Another PR is also related and should probably be merged before this one.

@beckermr (Member) left a review comment

This upload cannot go here. We have to do it via the webservices in order to verify the artifacts.

@Hind-M (Member, Author) commented Nov 9, 2022

This upload cannot go here. We have to do it via the webservices in order to verify the artifacts.

Oh, OK! Where exactly do you suggest doing it? Is it somewhere here, somewhere else, or in another repo?
Thanks!

@beckermr (Member) commented Nov 9, 2022

Somewhere else completely. We'll need to do it either on the Heroku server or via a dispatch to GitHub Actions.

cc @wolfv for visibility

@wolfv (Member) commented Nov 10, 2022

Yeah, I think there are still quite a few considerations to work through in terms of where to put this functionality.

Regarding verification, one could also do that via repodata (which is not automatically generated at this point). The package could be uploaded to the OCI registry but only added to the repodata after passing the validation step (and otherwise be removed from the OCI registry again). Just a thought.

It would be cool, though, to start putting together a standalone feedstock that does the upload-after-build to the OCI registry.

If we want to do the upload on the Heroku server, then this is probably the relevant code: https://github.com/conda-forge/conda-forge-webservices/blob/ac84983eb66239c8d3bd6f5fb8b3297f709d2f8d/conda_forge_webservices/webapp.py#L498
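
A minimal sketch of the upload-first, index-after-validation flow described above, with hypothetical helpers; none of this exists in conda-forge today:

```python
# Hypothetical sketch of "upload first, index after validation".
# All three helpers are placeholders, not existing conda-forge code.

def validate_package(ref: str) -> bool:
    # Placeholder: run artifact checks (hashes, metadata, provenance).
    raise NotImplementedError

def append_to_repodata(ref: str) -> None:
    # Placeholder: add the package entry to the generated repodata.json.
    raise NotImplementedError

def delete_from_oci(ref: str) -> None:
    # Placeholder: remove the artifact from the OCI registry again.
    raise NotImplementedError

def process_staged_upload(ref: str) -> None:
    # The package already sits in the OCI registry but is invisible to
    # solvers until repodata lists it.
    if validate_package(ref):
        append_to_repodata(ref)  # package becomes installable
    else:
        delete_from_oci(ref)     # failed validation: drop the artifact
```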

@beckermr (Member)

So the Heroku server can't do the upload itself; it'd grind to a halt. We'll need to dispatch out to another service, or we need to stage into one OCI registry and copy to another via an API call.

@wolfv (Member) commented Nov 10, 2022

We can also use tags (e.g. 0.25.2_blabla_staging) and then just change the tag.
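
For illustration, promoting by re-tagging could look roughly like this against the OCI distribution API; the `/v2/` endpoints come from the spec, while the registry, repository and token values are placeholders:

```python
# Sketch of "promote by re-tagging" via the OCI distribution API.
# REGISTRY, REPO and TOKEN are placeholders.
import requests

REGISTRY = "https://ghcr.io"
REPO = "channel-mirrors/some-package"  # hypothetical repository
TOKEN = "..."                          # registry bearer token
MANIFEST = "application/vnd.oci.image.manifest.v1+json"

# 1. Fetch the manifest currently published under the staging tag.
resp = requests.get(
    f"{REGISTRY}/v2/{REPO}/manifests/0.25.2_blabla_staging",
    headers={"Authorization": f"Bearer {TOKEN}", "Accept": MANIFEST},
)
resp.raise_for_status()

# 2. Push the same manifest bytes under the final tag. Blobs are
#    content-addressed, so nothing gets re-uploaded by this step.
requests.put(
    f"{REGISTRY}/v2/{REPO}/manifests/0.25.2",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": MANIFEST},
    data=resp.content,
).raise_for_status()
```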

@beckermr (Member)

As long as we don't ship repodata pointing to tags, that'd be fine.

@beckermr (Member)

Actually, I'm not sure labels/tags will work. We shouldn't have keys that can upload to our registry sitting in feedstocks out in the open. We need a staging area and then a secured copy.

@Hind-M (Member, Author) commented Nov 22, 2022

IIUC, we could upload to ghcr.io the same way it is done with anaconda.org, using a staging area and then copying to prod, couldn't we?
If so, we could/should keep the upload in upload_or_check_non_existence.py in this repo, and add the copy part from cf-staging to conda-forge (plus whatever else is missing) in the webservices (webapp.py)?

@beckermr (Member) commented Nov 22, 2022

Yes, a staging area could work. However, remember that the copy from cf-staging to conda-forge on anaconda.org is a single HTTP request made to anaconda.org once the package data has been validated. We never download and re-upload packages. So to make the ghcr setup work on our webservices instance, you'll need to find a similar HTTP API endpoint. That endpoint also needs to return the package hash for validation.
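
For reference, a sketch of what that server-side copy looks like today, assuming anaconda-client's Binstar.copy API (the exact signature may differ):

```python
# Sketch of the existing cf-staging -> conda-forge copy on anaconda.org,
# assuming anaconda-client's Binstar.copy API; exact signature may differ.
from binstar_client.utils import get_server_api

api = get_server_api(token="...")  # admin token, never exposed to feedstocks

# A single server-side request: anaconda.org moves the file between
# channels itself, so nothing is downloaded or re-uploaded by us.
api.copy(
    "cf-staging",            # from this owner...
    "some-package",          # hypothetical package name
    "1.0.0",
    to_owner="conda-forge",  # ...to this one
)
```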

@DerThorsten

I am trying to figure out what would be needed to move forward with the GitHub OCI upload:

Yes, a staging area could work. However, remember that the copy from cf-staging to conda-forge on anaconda.org is a single HTTP request made to anaconda.org once the package data has been validated. We never download and re-upload packages. So to make the ghcr setup work on our webservices instance, you'll need to find a similar HTTP API endpoint. That endpoint also needs to return the package hash for validation.

I am relatively new to the world of OCI registries, so forgive me if I am confusing things :) but I tried to look into the specs to find such an API endpoint. The OCI distribution spec mentions an endpoint which might help avoid a download/re-upload:

"If a necessary blob exists already in another repository within the same registry, it can be mounted into a different repository via a POST request [...]"
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#mounting-a-blob-from-another-repository
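
A sketch of what that cross-repository mount could look like as a raw HTTP call; the endpoint is from the distribution spec, while the repository names, digest and token are placeholders:

```python
# Cross-repository blob mount per the OCI distribution spec:
# POST /v2/<target>/blobs/uploads/?mount=<digest>&from=<source>.
# Repository names, digest and token are placeholders.
import requests

REGISTRY = "https://ghcr.io"
TARGET = "channel-mirrors/some-package"          # hypothetical prod repo
SOURCE = "channel-mirrors-staging/some-package"  # hypothetical staging repo
DIGEST = "sha256:..."                            # digest of the staged blob

resp = requests.post(
    f"{REGISTRY}/v2/{TARGET}/blobs/uploads/",
    params={"mount": DIGEST, "from": SOURCE},
    headers={"Authorization": "Bearer ..."},
)
# 201 Created: the blob was mounted without any data transfer.
# 202 Accepted: the registry fell back to a regular upload session.
print(resp.status_code)
```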

@beckermr (Member)

Sure, that looks promising, but I know nothing about OCI registries. I'll leave this to you and @wolfv to work out. Ideally, we could wrap the copy in the conda OCI package @wolfv has going so it is easy to use.

We have some security requirements here related to tokens that I will share with @wolfv privately once the copy is working.

@jaimergp (Member) commented Jun 2, 2023

I've been thinking about this and doing some research. This is not a definitive assessment but a work in progress. I am not saying all of the following is a good idea, but at least it takes us into the realm of what's feasible today.


The main concern right now is how to do staging in a safe way. conda-forge uses the cf-staging Anaconda.org channel, where all feedstocks upload their artifacts. If the artifacts pass validation, a webservice copies them from cf-staging to conda-forge. Anaconda.org services then index all conda-forge packages in the corresponding repodata.json.

Staging serves two purposes then:

  • Limiting access to the main channel
  • Avoiding early publication of a problematic artifact

How do we do this with OCI artifacts? The limitations are:

  • We only have one organization (channel-mirrors) so far, so feedstocks would get access to the "main" channel. This might not be as problematic as it sounds, but we need to ensure that's the case. [1]
  • Artifact metadata needs to be added before the upload, and I am not aware of a mechanism that allows metadata modification after the upload.
  • Copying artifacts from one channel to another involves downloads, uploads and some API calls (definitely more expensive than the single COPY request to Anaconda.org). [2]
  • We need to handle our own conda-index-equivalent process with remote artifacts.

So, all in all, I think we can run everything off the channel-mirrors organization. We just need to devise a different staging mechanism. I suggest:

  1. We mimic what Homebrew does and publish our own repodata using a similar approach. [3]
  2. Come up with a way to mark an artifact as ready for publication after an upload. Annotations and labels seem to be pre-upload only, but maybe GH has a field we can use, like visibility or something. [4]
  3. Let feedstocks upload (only upload) to channel-mirrors and have the validation service run the needed checks on the new artifacts.
  4. If a package passes, the required metadata is modified accordingly, and the artifact will be published to the repodata in the next scheduled run.
  5. If it doesn't, the required metadata won't be present, and the package will be deleted in the next scheduled run (a different workflow than in step 4; a sketch of this sweep follows below). Accidental deletions can still be recovered within the 30-day window.
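
A rough sketch of the scheduled cleanup in step 5, using the GitHub Packages REST API to list and delete container package versions; the validation-marker check is a hypothetical placeholder for whatever step 2 ends up using:

```python
# Sketch of the scheduled sweep from step 5. The list/delete endpoints
# are from the GitHub Packages REST API; is_marked_valid is hypothetical.
import requests

ORG = "channel-mirrors"
API = "https://api.github.com"
HEADERS = {
    "Authorization": "Bearer ...",
    "Accept": "application/vnd.github+json",
}

def is_marked_valid(version: dict) -> bool:
    # Hypothetical: check the "ready for publication" marker from step 2.
    raise NotImplementedError

def sweep(package_name: str) -> None:
    # First page only; pagination omitted for brevity.
    versions = requests.get(
        f"{API}/orgs/{ORG}/packages/container/{package_name}/versions",
        headers=HEADERS,
    ).json()
    for v in versions:
        if not is_marked_valid(v):
            # Deleted versions stay recoverable for 30 days.
            requests.delete(
                f"{API}/orgs/{ORG}/packages/container/{package_name}"
                f"/versions/{v['id']}",
                headers=HEADERS,
            ).raise_for_status()
```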

Footnotes

  1. Permission-wise, GH distinguishes between read, write and delete, which means that a properly scoped token used by feedstocks could at worst write too many things, but in no way delete existing blobs. Note these tokens are NOT fine-grained.
    There's also a 30-day restore window if necessary. Deleted packages are available in the Settings UI.
    Package overwriting shouldn't be possible (it would be a different hash anyway). The risks of a cross-feedstock publication are low as long as we have a validation process in place.

  2. I read the OCI spec, and apparently it supports the notion of "mounting blobs" from other repositories within the same registry. This means it could mimic the cf-staging to conda-forge setup on Anaconda.org. The GitHub Packages API doesn't seem to support mounting, though; there are also some issues about it online, still open.

  3. See how Homebrew does this with 15-minute scheduled jobs; even the API is pre-generated JSON deployed to GH Pages in an environment. Their biggest payload is 20 MB of pure JSON, though. These point to sha256 digests in ghcr.io. Search uses Algolia too!

  4. See this for OCI annotations. I don't know if they can be added after an upload. What about tags? Can they be modified, added or removed? Right now tags encode the version and the build string. The UI does distinguish tagged vs. untagged.

@Hind-M (Member, Author) commented Jun 19, 2023

Come up with a way to mark an artifact as ready for publication after an upload. Annotations and labels seem to be pre-upload only, but maybe GH has a field we can use, like visibility or something.

  • I believe labels were superseded by annotations (https://github.com/opencontainers/image-spec/blob/main/annotations.md#back-compatibility-with-label-schema), and these indeed cannot be edited after building the artifact. But there is an interesting solution where we could add annotations to existing artifacts by creating a separate ORAS Artifact Manifest that refers to the original one via its digest and lives in the same repository (see the sketch after this list).
    I suppose that tags could also be a solution.

  • Visibility does exist, apparently (see listing packages for an organization), and it can be public, private or internal.

  • For the staging strategy, to be sure that I understood correctly: do you mean not using a staging area and distinguishing the ready artifacts only via metadata/annotations?
    When you say running everything off the channel-mirrors organization, where would that be? (Within the organization of the corresponding GH repository we are packaging, for example?)
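
For illustration, a sketch of such a post-hoc annotation using the OCI 1.1 referrers mechanism (an image manifest with a `subject` field, the successor to the ORAS Artifact Manifest approach); whether ghcr.io supports this is exactly what would need checking. Repo, digests, sizes, artifact type and annotation key are all placeholders:

```python
# Rough sketch of post-hoc annotation via an OCI 1.1 referrer manifest
# whose `subject` points at the original artifact. Assumes the 2-byte
# empty blob "{}" was already pushed to the repository and that the
# registry implements OCI 1.1 referrers.
import json
import requests

REGISTRY = "https://ghcr.io"
REPO = "channel-mirrors/some-package"  # hypothetical repository

EMPTY = {  # the well-known OCI "empty" descriptor (sha256 of "{}")
    "mediaType": "application/vnd.oci.empty.v1+json",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2,
}
SUBJECT = {  # descriptor of the original package manifest
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:...",
    "size": 1234,
}

referrer = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/vnd.conda-forge.validation",  # made-up type
    "config": EMPTY,
    "layers": [EMPTY],
    "subject": SUBJECT,
    "annotations": {"org.conda-forge.validated": "true"},  # made-up key
}

requests.put(
    f"{REGISTRY}/v2/{REPO}/manifests/validation-marker",  # arbitrary tag
    headers={
        "Authorization": "Bearer ...",
        "Content-Type": "application/vnd.oci.image.manifest.v1+json",
    },
    data=json.dumps(referrer),
).raise_for_status()
```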

@jaimergp (Member)

but there is an interesting solution where we could add annotations to existing artifacts by creating a separate ORAS Artifact Manifest that refers to the original one via its digest and lives in the same repository.

That repo (johnsonshi/annotate-registry-artifacts) is indeed interesting. I am concerned about the permissions here, because in principle any feedstock could add the metadata bit that says "yes, it is a valid artifact", unless we put that info somewhere else 🤔 Or maybe we need to check.


About visibility, I read a bit more into it and, while it could work, we must note that:

Warning: Once you make a package public, you cannot make it private again.

So we would have to upload a package as private, then run the validation and either publish it as public or delete it. I don't know if the number of packages marked as "private" counts towards some kind of quota, but hopefully the number of artifacts marked as such at any given time is small.


For the staging strategy, to be sure that I understood correctly: do you mean not using a staging area and distinguishing the ready artifacts only via metadata/annotations?

Correct, that's my proposal so far.

When you say running everything off the channel-mirrors organization, where would that be? (Within the organization of the corresponding GH repository we are packaging, for example?)

Maybe a repo like channel-mirrors/index or channel-mirrors/repodata. Maybe this can be published to the OCI registry too (instead of GH Pages), but it needs to run on some sort of cronjob anyway, and I am assuming the Homebrew folks went with GH Pages for a good reason.

@jaimergp (Member) commented Jul 8, 2023

We discussed this approach in the monthly bot meeting, and Matt raised a point I had not considered: the repodata.json schema doesn't allow external URLs for packages; it assumes that files will be co-located next to the repodata.json. So either:

  • we provide a thousand redirection endpoints in the GH Pages "channel", or...
  • we submit the necessary CEPs to adjust the repodata schema to (optionally) allow external URLs, which would take precedence over the next-to-repodata assumption (a sketch of what that could look like follows below)
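
For illustration, a hypothetical version of the second option, written as the dict a client would parse; the base_url field shown here did not exist in the schema at the time and is precisely what such a CEP would have to introduce:

```python
# Hypothetical repodata.json extension. The "base_url" key is invented
# for illustration; introducing something like it is what the CEP is for.
repodata = {
    "info": {
        "subdir": "noarch",
        # Would take precedence over "files sit next to repodata.json".
        "base_url": "https://ghcr.io/channel-mirrors/conda-forge",  # hypothetical
    },
    "packages.conda": {
        "some-package-1.0.0-pyhd8ed1ab_0.conda": {
            "sha256": "...",
            "depends": [],
        },
    },
}
```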

@jaimergp (Member)

@Hind-M and I met with @wolfv today and discussed potential alternatives:

Staging:

  • Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging
  • Feedstocks upload to staging, and do not have access to production
  • A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

Repodata publication:

  • Only for packages in channel-mirrors
  • It can be served as an OCI artifact, or on GH Pages à la brew.
  • We need to add a plugin to conda to handle OCI-backed channels. This plugin will also be responsible for "figuring out" the OCI URL for each package in the repodata.json (see the sketch after this list). We might just need an "endpoint" URL in the repodata header instead of a per-artifact URL.
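
A sketch of that "figuring out" step: mapping a repodata filename to an OCI reference. The naming scheme shown (one repository per subdir/name, a version-build tag) is an assumption, not a settled convention:

```python
# Hypothetical filename -> OCI reference mapping; the naming scheme is
# an assumption for illustration only. Requires Python 3.9+ (removesuffix).
def oci_reference(channel: str, subdir: str, filename: str) -> str:
    # conda filenames are <name>-<version>-<build>.<ext>
    name, version, rest = filename.rsplit("-", 2)
    build = rest.removesuffix(".conda").removesuffix(".tar.bz2")
    return f"ghcr.io/channel-mirrors/{channel}/{subdir}/{name}:{version}-{build}"

print(oci_reference("conda-forge", "noarch", "some-package-1.0.0-pyhd8ed1ab_0.conda"))
# -> ghcr.io/channel-mirrors/conda-forge/noarch/some-package:1.0.0-pyhd8ed1ab_0
```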

Some other notes:

  • $GITHUB_TOKEN can't be used with packages in GHA workflows due to scale problems. Instead, one needs to supply a packages:write PAT as a secret to the workflow. In conda-forge, this is best done via our token app or a bot account so the PAT is not tied to a personal account.
  • OCI operations (e.g. artifact download) are authenticated. The PAT is only needed to request a token; the actual operation happens with a different one. This means that we can (theoretically) add more conditions to the per-operation token minting and further reduce the scope to a single package name or something like that. A sketch of this token exchange follows below.
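
A sketch of that token exchange, following the standard registry token flow; the PAT, package name and scope are placeholders:

```python
# The PAT is only used to mint a short-lived, narrowly scoped registry
# token; the actual OCI operation then uses that token instead.
import requests

pat = "ghp_..."  # packages-scoped PAT (placeholder)

resp = requests.get(
    "https://ghcr.io/token",
    params={"scope": "repository:channel-mirrors/some-package:pull"},
    auth=("x", pat),  # basic auth; GHCR accepts the PAT as the password
)
registry_token = resp.json()["token"]

# The download itself authenticates with the minted token, not the PAT.
manifest = requests.get(
    "https://ghcr.io/v2/channel-mirrors/some-package/manifests/1.0.0-0",
    headers={
        "Authorization": f"Bearer {registry_token}",
        "Accept": "application/vnd.oci.image.manifest.v1+json",
    },
)
```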

@isuruf (Member) commented Oct 16, 2023

Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging
A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

I'm not sure what the difference is between this approach and using a cronjob at channel-mirrors to download from anaconda.org and push to the channel-mirrors org directly.

@Hind-M (Member, Author) commented Jul 30, 2024

Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging
A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

I'm not sure what the difference is between this approach and using a cronjob at channel-mirrors to download from anaconda.org and push to the channel-mirrors org directly.

Because we want to do it independently of anaconda.org.
