Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download: Fix support for Seqera Community Containers #3179

Open
MatthiasZepper opened this issue Sep 24, 2024 · 5 comments
Open

Download: Fix support for Seqera Community Containers #3179

MatthiasZepper opened this issue Sep 24, 2024 · 5 comments
Labels
bug Something isn't working high-priority

Comments

@MatthiasZepper
Copy link
Member

MatthiasZepper commented Sep 24, 2024

Description of the bug

While I got the impression that the adoption of Seqera Community Containers would be tightly linked to the new dynamically created container paths upon pipeline release, this does not hold true.

We now already have 70 modules (mainly local but also some in the module's directory), which have no support by nf-core (pipelines) download and may not run offline.

This shows mainly by two means:

  1. The available Singularity image specified with an oras:// protocol is simply ignored. Instead, the Docker version is converted, which takes much longer. Also, the converted image will not be found if the containerEngine is set to 'Singularity', because the SHA of the native Singularity image and the Docker converted image differ.
  2. A huge mess is created on the level of symlinks, because the registry is not cleanly trimmed and image names containing two registries are created: depot.galaxyproject.org-community.wave.seqera.io-library-python_numpy_pip_checkqc_interop-b5301d9801b8e66b.img. Evidently there is no use for this and if a user copies the files with cp -L, a lot of unessessary data duplication happens.

I would have preferred a synchronized and coordinated update, but since this stuff is now out in the wild, I reckon that nf-core (pipelines) download should be adapted to:

  1. Have a sample of such a declaration in the mock module container definitions and adapt the corresponding test test_find_container_images_modules().
  2. Ensure that gather_registries() hardcodes community.wave.seqera.io similar to depot.galaxyproject.org, so it is not required to explicitly configure this registry, e.g. with the --library / -l argument.
  3. Ensure that singularity_image_filenames() handles the filenames correctly. Mind the library subfolder, such that the resulting image name should be community.wave.seqera.io-library-python_pip_samshee-84a770c9853c725d.img for oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d
  4. Enable prioritize_direct_download() (or write a dedicated function for this purpose) to also prioritize oras:// over the Docker images and adapt the corresponding test. Mind that the SHA sums differ, and this needs to be accounted for: oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d and community.wave.seqera.io/library/python_pip_samshee:e8a5c47ec32efa42 are equivalent. Just ignoring the SHA sums however could prove risky, as there is no way to assure that the correct version of the tool is used then. Mind that it is not uncommon for a pipeline to use tools in different versions within the same release. Therefore, downloading both is possibly the only way forward without an API request to Seqera Community Containers.
  5. get_singularity_images() should sort the oras:// links to containers_pull. Ensure that the output path is correct, the pulling works. Lastly, fix the test.
  6. Ensure symlink_singularity_images() handles the images correctly and trims the hard-coded registries prior to building the symlinks. Update the test.

Thx

Command used and terminal output

No response

System information

No response

@MatthiasZepper MatthiasZepper added the bug Something isn't working label Sep 24, 2024
@MatthiasZepper MatthiasZepper added this to the 3.0 milestone Sep 24, 2024
@ewels
Copy link
Member

ewels commented Sep 24, 2024

Great write up, thanks @MatthiasZepper!

Two minor notes:

  • it's impossible to omit tags with Seqera containers. There is no default latest so it'll just fail without one.
  • it should be possible to have https downloads for singularity SIF files instead of oras:// if we want, meaning we can preserve the current ability to download without needing singularity to be installed. I need to check the status on this.

@ewels ewels changed the title Download: Pathetic support for Seqera Community Containers Download: Fix support for Seqera Community Containers Sep 24, 2024
@ewels ewels pinned this issue Sep 24, 2024
@ewels
Copy link
Member

ewels commented Sep 24, 2024

Also note that there are two things happening here: the first comments you linked to are for a PR that hasn't been merged. We have not yet adopted Seqera Containers so it's not unexpected that nf-core/tools has not yet been updated.

However, there is organic uptake of Seqera Containers by the community, putting them into old-style container declarations, which breaks stuff as you say. This is enough to make this a high priority task, even without the upcoming planned wider adoption. So we should get on with this ASAP.

@MatthiasZepper
Copy link
Member Author

MatthiasZepper commented Sep 24, 2024

it's impossible to omit tags with Seqera containers. There is no default latest so it'll just fail without one

That is not my problem, I actually think that the SHAs are really useful. The problem is, that they now differ! Previously, a container declaration looked like this:

container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/umi_tools:1.1.5--py39hf95cd2a_0' :
        'biocontainers/umi_tools:1.1.5--py39hf95cd2a_0' }"

Because the umi_tools:1.1.5--py39hf95cd2a_0 part was identical, it was easy to write a function prioritize_direct_download() that would kick out all Docker image definitions, if a matching direct https:// download was found. That was all that was needed:

def prioritize_direct_download(self, container_list):
        d = {}
        for c in container_list:
            if re.match(r"^$|(?!^http)", d.get(k := re.sub(".*/(.*)", "\\1", c), "")):
                log.debug(f"{c} matches and will be saved as {k}")
                d[k] = c
        return sorted(list(d.values()))

With the new Seqera Community Containers, the native Singularity image and the Docker image have different SHAs, because they are technically two different images:

 container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
  'oras://community.wave.seqera.io/library/samtools_pip_umi-tools:d8a86bb132628b96':
   'community.wave.seqera.io/library/samtools_pip_umi-tools:9cfcc6ebc1d12072' }"

But that makes it a lot harder to determine, which of the records should be given priority. Particularly when downloading multiple revisions of a pipeline at the same time, there are commonly multiple versions of the same tool in the raw list of container images that somehow need to be consolidated. This problem is anything but trivial and maybe downloading both is the only way forward. (Alternatively, making an API request to Seqera Containers to obtain the alternative URI would be better of course).

It should be possible to have https downloads for singularity SIF files instead of oras:// if we want, meaning we can preserve the current ability to download without needing singularity to be installed. I need to check the status on this.

I have not yet considered this aspect, actually. It is true, that it is nice to not need Singularity when downloading, but de facto we can assume that is it available in most cases. For MacOS, I wrote a virtual machine, so that works as well. I have no/few objections to pull the images instead of using https:// downloads, that is relatively simple to implement, if the priority and image SHA resolution is somehow solved.

Also note that there are two things happening here: the first comments you linked to are for a PR that hasn't been merged. We have not yet adopted Seqera Containers so it's not unexpected that nf-core/tools has not yet been updated. However, there is organic uptake of Seqera Containers by the community, putting them into old-style container declarations, which breaks stuff as you say.

Yep, but it is the very same people discussing the official proposal and approving the PRs for the "organic" updates to modules. Unfortunately, the offline capability of a pipeline is essentially an untestable feature, because I can't take a Github runner offline, even though I tried to catch some download issues during a test on the pipeline level. Once a pipeline is released, the code is unfortunately out in the wild and therefore, I wish that offline usability was not an afterthought of module or pipeline updates.

@ewels ewels removed this from the 3.0 milestone Sep 26, 2024
@ewels
Copy link
Member

ewels commented Sep 28, 2024

Writing up the blog post / migration guide for Seqera Containers adoption now. It should massively simplify the task for nf-core download. However, we will still need to handle old pipelines / container edge cases. I'm hoping that an updated nextflow inspect that can return all processes will be able to handle this though. Ideally this will mean that most of the complex logic that we have within nf-core download won't be needed.

Do you see any issues with this @MatthiasZepper ?

@MatthiasZepper
Copy link
Member Author

Writing up the blog post / migration guide for Seqera Containers adoption now. It should massively simplify the task for nf-core download.

I haven't read your blog post yet, so hard to tell if I see issues with it - where do I find the draft? But I already commented on your POC drafts that I really like the idea of having a central YAML/JSON that is being generated upon pipeline release and that contains all container URIs. That will indeed help tons, since one does not have to sift through all modules and configs to find container declarations.

However, we will still need to handle old pipelines / container edge cases.

I am starting to lean towards telling people to use an older version of tools for that purpose and just dropping the support entirely. If it needs to be, one could probably retain the old code and switch to it if the requested pipeline revision does not have said YAML yet.

I'm hoping that an updated nextflow inspect that can return all processes will be able to handle this though. Ideally, this will mean that most of the complex logic that we have within nf-core download won't be needed.

Yes, that would be much appreciated. I did not know that inspect is being developed into that direction, but yes: It's adoption into download was mostly hampered by the issue of specifying upfront how a particular pipeline will be run.

If it was able to return the containers URIs also for a different platform architecture than the one it currently runs on, that would be the cherry on the cake (e.g. I execute it on a Mac and ARM64 architecture, but I would like to get the container URIs for an amd64 Linux pipeline execution). Since it considers the configs, I think that should be possible already...I should try that out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high-priority
Projects
No open projects
Status: Todo
Development

No branches or pull requests

2 participants