Update images and add Ubuntu 20.04 #157

Merged
jngrad merged 4 commits into espressomd:master from update-images-add-ubuntu20 on Mar 12, 2020

Conversation

@jngrad (Member) commented Mar 12, 2020

Description of changes:

jngrad added 3 commits March 9, 2020 18:49

  • The issue in v0.16 and v0.17 seems to have been fixed in v0.17.1.
  • The 3.1 release has numerous bugs that can only be fixed upstream.
  • Features gcc 9, Clang 9, boost 1.71 and CMake 3.16.3.
@jngrad (Member Author) commented Mar 12, 2020

A few notes on Clang 9:

  • Many -Wdeprecated-declarations warnings are emitted for Cython code, because tp_print in /usr/include/python3.8/cpython/object.h is marked as deprecated with Py_DEPRECATED(3.8). They can be suppressed with -DCMAKE_CXX_FLAGS=-Wno-deprecated-declarations, and they don't become errors with -DWARNINGS_ARE_ERRORS=ON (see the sketch after this list).
  • clang-tidy finds new issues (e.g. modernize-avoid-c-arrays, bugprone-narrowing-conversions, bugprone-macro-parentheses) in many places; we probably want to defer addressing them to limit interference with the walberla PR.
  • Debug builds can become extremely slow with UBSAN or ASAN (or both) enabled: around 10 tests take between 300 and 900 seconds to run (usually the integrator, thermostat and reaction ensemble tests), and the slow-down varies a lot between two runs of check_python. I'm not sure what causes this; strace -c ./pypresso testsuite/python/integrator_npt.py shows only about 30 ms spent in syscalls, and the number of calls of each syscall type is the same. There is no backport of clang-9 in Ubuntu 16.04, so I could not compare.
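
For reference, a minimal configure-and-profile sketch covering the first and last point. The out-of-source build layout and the Debug build type are assumptions; only the flags and the strace command come from the notes above.

```sh
# Hedged sketch: silence the Cython deprecation warnings at configure time
# (build directory layout and build type are assumptions, not project policy).
mkdir -p build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_FLAGS=-Wno-deprecated-declarations \
  -DWARNINGS_ARE_ERRORS=ON
make -j"$(nproc)"

# Profile one of the slow tests: as noted above, only ~30 ms show up in
# syscalls, so the sanitizer slow-down is not syscall-bound.
strace -c ./pypresso testsuite/python/integrator_npt.py
```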

@jngrad (Member Author) commented Mar 12, 2020

https://gitlab.icp.uni-stuttgart.de/espressomd/docker/-/jobs/214138
@mkuron the ROCm image is really testing my patience right now. apt-get install rocfft now puts librocfft.so in a different folder, /opt/rocm-3.1.0/lib, while the other libs are installed in /opt/rocm/lib. The rocm/dev-ubuntu-18.04:3.0 base image did not change.
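
A quick way to confirm the mismatch inside the image; this is a hedged sketch and the exact file names are assumptions.

```sh
# Hedged sketch: check where the package really puts its library and what
# the /opt/rocm symlink points to (file names are assumptions).
find /opt -maxdepth 3 -name 'librocfft*'
ls -l /opt/rocm               # symlink target: rocm-3.0.0 or rocm-3.1.0?
dpkg -L rocfft | grep '\.so'  # install paths recorded by the package
```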

@fweik requested a review from KaiSzuttor March 12, 2020 12:39
@KaiSzuttor removed their request for review March 12, 2020 12:55
@KaiSzuttor (Member)

@mkuron could you please have a look?

@jngrad (Member Author) commented Mar 12, 2020

@mkuron I've changed the line in /etc/apt/sources.list.d/rocm.list to get packages from 3.0 instead of latest; let's see if it changes anything locally.
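
A minimal sketch of what that pin could look like; the repository URL and suite name are assumptions about what repo.radeon.com publishes for ROCm 3.0, not a quote from the actual Dockerfile.

```sh
# Hedged sketch: replace the "latest" ROCm apt source with a version-pinned
# one (URL and suite name are assumptions; adjust to the real 3.0 repo).
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.0/ xenial main' \
  > /etc/apt/sources.list.d/rocm.list
apt-get update
```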

@KaiSzuttor (Member)

@jngrad please don't invest too much time in this unstable feature.

The ROCm apt URL was pointing to the latest ROCm release, causing
a mix of ROCm 3.0 and ROCm 3.1 libraries to be installed and
breaking the /opt/rocm symlink to /opt/rocm-3.1.0.
@jngrad (Member Author) commented Mar 12, 2020

This ROCm issue should have been investigated 10 days ago when the master CI was red. We were in the exact same situation in February with the kaniko and Intel images, and it bled into the coding day. We'll keep wasting precious time until we start taking CI seriously and fix failing master before it collides with the regular flow of docker PRs.

@KaiSzuttor (Member)

@fweik and I discussed this. The espresso CI should be completely independent of updates to images in the docker repository. We should pin the versions of the images used for CI and only change them via pull requests.

@jngrad (Member Author) commented Mar 12, 2020

@RudolfWeeber and I discussed that as well last month, and we wanted to have CI images follow the release workflow of espresso, e.g.

  • 4.1-ubuntu:18.04
  • 4.2-ubuntu:20.04

This way we can pin the distro instead of having rocm:latest or centos:next tags that change at random.
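
A hedged sketch of how such release-prefixed images could be built and pushed. Note that a docker tag cannot contain a colon, so the scheme would have to look more like 4.2-ubuntu-20.04; the repository name and build-context directory below are hypothetical.

```sh
# Hedged sketch: publish a CI image under a release-prefixed tag so that the
# 4.2 CI always pulls an Ubuntu 20.04 based image (names are hypothetical).
docker build -t espressomd/docker:4.2-ubuntu-20.04 ubuntu-20.04/
docker push espressomd/docker:4.2-ubuntu-20.04
```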

@KaiSzuttor (Member)

But this is not related to releases. We should have an internal versioning scheme at all times; that way the docker container development is disentangled from the espresso CI.

@jngrad (Member Author) commented Mar 12, 2020

so you mean ubuntu:18.04:1, ubuntu:18.04:2, ..., ubuntu:18.04:N?

@KaiSzuttor (Member)

yes

@jngrad (Member Author) commented Mar 12, 2020

But then, how do we regenerate a corrupted or deleted image? We would have to pin every single apt and python package to guarantee we don't get anything newer, which is tricky if you want to install a new python package that depends on a newer version of a pinned package (see the sketch below).
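
To illustrate what exhaustive pinning would mean inside an image build, a hedged sketch; every version string here is made up, not taken from the actual images.

```sh
# Hedged sketch of exhaustive pinning (all version strings are made up).
apt-get install -y --no-install-recommends \
    libboost-all-dev=1.71.0.0ubuntu2 \
    cmake=3.16.3-1ubuntu1
pip3 install 'cython==0.29.14' 'numpy==1.17.4'
# Installing an extra python package later may force one of these pins to
# move, which is exactly the maintenance burden described above.
```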

@fweik (Contributor) commented Mar 12, 2020

I don't think manual versioning is the right approach. It's possible to pull a specific hash, which is what I think we should use. You say in the CI config ubuntu@DIGEST, which refers to a specific state of the image. To change it, you make a PR with the new DIGEST, so you can check that it works before updating. This way changes to the docker repo do not automagically break the CI in the main repo, and you can update the CI image and the CI config/tests in one go.
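
A hedged sketch of what pulling by digest looks like; the repository name is hypothetical and the digest is a placeholder.

```sh
# Hedged sketch: look up the digest behind a tag, then pull by digest
# (repository name is hypothetical, <digest> is a placeholder).
docker pull espressomd/docker:ubuntu-20.04
docker inspect --format '{{index .RepoDigests 0}}' espressomd/docker:ubuntu-20.04
# -> espressomd/docker@sha256:<64 hex characters>

# Pulling by digest always yields exactly that image state:
docker pull espressomd/docker@sha256:<digest>
```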

@mkuron (Member) commented Mar 12, 2020

This way changes to the docker repo do not automagically break the CI in the main repo

They never do. That's why we build and run Espresso tests in the docker repo too.

This ROCm issue should have been investigated 10 days ago when the master CI was red. We were in the exact same situation in February with the kaniko and Intel images, and it bled into the coding day.

This wouldn't have been avoided by any kind of version pinning. If CI in the docker repo is red and you urgently need to change a Dockerfile, you first need to make the master green again either way.

The only thing we could do is pin the Docker base images to specific digests. That's a terrible idea though, as you are not guaranteed that untagged digests will remain available on Docker Hub forever. And distribution base images don't have a version number tag beyond the distribution release number.

@jngrad (Member Author) commented Mar 12, 2020

This wouldn't have been avoided by any kind of version pinning.

That's true, but after offline discussion, what @fweik meant was pinning the digest of the built docker image into the espressomd/espresso .gitlab-ci.yml file, such that nothing happening in espressomd/docker can have an impact over there without a PR. This would avoid CI deadlocks.

They never do.

This happened in the past:

  • when ROCFFT 2.9 broke in e2ddbcb, we had to add AMD GPU tests in 518ffa4
  • when Sphinx 2.4.0 came out, we had to add Sphinx tests in 45e17cd
  • the Python packages for http requests caused CI deadlocks on espressomd/espresso twice

With @fweik's proposal we can disable the test and deploy stages in docker. The build stage already deploys the image to a test subfolder of the cache registry. This means we can open a PR on docker with a modified Dockerfile, open a PR on espresso to update the image digest in .gitlab-ci.yml, and if that espresso PR passes CI, merge both PRs. We would also add a success stage in docker to flag good images and help with garbage collection in the cache registry. It also means a failing master cannot prevent us from deploying new images, because the build stage in a PR already deploys even if one image in the batch fails to build.

This doesn't address all issues, though. For example, we have images that are used by multiple espresso releases: cuda-10.1 contains LaTeX packages because we still need to be able to test all commits in the 4.1 branch. The 4.1/4.2 prefix I proposed above would solve this by preventing name collisions, but it's not an elegant solution. We could either deploy the images to subfolders in the cache registry, one for each minor release, or store the dockerfiles directly in espressomd/espresso. The latter could work by triggering different GitLab-CI pipelines depending on which files changed; the docker pipeline would not have a success job, so a PR could only be merged after a second commit updates the image digest in .gitlab-ci.yml and the regular test pipeline passes.

Also, we would stop using base images with the latest tag, which change at every new OS release.
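
A hedged sketch of the digest-bumping step in that workflow; the registry URL, image name, and the exact image: key layout in .gitlab-ci.yml are assumptions.

```sh
# Hedged sketch: after the docker PR has built and pushed the candidate
# image, record its digest and pin it in the espresso repo (registry URL,
# image name and the "image:" key layout are assumptions).
IMAGE=registry.example.org/espressomd/ubuntu-20.04:latest
docker pull "$IMAGE"
DIGEST=$(docker inspect --format '{{index .RepoDigests 0}}' "$IMAGE")
cd espresso
git checkout -b bump-ci-image
sed -i "s|^\( *image: \).*|\1${DIGEST}|" .gitlab-ci.yml
git commit -am "CI: pin new Ubuntu 20.04 image digest"
```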

@jngrad merged commit dc7f735 into espressomd:master Mar 12, 2020
@mkuron (Member) commented Mar 12, 2020

pinning the digest of the built docker image into the espressomd/espresso .gitlab-ci.yml file

In the past, we occasionally had to delete all images from the registry and re-run the build job because the registry got corrupted. This hasn't happened in at least a year, but it makes pinning the digest potentially fragile.

@jngrad (Member Author) commented Mar 12, 2020

This hasn't happened in at least a year,

This happened last month: the cached layers got mixed with corrupted layers and I had to delete all of them. Some cached layers probably got silently corrupted and went live in the deploy stage; we can't know because these deployed images are not recycled.

it makes pinning the digest potentially fragile

True, we can't retry an old CI pipeline if the image is damaged or lost. But this is also true for our current infrastructure: our dockerfiles need to remain valid for the duration of at least two espresso minor releases. We take a lot of care not to introduce breaking changes in a dockerfile, but we don't actually verify it. We would have to periodically retry old pipelines to make sure that our dockerfiles are fully backward-compatible, which is computationally prohibitive. How confident are we that the dockerfiles in master generate images that still work for the first merge commit of 4.1.0? Furthermore, a few images in master "disappeared" recently. They should re-appear in the planned 4.1 docker branch when 4.2.0 is released, but this bookkeeping is a waste of developer time.

By having the dockerfiles directly in espressomd/espresso and pinned images in .gitlab-ci.yml, we guarantee that docker images and espresso are always in sync. If we need to delete all images, we only need to retry the docker CI pipeline in espressomd/espresso on the latest merge commit of the python branch and in every PR that changed a dockerfile. We can't retry old pipelines, but we don't really need to, and we can't guarantee that this is possible anyway (e.g. Sphinx 2.4.0 in cuda:10.1 prevents us from retrying all pipelines older than Feb 11 2020, espressomd/espresso@e0a0e7c7aae1).

By having dockerfiles directly in espressomd/espresso, we can also have an extra CMake command that generates a docker image suitable to build espresso. There will be no guarantee that this image will be from a dockerfile that is in sync with the current commit, but we could guarantee that by having .gitlab-ci.yml store both the image digest and the corresponding dockerfile commit.
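
A hedged sketch of what such a CMake-driven target might run underneath; the target name, Dockerfile location, and image name are all hypothetical.

```sh
# Hedged sketch of the command a hypothetical `make docker_image` target
# could wrap: build an image from a Dockerfile shipped in the source tree,
# tagged with the current commit (paths and names are assumptions).
docker build \
  -f maintainer/docker/Dockerfile-ubuntu-20.04 \
  -t espressomd/espresso-dev:$(git rev-parse --short HEAD) \
  .
```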

@mkuron (Member) commented Mar 12, 2020

By having dockerfiles directly in espressomd/espresso

We moved the Dockerfiles to a separate repository three years ago because @KaiSzuttor got annoyed with non-Espresso pull requests and files polluting the Espresso repository.

@KaiSzuttor (Member)

We moved the Dockerfiles to a separate repository three years ago because @KaiSzuttor got annoyed with non-Espresso pull requests and files polluting the Espresso repository.

So? That is a socially motivated argument against it. Do you have any technical criteria for or against?

@jngrad deleted the update-images-add-ubuntu20 branch April 18, 2022 16:01
Successfully merging this pull request may close these issues:

  • Support for ROCm 3.1.0