Update images and add Ubuntu 20.04 #157

Merged
jngrad merged 4 commits into espressomd:master from update-images-add-ubuntu20 on Mar 12, 2020

Conversation

@jngrad (Member) commented Mar 12, 2020

Description of changes:

jngrad added 3 commits March 9, 2020 18:49

  • The issue in v0.16 and v0.17 seems to have been fixed in v0.17.1.
  • The 3.1 release has numerous bugs that can only be fixed upstream.
  • Features gcc 9, Clang 9, boost 1.71 and CMake 3.16.3.
@jngrad (Member Author) commented Mar 12, 2020

A few notes on Clang 9:

  • Many -Wdeprecated-declarations warnings are emitted for Cython code, because tp_print in /usr/include/python3.8/cpython/object.h is marked as deprecated with Py_DEPRECATED(3.8). They can be suppressed with -DCMAKE_CXX_FLAGS=-Wno-deprecated-declarations, and they don't become errors with -DWARNINGS_ARE_ERRORS=ON (see the sketch after this list).
  • clang-tidy finds new issues (e.g. modernize-avoid-c-arrays, bugprone-narrowing-conversions, bugprone-macro-parentheses) in many places; we probably want to defer addressing them to limit interference with the walberla PR.
  • Debug builds can become extremely slow with UBSAN or ASAN (or both) enabled: around 10 tests take between 300 and 900 seconds to run (usually the integrator, thermostat and reaction ensemble tests), and the slow-down varies a lot between two runs of check_python. I'm not sure what causes this; strace -c ./pypresso testsuite/python/integrator_npt.py shows only about 30 ms spent in syscalls, and the number of calls of each syscall type is the same. There is no backport of clang-9 in Ubuntu 16.04, so I could not compare.
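
For reference, a minimal configure-and-profile sketch covering the first and last point. The out-of-source build layout and the Debug build type are assumptions; only the flags and the strace command come from the notes above.

```sh
# Hedged sketch: silence the Cython deprecation warnings at configure time
# (build directory layout and build type are assumptions, not project policy).
mkdir -p build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_FLAGS=-Wno-deprecated-declarations \
  -DWARNINGS_ARE_ERRORS=ON
make -j"$(nproc)"

# Profile one of the slow tests: as noted above, only ~30 ms show up in
# syscalls, so the sanitizer slow-down is not syscall-bound.
strace -c ./pypresso testsuite/python/integrator_npt.py
```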

@jngrad (Member Author) commented Mar 12, 2020

https://gitlab.icp.uni-stuttgart.de/espressomd/docker/-/jobs/214138
@mkuron the ROCm image is really testing my patience right now. apt-get install rocfft now puts librocfft.so in a different folder, /opt/rocm-3.1.0/lib, while the other libs are installed in /opt/rocm/lib. The rocm/dev-ubuntu-18.04:3.0 base image did not change.
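
A quick way to confirm the mismatch inside the image; this is a hedged sketch and the exact file names are assumptions.

```sh
# Hedged sketch: check where the package really puts its library and what
# the /opt/rocm symlink points to (file names are assumptions).
find /opt -maxdepth 3 -name 'librocfft*'
ls -l /opt/rocm               # symlink target: rocm-3.0.0 or rocm-3.1.0?
dpkg -L rocfft | grep '\.so'  # install paths recorded by the package
```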

@fweik requested a review from KaiSzuttor March 12, 2020 12:39
@KaiSzuttor removed their request for review March 12, 2020 12:55
@KaiSzuttor (Member)

@mkuron could you please have a look?

@jngrad (Member Author) commented Mar 12, 2020

@mkuron I've changed the line in /etc/apt/sources.list.d/rocm.list to get packages from 3.0 instead of latest; let's see if it changes anything locally.
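
A minimal sketch of what that pin could look like; the repository URL and suite name are assumptions about what repo.radeon.com publishes for ROCm 3.0, not a quote from the actual Dockerfile.

```sh
# Hedged sketch: replace the "latest" ROCm apt source with a version-pinned
# one (URL and suite name are assumptions; adjust to the real 3.0 repo).
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.0/ xenial main' \
  > /etc/apt/sources.list.d/rocm.list
apt-get update
```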

@KaiSzuttor (Member)

@jngrad please don't invest too much time in this unstable feature.

The ROCm apt URL was pointing to the latest ROCm release, causing
a mix of ROCm 3.0 and ROCm 3.1 libraries to be installed and
breaking the /opt/rocm symlink to /opt/rocm-3.1.0.
@jngrad (Member Author) commented Mar 12, 2020

This ROCm issue should have been investigated 10 days ago when the master CI was red. We were in the exact same situation in February with the kaniko and Intel images, and it bled into the coding day. We'll keep wasting precious time until we start taking CI seriously and fix failing master before it collides with the regular flow of docker PRs.

@KaiSzuttor (Member)

@fweik and I discussed this. The espresso CI should be completely independent of updates to images in the docker repository. We should pin the versions of the images used for CI and only change them via pull requests.

@jngrad (Member Author) commented Mar 12, 2020

@RudolfWeeber and I discussed that as well last month, and we wanted to have CI images follow the release workflow of espresso, e.g.

  • 4.1-ubuntu:18.04
  • 4.2-ubuntu:20.04

This way we can pin the distro instead of having rocm:latest or centos:next tags that change at random.
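
A hedged sketch of how such release-prefixed images could be built and pushed. Note that a docker tag cannot contain a colon, so the scheme would have to look more like 4.2-ubuntu-20.04; the repository name and build-context directory below are hypothetical.

```sh
# Hedged sketch: publish a CI image under a release-prefixed tag so that the
# 4.2 CI always pulls an Ubuntu 20.04 based image (names are hypothetical).
docker build -t espressomd/docker:4.2-ubuntu-20.04 ubuntu-20.04/
docker push espressomd/docker:4.2-ubuntu-20.04
```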

@KaiSzuttor (Member)

But this is not related to releases. We should have an internal versioning scheme at all times; that way the docker container development is disentangled from the espresso CI.

@jngrad (Member Author) commented Mar 12, 2020

so you mean ubuntu:18.04:1, ubuntu:18.04:2, ..., ubuntu:18.04:N?

@KaiSzuttor (Member)

yes

@jngrad (Member Author) commented Mar 12, 2020

But then, how do we regenerate a corrupted or deleted image? We would have to pin every single apt and python package to guarantee we don't get anything newer, which is tricky if you want to install a new python package that depends on a newer version of a pinned package (see the sketch below).
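
To illustrate what exhaustive pinning would mean inside an image build, a hedged sketch; every version string here is made up, not taken from the actual images.

```sh
# Hedged sketch of exhaustive pinning (all version strings are made up).
apt-get install -y --no-install-recommends \
    libboost-all-dev=1.71.0.0ubuntu2 \
    cmake=3.16.3-1ubuntu1
pip3 install 'cython==0.29.14' 'numpy==1.17.4'
# Installing an extra python package later may force one of these pins to
# move, which is exactly the maintenance burden described above.
```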

@fweik (Contributor) commented Mar 12, 2020

I don't think manual versioning is the right approach. It's possible to pull a specific hash, which is what I think we should use. You say in the CI config ubuntu@DIGEST, which refers to a specific state of the image. To change it, you make a PR with the new DIGEST, so you can check that it works before updating. This way changes to the docker repo do not automagically break the CI in the main repo, and you can update the CI image and the CI config/tests in one go.
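
A hedged sketch of what pulling by digest looks like; the repository name is hypothetical and the digest is a placeholder.

```sh
# Hedged sketch: look up the digest behind a tag, then pull by digest
# (repository name is hypothetical, <digest> is a placeholder).
docker pull espressomd/docker:ubuntu-20.04
docker inspect --format '{{index .RepoDigests 0}}' espressomd/docker:ubuntu-20.04
# -> espressomd/docker@sha256:<64 hex characters>

# Pulling by digest always yields exactly that image state:
docker pull espressomd/docker@sha256:<digest>
```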

@mkuron (Member) commented Mar 12, 2020

This way changes to the docker repo do not automagically break the CI in the main repo

They never do. That's why we build and run Espresso tests in the docker repo too.

This ROCm issue should have been investigated 10 days ago when the master CI was red. We were in the exact same situation in February with the kaniko and Intel images, and it bled into the coding day.

This wouldn't have been avoided by any kind of version pinning. If CI in the docker repo is red and you urgently need to change a Dockerfile, you first need to make the master green again either way.

The only thing we could do is pin the Docker base images to specific digests. That's a terrible idea though, as you are not guaranteed that untagged digests will remain available on Docker Hub forever. And distribution base images don't have a version number tag beyond the distribution release number.

@jngrad (Member Author) commented Mar 12, 2020

This wouldn't have been avoided by any kind of version pinning.

That's true, but after offline discussion, what @fweik meant was pinning the digest of the built docker image into the espressomd/espresso .gitlab-ci.yml file, such that nothing happening in espressomd/docker can have an impact over there without a PR. This would avoid CI deadlocks.

They never do.

This happened in the past:

  • when ROCFFT 2.9 broke in e2ddbcb, we had to add AMD GPU tests in 518ffa4
  • when Sphinx 2.4.0 came out, we had to add Sphinx tests in 45e17cd
  • the Python packages for http requests caused CI deadlocks on espressomd/espresso twice

With @fweik's proposal we can disable the test and deploy stages in docker. The build stage already deploys the image to a test subfolder of the cache registry. This means we can open a PR on docker with a modified Dockerfile, open a PR on espresso to update the image digest in .gitlab-ci.yml, and if that espresso PR passes CI, merge both PRs. We would also add a success stage in docker to flag good images and help with garbage collection in the cache registry. It also means a failing master cannot prevent us from deploying new images, because the build stage in a PR already deploys even if one image in the batch fails to build.

This doesn't address all issues, though. For example, we have images that are used by multiple espresso releases: cuda-10.1 contains LaTeX packages because we still need to be able to test all commits in the 4.1 branch. The 4.1/4.2 prefix I proposed above would solve this by preventing name collisions, but it's not an elegant solution. We could either deploy the images to subfolders in the cache registry, one for each minor release, or store the dockerfiles directly in espressomd/espresso. The latter could work by triggering different GitLab-CI pipelines depending on which files changed; the docker pipeline would not have a success job, so a PR could only be merged after a second commit updates the image digest in .gitlab-ci.yml and the regular test pipeline passes.

Also, we would stop using base images with the latest tag, which change at every new OS release.
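
A hedged sketch of the digest-bumping step in that workflow; the registry URL, image name, and the exact image: key layout in .gitlab-ci.yml are assumptions.

```sh
# Hedged sketch: after the docker PR has built and pushed the candidate
# image, record its digest and pin it in the espresso repo (registry URL,
# image name and the "image:" key layout are assumptions).
IMAGE=registry.example.org/espressomd/ubuntu-20.04:latest
docker pull "$IMAGE"
DIGEST=$(docker inspect --format '{{index .RepoDigests 0}}' "$IMAGE")
cd espresso
git checkout -b bump-ci-image
sed -i "s|^\( *image: \).*|\1${DIGEST}|" .gitlab-ci.yml
git commit -am "CI: pin new Ubuntu 20.04 image digest"
```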

@jngrad merged commit dc7f735 into espressomd:master Mar 12, 2020
@mkuron (Member) commented Mar 12, 2020

pinning the digest of the built docker image into the espressomd/espresso .gitlab-ci.yml file

In the past, we occasionally had to delete all images from the registry and re-run the build job because the registry got corrupted. This hasn't happened in at least a year, but it makes pinning the digest potentially fragile.

@jngrad (Member Author) commented Mar 12, 2020

This hasn't happened in at least a year,

This happened last month: the cached layers got mixed with corrupted layers and I had to delete all of them. Some cached layers probably got silently corrupted and went live in the deploy stage; we can't know because these deployed images are not recycled.

it makes pinning the digest potentially fragile

True, we can't retry an old CI pipeline if the image is damaged or lost. But this is also true for our current infrastructure: our dockerfiles need to remain valid for the duration of at least two espresso minor releases. We take a lot of care not to introduce breaking changes in a dockerfile, but we don't actually verify it. We would have to periodically retry old pipelines to make sure that our dockerfiles are fully backward-compatible, which is computationally prohibitive. How confident are we that the dockerfiles in master generate images that still work for the first merge commit of 4.1.0? Furthermore, a few images in master "disappeared" recently. They should re-appear in the planned 4.1 docker branch when 4.2.0 is released, but this bookkeeping is a waste of developer time.

By having the dockerfiles directly in espressomd/espresso and pinned images in .gitlab-ci.yml, we guarantee that docker images and espresso are always in sync. If we need to delete all images, we only need to retry the docker CI pipeline in espressomd/espresso on the latest merge commit of the python branch and in every PR that changed a dockerfile. We can't retry old pipelines, but we don't really need to, and we can't guarantee that this is possible anyway (e.g. Sphinx 2.4.0 in cuda:10.1 prevents us from retrying all pipelines older than Feb 11 2020, espressomd/espresso@e0a0e7c7aae1).

By having dockerfiles directly in espressomd/espresso, we can also have an extra CMake command that generates a docker image suitable to build espresso. There will be no guarantee that this image will be from a dockerfile that is in sync with the current commit, but we could guarantee that by having .gitlab-ci.yml store both the image digest and the corresponding dockerfile commit.
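
A hedged sketch of what such a CMake-driven target might run underneath; the target name, Dockerfile location, and image name are all hypothetical.

```sh
# Hedged sketch of the command a hypothetical `make docker_image` target
# could wrap: build an image from a Dockerfile shipped in the source tree,
# tagged with the current commit (paths and names are assumptions).
docker build \
  -f maintainer/docker/Dockerfile-ubuntu-20.04 \
  -t espressomd/espresso-dev:$(git rev-parse --short HEAD) \
  .
```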

@mkuron (Member) commented Mar 12, 2020

By having dockerfiles directly in espressomd/espresso

We moved the Dockerfiles to a separate repository three years ago because @KaiSzuttor got annoyed with non-Espresso pull requests and files polluting the Espresso repository.

@KaiSzuttor (Member)

We moved the Dockerfiles to a separate repository three years ago because @KaiSzuttor got annoyed with non-Espresso pull requests and files polluting the Espresso repository.

So? That is a socially motivated argument against it. Do you have any technical criteria for or against?

@jngrad deleted the update-images-add-ubuntu20 branch April 18, 2022 16:01
Successfully merging this pull request may close these issues:

  • Support for ROCm 3.1.0