
Fix failing workflows #502

Closed
Jose-Matsuda opened this issue Aug 15, 2023 · 5 comments · Fixed by #508

@Jose-Matsuda
Contributor

Some relevant information is in #500 (comment).

I don't know what caused this. It doesn't seem to be Bryan's most recent changes; I tried reverting them, but it made no difference.

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Aug 16, 2023

Bryan thought it could be something to do with memory, but the sizes of the images haven't changed significantly since the builds started failing. I built jupyterlab-cpu locally from the main branch and it was only 7.74 GB.

@bryanpaget
Contributor

bryanpaget commented Aug 17, 2023

We are almost always able to build RStudio (which weighs in at 9.2GB) and sometimes we can build JupyterLab-CPU (which weighs in at 7.57GB). JupyterLab-Pytorch is ~18GB.

pagetbr@l-pagetbr-2:~/aaw-kubeflow-containers/output/rstudio$ docker images
REPOSITORY                     TAG                            IMAGE ID       CREATED         SIZE
rstudio                        latest                         1a7696064cd2   3 hours ago     9.2GB
jupyterlab-cpu                 latest                         1a0dfc4c6a1b   23 hours ago    7.57GB
jupyter/datascience-notebook   ed2908bbb62e                   3a31ab32404b   10 months ago   4.43GB
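
If this were a resource problem on the hosted runner, a quick sanity check (just a sketch, assuming a shell on the runner or a comparable local machine) would be to look at free disk, memory, and Docker's own disk usage:

  # Free disk space on the root filesystem
  df -h /
  # Available memory
  free -h
  # Disk used by Docker images, containers, volumes and build cache
  docker system df

Hosted runners have limited disk, so a large image plus the build cache could plausibly run out of space even when the image size itself has not changed much.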

@bryanpaget
Contributor

bryanpaget commented Aug 17, 2023

I have created a JIRA ticket to track this issue:

And I have posted an issue on the GitHub community message board:

While I wait for the GitHub community to respond to my post, I have the following courses of action:

  • Look into self-hosted GitHub runners, perhaps hosting them on the AAW itself.
  • Look into the StatCan GitLab instance and see whether I can mirror the aaw-kubeflow-containers repo into GitLab and use GitLab CI/CD instead of the GitHub runner.

The following tickets are blocked by this issue:

@bryanpaget
Contributor

bryanpaget commented Aug 22, 2023

@vexingly discovered that the version of Docker used by GitHub Actions was updated at the same time our builds started failing.

Runner

For the successful builds before August 04th 2023, we have the following image version:

Runner Image
  Image: ubuntu-22.04
  Version: 20230724.1.0
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230724.1/images/linux/Ubuntu2204-Readme.md

And when the builds started failing, we had a new version of the runner image (still based on Ubuntu 22.04):

Image: ubuntu-22.04
  Version: 20230728.3.0
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230728.3/images/linux/Ubuntu2204-Readme.md

Docker

For the older, successful runs:

Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  '1.41'
  Docker daemon API version: '1.41'
  /usr/bin/docker version --format '{{.Client.APIVersion}}'
  '1.41'
  Docker client API version: '1.41'

For the newer, failed runs:

Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  '1.42'
  Docker daemon API version: '1.42'
  /usr/bin/docker version --format '{{.Client.APIVersion}}'
  '1.42'
  Docker client API version: '1.42'

This led us to investigate the Docker Engine API changelog (https://docs.docker.com/engine/api/version-history/), and the epiphany came from this entry:

GET /_ping and HEAD /_ping now return Builder-Version by default. This header contains the default builder to use, and is a recommendation as advertised by the daemon. However, it is up to the client to choose which builder to use.
The default value on Linux is version “2” (BuildKit), but the daemon can be configured to recommend version “1” (classic Builder). Windows does not yet support BuildKit for native Windows images, and uses “1” (classic builder) as a default.
This change is not versioned, and affects all API versions if the daemon has this patch.
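
For reference, the builder version the daemon actually advertises can be checked against the Docker socket directly; this is a sketch assuming the default socket path:

  # HEAD /_ping returns the Builder-Version header described in the changelog
  curl -sI --unix-socket /var/run/docker.sock http://localhost/_ping
  # Builder-Version: 2 -> daemon recommends BuildKit
  # Builder-Version: 1 -> daemon recommends the classic builder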

So we added DOCKER_BUILDKIT=0 to our docker build command in the Makefile to disable BuildKit:

DOCKER_BUILDKIT=0 docker build $(DARGS) --rm --force-rm -t $$IMAGE_NAME ./output/$(notdir $@)

And now our builds are succeeding.
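
If we ever need the classic builder for every docker build recipe rather than a single target, an alternative (a sketch, not what was committed) is to export the variable once near the top of the Makefile:

  # GNU Make: exported to the environment of every recipe,
  # so all docker build invocations use the classic builder.
  export DOCKER_BUILDKIT=0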

@bryanpaget bryanpaget linked a pull request Aug 22, 2023 that will close this issue
@bryanpaget
Contributor

#496 will have to be done next sprint.
