
ci: add support for linux-arm64 build #1214

Closed
wants to merge 48 commits into from

Conversation


@strophy strophy commented Mar 16, 2022

This PR (my first here!) adds GitHub Actions CI to build linux-arm64 artifacts, partially addressing some of the issues raised in #1159.

While writing this, I was unable to use the (deprecated) docker-compose, the newer docker compose, or the docker bake system to build a compose file that uses depends_on. I think this may be due to different behaviour between Docker classic builds and buildx, but buildx is required for multi-arch images. As a result, I had to explicitly build and push each Dockerfile manually and in sequence, so running the action will now also push updated prereq and protoc_plugin images to Docker Hub. Together with running in an emulator, this is quite slow (although caching improves it on subsequent runs), so I'm open to suggestions for improvements. It might also make sense to separate building the docker images from building a release, depending on how often you release.
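A minimal sketch of this sequential build-and-push flow, with QEMU registration via tonistiigi/binfmt as used in this PR. The Dockerfile paths and tags below are illustrative placeholders, not the repo's actual layout:

```bash
# Register QEMU handlers so arm64 images can be built on an amd64 runner
# (this is what the tonistiigi/binfmt image provides).
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create and select a buildx builder capable of multi-arch builds.
docker buildx create --name multiarch --use

# Build and push each image explicitly, in dependency order, instead of
# relying on docker compose to order the builds.
docker buildx build --platform linux/arm64 \
  -f prereqs.Dockerfile -t grpcweb/prereqs:latest --push .
docker buildx build --platform linux/arm64 \
  -f protoc-plugin.Dockerfile -t grpcweb/protoc-plugin:latest --push .
```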

Several underlying versions and dependencies had to be bumped in order to build successfully on arm64, and I switched from bazel to bazelisk, which was also necessary to build.

Dependency Updates:

  • All node base images now use node:16.15.0-bullseye
  • prepare stage now uses buildpack-deps:bullseye for build toolchain consistency, although I think this bump is not strictly necessary
  • Buildifier now at 5.1.0 because 1.0.0 does not support arm64
  • Bazel 4.1.0 replaced with Bazelisk 1.11.0 (installs latest Bazel, currently 5.2.0)
  • google-closure-compiler JS dependency bumped to ~20220301.0.0 to avoid build error
  • google-protobuf JS dependency bumped to 3.19.4 to match existing dependency used in prereqs image

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 16, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@strophy
Author

strophy commented May 3, 2022

I believe this PR can now be improved with the techniques described here: https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/
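For context, the feature described in that post lets one build consume another local build result as a named context, without round-tripping through a registry. A hedged sketch, assuming illustrative Dockerfile names and a dependent Dockerfile that references the context by name (e.g. FROM prereqs):

```bash
# Build the prereqs image locally first.
docker buildx build -f prereqs.Dockerfile -t grpcweb/prereqs:dev --load .

# Feed the local image to the dependent build as a named context; a
# "FROM prereqs" line in protoc-plugin.Dockerfile would resolve to it.
docker buildx build \
  --build-context prereqs=docker-image://grpcweb/prereqs:dev \
  -f protoc-plugin.Dockerfile -t grpcweb/protoc-plugin:dev .
```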

Before I do this, is this project still maintained? Is this feature even wanted or likely to be merged? Just checking before I do any more work, since I haven't seen any interaction here.

@sampajano
Collaborator

sampajano commented May 6, 2022

@strophy Thanks so much for the contrib! Apologies on the delay -- i saw this PR a while ago but i was unfamiliar with many of the concepts / tools used here so i didn't get around to it until now. Sorry again!

Pardon my ignorance, but i wonder how popular linux-arm64 is? What are its most common use cases?

In general, we're aiming to provide pre-built binaries for the most popular platforms, and so i'm not sure we're willing to pay for the complexity of maintaining less popular ones, especially if the complexity is high.

There are a few things in this PR that i'm not too sure about.. and i wonder if they can be addressed somehow:

  1. It's complicating the workflow by not using docker compose -- that might be addressed if you create a completely new workflow rather than reusing the existing one (and make it an experimental one).

  2. It uses some seemingly less official packages (e.g. tonistiigi/binfmt) and requires temporary hacks like "Bazel does not run in emulated docker environment" bazelbuild/bazel#11379 -- again showing the support is maybe too cutting-edge.

  3. It updates many dep versions (e.g. buildpack-deps, node, buildifier), which i'm not sure are all necessary and desired... (they maybe are, but it again shows that by supporting linux-arm64 we're moving a lot closer to the cutting edge.. :))

Also, curious how this PR is related to #1159, as you've mentioned in the PR description? Do you think this will be a progressive step towards getting us M1 support? (and if so, how big is the gap there?)

In any case, thanks so much for the PR again!


Btw, just ran CI and it's failing early. Could you take a look at the failure (link)? Thanks!

```diff
@@ -49,10 +51,11 @@ RUN ./scripts/init_submodules.sh
 ######################################
 # Stage 2: Copy source files and build
 ######################################
-FROM node:12.22.6-stretch AS copy-and-build
+FROM node:16-bullseye AS copy-and-build
```
Author

This change prevents the following error under arm64 when running the closure compiler:

#0 159.1 /github/grpc-web/packages/grpc-web/node_modules/.bin/google-closure-compiler --js=exports.js --js=../../javascript --js=node_modules/google-closure-library --entry_point=grpc.web.Exports --externs=externs.js --dependency_mode=PRUNE --compilation_level=ADVANCED_OPTIMIZATIONS --generate_exports --export_local_property_definitions --js_output_file=index.js
#0 179.0 npm ERR! code ENOENT
#0 179.0 npm ERR! syscall access
#0 179.0 npm ERR! path /github/grpc-web/packages/grpc-web/node_modules/grpc-web.js
#0 179.1 npm ERR! errno -2
#0 179.1 npm ERR! enoent ENOENT: no such file or directory, access '/github/grpc-web/packages/grpc-web/node_modules/grpc-web.js'
#0 179.1 npm ERR! enoent This is related to npm not being able to find a file.
#0 179.1 npm ERR! enoent 

Collaborator

Aha thanks for the context!

Do you mind replacing all 12.22.6 references in this repo to the same node version tho (there are 3 files afaict)?

Ideally to a more specific dot release to ensure our test here will always be running the same node version as the node server.
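A quick way to locate (and, assuming a plain find-and-replace is safe for those files, update) the remaining pins; the target version here is just the one proposed in this PR:

```bash
# List files still pinning the old node version.
grep -rln '12\.22\.6' .

# Replace the pinned image tag in place (GNU sed syntax).
grep -rln '12\.22\.6-stretch' . | xargs sed -i 's/12\.22\.6-stretch/16.15.0-bullseye/g'
```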

```diff
-    ./bazel-installer.sh && \
-    rm ./bazel-installer.sh
+RUN wget -nv -O /usr/local/bin/bazelisk \
+    https://github.com/bazelbuild/bazelisk/releases/download/v$BAZELISK_VERSION/bazelisk-linux-$TARGETARCH && \
```
Author

Use of bazelisk is required, since bazel-$BAZEL_VERSION-installer-linux-$arch does not exist for arm64

Collaborator

Thanks for the explanation!

May i ask you a favor to list out all the dependency updates you've made in this PR in a bullet-point summary? (Feel free to mention all other notable changes you made as well!)

```diff
@@ -19,22 +19,24 @@
 ######################################
 # Stage 1: Fetch binaries
 ######################################
 # node:... Docker image is based on buildpack-deps:stretch
-FROM buildpack-deps:stretch AS prepare
+FROM buildpack-deps:bullseye AS prepare
```
Author

This change was made to match base image versions with the requirement to use node:16-bullseye below.


```diff
-ARG BUILDIFIER_VERSION=1.0.0
+ARG BUILDIFIER_VERSION=5.1.0
```
Author

This change is required because Buildifier 1.0.0 does not exist for arm64.

@strophy
Author

strophy commented May 15, 2022

Hello, I have refactored the PR to use docker bake and a .hcl file instead of docker compose and a docker-compose.yml file, since I believe the build logic used in the past incorrectly assumes that Docker Compose will respect depends_on when building (it does not). The build logic now actually waits for the prereqs stage to complete, instead of pulling a version based on a previous build from Docker Hub. We also now build in a matrix, so x86_64 and arm64 builds take place on separate runners, which greatly simplifies the GitHub Actions workflow.
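A rough sketch of what each matrix leg could run; the bake file name and target names are assumptions for illustration, not necessarily what this PR uses:

```bash
# Each matrix leg (amd64 / arm64) runs on its own runner; TARGETARCH is
# assumed to come from the workflow matrix.
TARGETARCH=arm64   # or amd64

docker buildx create --use
docker buildx bake -f release.hcl \
  --set "*.platform=linux/${TARGETARCH}" \
  --push \
  prereqs protoc-plugin
```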

I have documented the various build errors that occur if I do not bump the dependency versions. It might be possible to resolve some of these in another way, but I am tempted to just use newer versions that take non-x86 targets into account rather than debug dependencies.

I still occasionally face the following issues:

  • prereqs build randomly fails in CI with the following error when running bazelisk build. This seems to point to an OOM problem, but I am unable to reliably reproduce it and memory use never exceeds 2GB when running on a GitHub Actions runner. I would be curious if you see this error as well (a possible mitigation is sketched after this list).
#29 177.7 Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/94ede7e4be7ee112c06d5210902f7b7d/server/jvm.out')
#29 177.7 
#29 ERROR: process "/bin/sh -c bazelisk build --jobs=2 javascript/net/grpc/web/generator:protoc-gen-grpc-web &&   cp $(bazelisk info bazel-genfiles)/javascript/net/grpc/web/generator/protoc-gen-grpc-web   /usr/local/bin/protoc-gen-grpc-web" did not complete successfully: exit code: 37
  • Cache does not always hit; we should maybe consider including a cache mount
  • Output binaries still need testing
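If the failure really is memory-related, one hedged mitigation would be to cap Bazel's resource usage explicitly in addition to the existing --jobs=2 throttle; the flag values below are guesses, not tested against this repo:

```bash
# Limit the Bazel server's JVM heap (startup option, before the command)
# and tell Bazel to assume less available RAM for scheduling actions.
bazelisk --host_jvm_args=-Xmx2g build \
  --jobs=2 \
  --local_ram_resources=2048 \
  javascript/net/grpc/web/generator:protoc-gen-grpc-web
```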

@strophy strophy requested a review from sampajano May 15, 2022 09:27
@sampajano
Collaborator

Hi thanks for the reply!

> Thanks for taking a look at this! I agree with all of the points you raised; this PR currently has too many hacks to merge at the moment. The point you raise in your response (1) above is the key issue though - in my testing, the current docker-compose or docker compose based build system also does not work, so I'm curious how you used this to successfully build in the past. I am experiencing the following problem:
>
>   1. Clone the project and run docker compose build prereqs protoc-plugin
>   2. We would expect that Docker Compose respects the depends_on statements in docker-compose.yml, but this is not the case. depends_on is only respected by the up and stop commands; it is ignored by the build command. Documentation. Discussion.

Oh aha i see!! Thanks for providing the context!

Actually, historically i've always been using docker-compose (instead of docker compose) for both build and up, and as i tried it just now, it seems that docker compose has significantly different behavior in that it no longer respects the depends_on statement, as you've pointed out.

This is actually an interesting point you've brought up, because i've noticed the mismatch between build-time vs. runtime dependency before, but i didn't know if there's an alternative way of specifying a build-time dependency other than using depends_on.. (guess it won't work now either 🙃)

>   1. The result is that instead of waiting for grpcweb/prereqs to be built, it instead immediately pulls the already existing old version from Docker Hub and uses this to start the grpcweb/protoc-plugin build in parallel with grpcweb/prereqs. This causes my build to fail, because the Docker Hub version of the image is incompatible with the changes my build requires.
>   2. Instead, if I "manually" target the steps in two commands like docker compose build prereqs && docker compose build protoc-plugin then the build is successful, because grpcweb/prereqs will exist in the locally available images when the second command runs.
>
> This is why I separated the build CI into steps and stopped using Docker Compose. Can you clarify if this is expected behavior for you? Have you been able to get docker compose build to respect the depends_on statements somehow? If not, what do you think about using the following approach as a way forward: https://github.com/docker/buildx/blob/master/docs/reference/buildx_bake.md#using-a-result-of-one-target-as-a-base-image-in-another-target

I really appreciate you finding an alternative solution using buildx here! Although, i'm a bit hesitant to do this across the board (other than just in this workflow), because we have a bunch of documentation on various sites that we'd need to update too (example) if we decide to do that, and i'm not yet convinced that docker buildx bake + HCL is broadly adopted enough for that purpose.. (maybe it is? :))

For that reason, may I ask you to "fork" the workflow into a separate make-plugin-linux-arm64.yml config? So that it remains a bit "experimental" until we eventually decide to converge? Thanks!
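For reference, a minimal sketch of the bake-file "contexts" approach linked in the quoted comment above, where one target consumes another target's result as its base image. Target names, tags and Dockerfile paths are illustrative; the bake definition is written as a heredoc so the whole experiment is one script:

```bash
cat > arm64-experiment.hcl <<'EOF'
target "prereqs" {
  dockerfile = "prereqs.Dockerfile"
  tags       = ["grpcweb/prereqs:dev"]
}

target "protoc-plugin" {
  dockerfile = "protoc-plugin.Dockerfile"
  tags       = ["grpcweb/protoc-plugin:dev"]
  contexts = {
    # "target:" makes bake build prereqs first and feed the result in,
    # so a "FROM prereqs" line in the Dockerfile uses the fresh build
    # instead of pulling an old image from Docker Hub.
    prereqs = "target:prereqs"
  }
}
EOF

docker buildx bake -f arm64-experiment.hcl protoc-plugin
```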

@sampajano sampajano left a comment (Collaborator)

Thanks so much for the updates! Just a few comments! 😃

Mainly, as mentioned in a previous comment, would you mind forking out the workflow changes into a separate file for now? We will consider merging them as we adopt buildx + hcl more broadly. Thanks!!


UPDATE: Ahh there's a CI error too :)

Step 7/35 : RUN case $TARGETARCH in arm64) arch="aarch_64"; ;; amd64) arch="x86_64"; ;; *) echo "ERROR: Machine type $TARGETARCH not supported."; ;; esac && curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v$PROTOBUF_VERSION/protoc-$PROTOBUF_VERSION-linux-$arch.zip -o protoc.zip &&   unzip -qq protoc.zip &&   cp ./bin/protoc /usr/local/bin/protoc
 ---> Running in 4f88b2ef6fea
ERROR: Machine type  not supported.
[protoc.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of protoc.zip or
        protoc.zip.zip, and cannot find protoc.zip.ZIP, period.
Service 'prereqs' failed to build: The command '/bin/sh -c case $TARGETARCH in arm64) arch="aarch_64"; ;; amd64) arch="x86_64"; ;; *) echo "ERROR: Machine type $TARGETARCH not supported."; ;; esac && curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v$PROTOBUF_VERSION/protoc-$PROTOBUF_VERSION-linux-$arch.zip -o protoc.zip &&   unzip -qq protoc.zip &&   cp ./bin/protoc /usr/local/bin/protoc' returned a non-zero code: 9

@strophy
Author

strophy commented May 18, 2022

Thanks for reviewing! I've made the requested changes, but any testing is incredibly frustrating due to random failures when running Bazelisk. I can't reproduce the failure reliably, and it only occurs when building the whole file: stopping the build just before the problematic step and then running the command directly inside the partially built image always works perfectly, so there is no easy way for me to read the contents of the error log. I think it is best if we pause here and wait for the release of Bazel 5.2.0, which should contain some fixes relating to running in an emulator, allowing us to remove some of the hacks in this PR.

I think it's also best if we stop working on caching solutions and do that in a separate PR; since we can now build arm64 in a separate experimental workflow, speed is no longer a priority.

List of updated dependencies:

  • All node base images now use node:16.15.0-bullseye
  • prepare stage now uses buildpack-deps:bullseye for build toolchain consistency, although I think this bump is not strictly necessary
  • Buildifier now at 5.1.0 because 1.0.0 does not support arm64
  • Bazel 4.1.0 replaced with Bazelisk 1.11.0 (installs Bazel 5.1.1, currently broken, wait for 5.2.0)
  • google-closure-compiler JS dependency bumped to ~20220301.0.0 to avoid build error
  • google-protobuf JS dependency bumped to 3.19.4 to match existing dependency used in prereqs image

I'm unable to spend much more time working on this, and I don't know how your CI system (kokoro?) works. From the log output I can see it is using the Docker classic builder to attempt to process a Dockerfile that requires buildx to build, so some changes will be necessary here.
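For whoever picks this up: the classic builder leaves ARG TARGETARCH empty (which is exactly the "Machine type  not supported" failure shown earlier), whereas BuildKit populates it automatically. A hedged sketch of two ways the CI invocation could be switched; the Dockerfile path and tag are illustrative:

```bash
# Option 1: keep a plain `docker build` but route it through BuildKit,
# which populates TARGETARCH/TARGETPLATFORM automatically.
DOCKER_BUILDKIT=1 docker build -f prereqs.Dockerfile -t grpcweb/prereqs:dev .

# Option 2: use buildx explicitly, which also allows selecting the platform.
docker buildx build --platform linux/amd64 \
  -f prereqs.Dockerfile -t grpcweb/prereqs:dev --load .
```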

Also interesting to note how building from a compose file will likely be handled in the future here; tl;dr, it will parse the compose yml file, generate .hcl files, and delegate everything to buildx bake anyway.

@sampajano
Collaborator

> Thanks for reviewing! I've made the requested changes, but any testing is incredibly frustrating due to random failures when running Bazelisk. I can't reproduce the failure reliably, and it only occurs when building the whole file: stopping the build just before the problematic step and then running the command directly inside the partially built image always works perfectly, so there is no easy way for me to read the contents of the error log. I think it is best if we pause here and wait for the release of Bazel 5.2.0, which should contain some fixes relating to running in an emulator, allowing us to remove some of the hacks in this PR.

Oh ok i see!! Very sorry to hear about the frustrations! I'm perfectly fine to pause here and wait for the 5.2.0 release! 😃

> I think it's also best if we stop working on caching solutions and do that in a separate PR; since we can now build arm64 in a separate experimental workflow, speed is no longer a priority.

Sounds great to me too! Speed is indeed not an issue here at all (since we only do this every x months really..) 😃

> List of updated dependencies:
>
>   • All node base images now use node:16.15.0-bullseye
>   • prepare stage now uses buildpack-deps:bullseye for build toolchain consistency, although I think this bump is not strictly necessary
>   • Buildifier now at 5.1.0 because 1.0.0 does not support arm64
>   • Bazel 4.1.0 replaced with Bazelisk 1.11.0 (installs Bazel 5.1.1, currently broken, wait for 5.2.0)
>   • google-closure-compiler JS dependency bumped to ~20220301.0.0 to avoid build error
>   • google-protobuf JS dependency bumped to 3.19.4 to match existing dependency used in prereqs image

Thanks so much for the list of deps! I'll edit the PR summary to add these 😃

> I'm unable to spend much more time working on this, and I don't know how your CI system (kokoro?) works. From the log output I can see it is using the Docker classic builder to attempt to process a Dockerfile that requires buildx to build, so some changes will be necessary here.

Do you think the Bazel 5.2.0 will help resolve the issue here?

I think our kokoro CI essentially just runs the basic and interop scripts. Do they work locally for you?

> Also interesting to note how building from a compose file will likely be handled in the future here; tl;dr, it will parse the compose yml file, generate .hcl files, and delegate everything to buildx bake anyway.

Oh wow! That's awesome! Would probably get a great improvement when it happens!

@strophy
Author

strophy commented Jun 8, 2022

Hi @sampajano Bazel 5.2.0 was released with a lot of ARM fixes, but I am still seeing the same error (Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/94ede7e4be7ee112c06d5210902f7b7d/server/jvm.out')) when running in an emulator. I see you guys already completed arm64 CI using Zig which comes with an impressive speedup, so I think it's probably best to give up on Bazel for building arm64 for now! You can merge this as an alternative approach, or feel free to close this PR 😃

@sampajano
Collaborator

> Hi @sampajano Bazel 5.2.0 was released with a lot of ARM fixes, but I am still seeing the same error (Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/94ede7e4be7ee112c06d5210902f7b7d/server/jvm.out')) when running in an emulator. I see you guys already completed arm64 CI using Zig which comes with an impressive speedup, so I think it's probably best to give up on Bazel for building arm64 for now! You can merge this as an alternative approach, or feel free to close this PR 😃

@strophy Thanks so much for checking again! Indeed the Zig solution is already checked in and seems to be a more self-contained change overall (and seems easy to apply to all existing platforms too). 😃

Thank you so much for the contributions here and iterations again! 😃 I'll close this PR for now but i think whatever discussion/change we have here would help provide a great context should we ever need to reevaluate our build method! 😃

@sampajano sampajano closed this Jun 8, 2022