
Proposal for major changes to (and expansion of) containerised build system #1982

Closed
rvagg wants to merge 7 commits

Conversation


rvagg commented Oct 22, 2019

This PR is part of a complex set of changes that I'm proposing. I have enough of it in place that it runs and I can demonstrate it, but nothing is baked in yet. None of this touches existing Jenkins config, and it would be straightforward to undo the changes I've made.

Grab a coffee, I'd like your thoughts @nodejs/build, this is much more than just node-test-commit-linux-containered.

Basic proposal

I'm proposing that we introduce something that's more like Travis than what we have now for our containerised builds: control over what is run, and how, lives outside Jenkins and mostly in the nodejs/node repo itself. This means the build configurations are specific to the release line and don't need VersionSelectorScript or any other mechanism to get involved. What's more, it's very simple to extend to encompass more than what we do in node-test-commit-linux-containered: we can move anything Linux x64 that we want into a container, potentially eliminating a lot of our VMs wherever we're prepared to move from VM testing to container testing.

In addition to possible simplification of our infrastructure ("simplification" may be in the eye of the beholder, I'd like some objective opinions on this), we put configs in the hands of the collaborators. Issues filed in this repository to add some new test configuration (e.g. --ninja & --debug --enable-asan) become unnecessary in most cases, since adding one would just mean editing a YAML file to add the new build and test commands in a new entry. Adding new distro versions (e.g. Alpine, Fedora, non-LTS Ubuntu) becomes much easier and could mostly happen without Build WG involvement if enterprising collaborators notice the need and step up.

Technical summary

  1. nodejs/node is equipped with a new top-level file, .ci.yml. You can see it here: build,test: add .ci.yml for containered tests (WIP proposal) node#30057. This file contains a list of tests to run for that branch; each test has an "image" (mapping to a Docker container), a "label" (used for reporting status back to the PR) and an "execute" block (the Bash required to run the test, which could be as little as make run-ci -j $JOBS). A sketch of a couple of entries follows this list.
  2. https://github.com/rvagg/node-ci-containers/ (name & location unimportant for now) contains a set of containers that the "image" entries map to. They're published on Docker Hub (rebuilds triggered on push, but new images need to be manually wired up in Docker Hub).
  3. Jenkins has a new job, https://ci.nodejs.org/job/rv-test-commit-linux-docker/ (name unimportant for now), that would replace node-test-commit-linux-containered and some of node-test-commit-linux and node-test-commit-custom-suites-freestyle too. The innards of this job are intentionally simple and can mostly be controlled outside of Jenkins.
    a. It uses a YAML Axis plugin to parse .ci.yml and use CONTAINER_TEST linux_x64_container_suite as a matrix axis, so the number of jobs run is controlled by .ci.yml.
    b. It uses curl -sL https://raw.githubusercontent.com/rvagg/node-ci-containers/master/execute.sh | bash - to execute the tests, which is also fairly simple and mainly just pulling out pieces of .ci.yml for each test run.
    c. It also uses the "label" property of each test to report back to the GitHub Status API, which makes it quite descriptive, see build,test: add .ci.yml for containered tests (WIP proposal) node#30057 for example - it might be a bit messy when you add back in the normal node-test-commit-* though.
  4. Our 4 existing Docker hosts, each of which currently runs a Jenkins node inside each of ~8 containers, would be reset with the configuration in this PR. playbooks/jenkins/docker-host.yaml and some friends would be removed, and the hosts become a little more like the existing Jenkins nodes, running a single node but allowing for parallelism of ~ncpus/2.
    a. docker-node-exec.sh in this PR does the "run Docker" work and is the only thing the iojs user is allowed to sudo. It's patterned off a docker-node-exec.sh that we use on the Raspberry Pis and on the Scaleway ARMv7 machines (but in this case it docker runs, not docker execs).
    b. Their workspaces become fatter, and we may need to control the disk usage more carefully than I currently am (perhaps a git clean -fdx at the end of each build).
    c. Together, the 4 hosts (maybe we add more when we free up space by removing some other VMs) and their ~ncpu/2 workers create a pool that new builds are distributed across. The size of this pool will be dictated by the branch on nodejs/node with the most complicated .ci.yml.
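
To make item 1 concrete, a couple of .ci.yml entries could look roughly like this. Only the "image", "label" and "execute" field names come from the description above; the suite names, values and the ASan example are illustrative, and the real schema is whatever lands via nodejs/node#30057.

# Illustrative sketch of .ci.yml entries; only the image/label/execute field
# names are taken from the proposal, everything else is an example.
fedora30:
  image: fedora30            # maps to a container published from node-ci-containers
  label: test-fedora30-x64   # reported back to the PR via the GitHub Status API
  execute: |
    make run-ci -j $JOBS
asan:
  image: ubuntu1804
  label: test-asan-x64
  execute: |
    CONFIG_FLAGS='--debug --enable-asan' make run-ci -j $JOBS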

You can see more documentation in .ci.yml, execute.sh, docker-node-exec.sh, and the node-ci-containers repo. I've been pretty liberal with docs so this is as clear as I can make it.
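To show how pieces 3(b) and 4(a) fit together, here is a condensed, purely illustrative sketch of the flow; the real logic lives in execute.sh and docker-node-exec.sh, and the PyYAML parsing, wrapper path and variable names here are assumptions rather than the actual implementation.

#!/bin/bash
# Condensed, illustrative sketch of the 3(b)/4(a) flow; not the real scripts.
# Assumes python3 + PyYAML on the host and a .ci.yml shaped like the sketch above.
set -e

suite="$1"                             # the linux_x64_container_suite matrix axis value
image_base="rvagg/node-ci-containers"  # would move to a project-owned Docker Hub org

# Pull this suite's image name and execute block out of .ci.yml on the branch under test.
image="$(python3 -c "import yaml; print(yaml.safe_load(open('.ci.yml'))['$suite']['image'])")"
python3 -c "import yaml; print(yaml.safe_load(open('.ci.yml'))['$suite']['execute'])" > node-ci-exec.sh

# The only thing the iojs user is allowed to sudo: a wrapper that does `docker run`
# against the workspace (path is hypothetical).
sudo /usr/local/bin/docker-node-exec.sh "$image_base:$image" "$PWD"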

Complexity reduction?

I'm not eliminating as much as I'd like in this, but I believe I'm lessening the maintenance burden on Build by quite a bit. Some thoughts:

  • node-test-commit-linux-containered is only maintained by me, as are the Docker hosts, and its complexity is hidden inside the Jenkins job. This change pushes the complexity out of Jenkins and into .ci.yml, in the hands of collaborators, and into other places that are open source, documented and available for tweaking. The Docker hosts become a bit more "normal" and hopefully not as intimidating for other Build WG members.
  • VersionSelectorScript is simplified a little: it exists to map branch to builder, but .ci.yml lives on the branch, and select-compiler.sh doesn't need to get involved because compiler selection is all in the Dockerfiles. However, I am introducing a third vector for selecting what gets run for which version of Node, so I'm trading one kind of complexity for another. We could extend .ci.yml too, potentially making it the source of truth for versioning?
  • Dockerfiles are easier than Ansible and more approachable by the average dev. I think .ci.yml is not difficult to grok either, it's much simpler than Travis and GitHub Actions but in a similar style. It's also documented, as are the other parts of this system. The invitation to edit is right there.
  • node-test-commit-custom-suites-freestyle can stop getting involved in node-test-commit, you can see in my .ci.yml I have a --worker test runner in there.
  • Alpine can move in to .ci.yml.
  • We've resisted "container all the things" in favour of trying to test closer to "native". The biggest problem with container testing everything is that you have no kernel variability, so you miss some permutation coverage. But, I think we might be at the point where we can slide that dial a bit to reduce infra complexity.
    • The most obvious targets for reduction in infra are Fedora, non-LTS Ubuntu, and even Debian. I've also added CentOS in there but am hesitant to suggest going whole-hog with CentOS since we build releases on it. node-test-commit-linux could be reduced significantly in favour of containers. We retain a mix but get to control how far we go to one side.
  • This could be expanded beyond x86.

set -e

OPTIND=1
image_base="rvagg/node-ci-containers"

We should probably create a user for the Node.js project and use that instead of rvagg right?


I see your comment elsewhere about the path not being important so seems like you have already acknowledged we should do something on that front.

@mhdawson

I'll need more time to review and think, but overall it looks interesting and an improvement over our current containerized builds. We (the Node.js team in IBM) have talked about how we might leverage Docker to achieve more diversity in the Community CI without having to add more machines for Linux on Z and Linux on P, and this looks like a good possibility.


rvagg commented Oct 23, 2019

To clarify, all references to rvagg in this would need to change if/when this is landed and adopted. Mainly that means finding a home for node-ci-containers and a publish point on Docker Hub. It's been too long since I attempted to do org-level stuff on Docker Hub, so I don't know how complicated it's going to be to pull it out of nodejs/, but I assume that's possible. I still have access to an 'iojs' org in Docker Hub from pre-merge! Nothing is going on in there at the moment and I'm not sure if it's usable for repos in this org.

Also, when you're considering this, the approach is really portable outside of Node. We could easily do this same thing for libuv, node-gyp, and others. The Jenkins job is more complicated now than it needs to be because of test.tap and GitHub status reporting. Strip that out and it's fairly basic.


rvagg commented Oct 23, 2019

Here's what we could do with the current node-test-commit-linux nodes, retaining only 4 nodes that are most commonly in use, relatively easy to maintain, and/or are used for release builds.

Moved to new containers:

  • alpine-last-latest-x64
  • alpine-latest-x64
  • debian8-64 (Node <=12)
  • debian9-64
  • fedora-last-latest-x64
  • fedora-latest-x64
  • ubuntu1404-64

Retired EOY with Node 8:

  • centos6-64-gcc48
  • centos7-64-gcc48
  • debian8-x86
  • ubuntu1404-32
  • ubuntu1604-32

Retained:

  • centos6-64-gcc6
  • centos7-64-gcc6
  • ubuntu1604-64
  • ubuntu1804-64

CentOS 6 should be retired with Node 12, but we need to add CentOS 8 in the coming months.


rvagg commented Oct 24, 2019

I'm ironing out bugs and it's starting to work pretty well now. I also have configs for 12.x and 10.x and just need to add openssl1.0.2 for 8.x (it seems pointless given the 2 months of life left in 8.x, but it's not too hard to add).

I have one outstanding problem I haven't resolved. With some randomness, roughly 1 in every 20 runs or so, the containers will stop compiling. I haven't seen it in the Alpine containers yet but it's been distributed across the rest so far. In the middle of a compile it'll just stop, like this one. No errors, no warnings, it just stops and the build gets killed due to a timeout set in Jenkins. Execing into the container and running ps, it looks like this:

  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /bin/bash -xec cd /home/iojs/workspace && . ./node-ci-exec.sh
    8 ?        S      0:00 make run-ci -j 2 V=
   66 ?        S      0:00 make
  112 ?        S      0:01 make -C out BUILDTYPE=Release V=
20411 ?        Z      0:00 [sh] <defunct>
20412 ?        Z      0:00 [sh] <defunct>

Note the two <defunct> processes, which I assume were orphaned from make somehow. I can't tell what they were doing or why make didn't get the message that they're dead; it just stops, and I assume it thinks that the last process it spawned is still running.

I've had these containers run with --init too, so it's not a zombie pruning problem (the containers with docker-init still have the <defunct> processes in them).

I can't really recommend pushing this out if we're going to have random timeouts that look to collaborators like the compile was too slow. So I'm left scratching my head wondering what to try next to sort this out.

rvagg mentioned this pull request Oct 25, 2019

rvagg commented Oct 30, 2019

I'm struggling to deal with hanging containers. I've seen it on all the varieties now. It's related to the spawning of child processes from make: they hang because make seems to think it's waiting for child processes to finish, but they have already finished and are just sitting there <defunct>. I've tried all sorts of things with docker-init (--init), and I've tried switching out the root process from bash to sh because I know bash does some more sophisticated things with process reaping.

  1. I don't know what's causing the zombification of these child processes; there really shouldn't be a problem with the compiles and I can't see any errors. It looks like they just detach and the exit signal isn't sent upward.
  2. I can't get proper zombie reaping working in these containers no matter how I configure them, I always get <defunct> processes.

This is the closest thing I've found so far describing the problem: https://forums.docker.com/t/docker-container-started-with-init-still-leaves-zombie-processes/54729 but, having received no feedback, the author ends with:

As workaround I created images that automatically start an SSH daemon and modified the ‘docker exec’ to start commands via “ssh 127.0.0.1 ${COMMAND}”.

I'd rather not have to go that far.

The only other alternative I haven't tried is a custom init; the one Docker comes with, Tini, just might not be doing its job quite right. But then I'd have to get that into all of the various containers and that's a pain. Yelp has one (dumb-init) that seems relatively popular.
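For what it's worth, a way to trial a different init without rebuilding every image would be to bind-mount a statically linked dumb-init binary (from the Yelp/dumb-init releases) and use it as the entrypoint; the sketch below assumes exactly that, with the image tag and paths as placeholders.

# Sketch only: run the existing image under dumb-init instead of Docker's bundled tini.
# Assumes a static dumb-init binary has been downloaded to the current directory.
sudo docker run --rm \
  -v "$PWD/dumb-init:/usr/local/bin/dumb-init:ro" \
  -v "$PWD/workspace:/home/iojs/workspace" \
  --entrypoint /usr/local/bin/dumb-init \
  rvagg/node-ci-containers:debian10 \
  /bin/bash -xec 'cd /home/iojs/workspace && . ./node-ci-exec.sh'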


rvagg commented Nov 4, 2019

The update on this is that I'm close to giving up. I can't get any level of parallelism out of the compiles without make locking up. It seems to get into a bad state where it's not doing a proper wait() on its forked children (always non-compile commands like mkdir and sh). There are some open issues on the make issue tracker regarding deadlock states that have gone unanswered for some time; maybe that's what we're hitting here.
It happens with enough regularity that this isn't going to be acceptable to collaborators, because they'll need to rerun. I've tried different init systems, different parallelism, and lots of combinations of Docker arguments and execution models (including docker run ... tail -f /dev/null followed by docker exec ... and docker stop, sketched below). The only thing that stops it is to not invoke make with parallelism (i.e. -j1 or no -j), which leads to slow compiles when the cache is clean.
I haven't tried a different host OS. I'm wondering if it's to do with the kernel, or maybe the Docker version plus the kernel; I might give that a go when I have time, but I'm all out of ideas.
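For reference, the "detach then exec" execution model mentioned above looks roughly like this illustrative sketch (image tag and paths are placeholders, not the exact CI invocation):

cid="$(sudo docker run -d --init \
  -v "$PWD/workspace:/home/iojs/workspace" \
  rvagg/node-ci-containers:debian10 \
  tail -f /dev/null)"                      # keep the container idling
sudo docker exec "$cid" /bin/bash -xec 'cd /home/iojs/workspace && make run-ci -j 2'
sudo docker stop "$cid" && sudo docker rm "$cid"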

@sam-github

Have you considered using ninja, or is that too much of a detour?


rvagg commented Nov 5, 2019

ninja works fine and is run as part of the build set (https://github.com/rvagg/io.js/blob/rvagg/testing-containers/.ci.yml#L250-L259), but it's only one part of it.

The point of this is to test build configurations on various platforms, and make is the standard build tool that we need to be testing. I'd even be tempted to ship a custom version of make if that fixed the issue, but even that would defeat part of the purpose here.

@john-yan

Hello @rvagg, I might be able to spend some time on this issue but I am not very familiar with the Node.js CI (I'm from the IBM V8 team). Would you be able to share some info on how this can be reproduced in a local environment? e.g. a Dockerfile? scripts?


rvagg commented Nov 28, 2019

Just tried this on Ubuntu 19.10, but it's the same thing; the build had to be killed after the timeout: https://ci.nodejs.org/view/All/job/rv-test-commit-linux-docker/label=docker-host-x64,linux_x64_container_suite=fedora30/196/console

@john-yan replicating this outside of Jenkins is a little awkward as all the scripting assumes being called from Jenkins. There may even be something specific about this being called from a Java process. I'll try to find some time to document a way to replicate it locally though, because it would be interesting to try to isolate it.
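In the meantime, a rough starting point might be the untested sketch below; it simply mirrors the invocation visible in the ps output earlier in the thread, and the image tag and the contents of node-ci-exec.sh are placeholders for what the Jenkins job generates.

# Untested local-reproduction sketch; image tag and script contents are placeholders.
git clone https://github.com/nodejs/node.git workspace
printf 'make run-ci -j 2\n' > workspace/node-ci-exec.sh
sudo docker run --init --rm \
  -v "$PWD/workspace:/home/iojs/workspace" \
  rvagg/node-ci-containers:fedora30 \
  /bin/bash -xec 'cd /home/iojs/workspace && . ./node-ci-exec.sh'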


rvagg commented Feb 11, 2020

FYI I haven't totally dropped this (yet). My next plan is to try different host distros: Debian, maybe Arch, maybe CentOS. I fear that the problems are more fundamental than the host, however; I'm just still surprised that this isn't something that others experience ("surely I'm missing something obvious!").


rvagg commented Jun 29, 2020

I've started again with this, trying Ubuntu 20.04 and the latest Docker with 🤞.

So far I haven't experienced any timeouts, but I've only done a few runs. Every suite except the linter fails on test-wasi on current master, in all the different types of containers (debian, ubuntu, centos, fedora, alpine). So I guess it's probably to do with the Docker host and its kernel, or some dependency that leaks into the container. The fact that this isn't showing up on our normal containered hosts makes me suspect as much, since we run 18.04 hosts there.

The closest thing I could find is: nodejs/node#31230

Here's what the output looks like:

Assertion failed: 0 == link(OLD, NEW) (c/link.c: main: 13)
/home/iojs/workspace/test/common/index.js:605
const crashOnUnhandledRejection = (err) => { throw err; };
                                             ^

RuntimeError: unreachable
    at <anonymous>:wasm-function[14]:0x4f4
    at <anonymous>:wasm-function[31]:0x2e45
    at <anonymous>:wasm-function[12]:0x3f4
    at _start (<anonymous>:wasm-function[11]:0x2b9)
    at WASI.start (wasi.js:122:15)
    at /home/iojs/workspace/test/wasi/test-wasi.js:32:10

assert.js:103
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

1 !== 0

    at runWASI (/home/iojs/workspace/test/wasi/test-wasi.js:60:12)
    at Object.<anonymous> (/home/iojs/workspace/test/wasi/test-wasi.js:81:3)
    at Module._compile (internal/modules/cjs/loader.js:1217:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1238:10)
    at Module.load (internal/modules/cjs/loader.js:1066:32)
    at Function.Module._load (internal/modules/cjs/loader.js:954:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 1,
  expected: 0,
  operator: 'strictEqual'
}

@cjihrig any chance I could get some help on this? We're not really doing anything more special than running the test suite in containers on a 20.04 host with the current Docker. The commit being run here is at nodejs/node#30057, which doesn't touch anything other than adding a file at the top level.


cjihrig commented Jun 29, 2020

The closest thing I could find is: nodejs/node#31230

I don't think this is related.

Just looking at the output, it's not clear why link() would be failing. It would be helpful to know what errno is after the link() call. That would require updating the test's C code and regenerating the WASM file (or adding debugging statements to Node itself).

I can take a look later tonight, but I'll need some way to reproduce it. Is there a CI job I can run that I can also point at my own fork?


rvagg commented Jun 30, 2020

Thanks @cjihrig, that would be appreciated. You can run it in https://ci.nodejs.org/view/All/job/rv-test-commit-linux-docker/ but you'll need to have the commits at the HEAD of nodejs/node#30057 in your branch too (you can squash them down if you like; there's a lot of fixup cruft in there, it's just a .ci.yml file and an entry in .gitignore).


rvagg commented Jun 30, 2020

I've tried to reproduce it on one of the 20.04 hosts, in a Docker container on one of them, and then in a detached container using exec, but I can't get the same error, even mounting the workspace. I've also tried reverting the docker run mode from the more complicated detach+exec to the simpler direct run, but I still get the same error through Jenkins. My current guess is that this is related to being attached through the Jenkins node's Java process tree; we've seen errors in the past that are only reproducible that way.


rvagg commented Jul 3, 2020

I thought it was working so well on the new 20.04 hosts, but alas, this happened just now in an alpine container:

13:35:14 Build timed out (after 60 minutes). Marking the build as failed.

in the middle of a compile. Something in make just can't handle this depth of job queueing while inside these containers, and I have absolutely no idea why that would be or how to solve it. So I think I'm going to bail on this whole experiment.

rvagg closed this Jul 3, 2020