Proposal for major changes to (and expansion of) containerised build system #1982
set -e

OPTIND=1
image_base="rvagg/node-ci-containers"
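For context, `OPTIND=1` is the conventional reset before a `getopts` loop, which is presumably what follows this fragment in the script. A minimal sketch of that pattern, with purely illustrative option names and tags (not the real script's interface):

```sh
#!/usr/bin/env bash
set -e

OPTIND=1
image_base="rvagg/node-ci-containers"   # would presumably move to a Node.js org namespace

# Illustrative option parsing only; the real script's flags may differ.
while getopts "i:t:" opt; do
  case "$opt" in
    i) image_base="$OPTARG" ;;          # override the image namespace
    t) image_tag="$OPTARG" ;;           # e.g. fedora30, alpine39 (hypothetical tags)
    *) echo "usage: $0 [-i image_base] [-t tag]" >&2; exit 1 ;;
  esac
done
shift $((OPTIND - 1))

echo "would run: docker run --rm ${image_base}:${image_tag:-latest}"
```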
We should probably create a user for the Node.js project and use that instead of rvagg, right?
I see your comment elsewhere about the path not being important, so it seems like you have already acknowledged we should do something on that front.

I'll need more time to review/think, but overall it looks interesting and an improvement over our current containerized builds. We (the Node.js team in IBM) have talked about how we might leverage Docker to achieve more diversity in the Community CI without having to have more machines for Linux on Z and Linux on P, and this looks like a good possibility.
To clarify, all references to rvagg repositories and image paths in here are temporary and not important to the proposal.

Also, when you're considering this, the approach is really portable outside of Node. We could easily do this same thing for libuv, node-gyp, and others. The Jenkins job is more complicated now than it needs to be because of test.tap and GitHub status reporting; strip that out and it's fairly basic.
Here's what we could do with the current node-test-commit-linux nodes, retaining only 4 nodes that are most commonly in use, relatively easy to maintain, and/or are used for release builds.

Moved to new containers:

Retired EOY with Node 8:

Retained:

CentOS 6 should be retired with Node 12, but we need to add CentOS 8 in the coming months.
I'm ironing out bugs and it's starting to work pretty well now. I also have configs for 12.x and 10.x and just need to add OpenSSL 1.0.2 for 8.x (seems pointless for the 2 months' worth of life left in 8.x, but it's not too hard to add). I have one outstanding problem I haven't resolved: with some randomness, the containers will stop compiling, roughly 1 in every 20 runs or so. I haven't seen it in the Alpine containers yet, but it's been distributed across the rest so far. In the middle of a compile it'll just stop, like this one. No errors, no warnings, it just stops and the build gets killed due to a timeout set in Jenkins.
With the 2 I've had these containers run with, I can't really recommend pushing this out if we're going to have random timeouts that look to collaborators like the compile was too slow. So I'm left scratching my head, wondering what to try next to sort this out.
Struggling to deal with hanging containers. I've seen it on all the varieties now. It seems to be related to the spawning of child processes.
This is the closest thing I've found so far describing the problem, although it got no feedback: https://forums.docker.com/t/docker-container-started-with-init-still-leaves-zombie-processes/54729

I'd rather not have to go that far. The only other alternative I haven't tried is a custom init.
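For anyone following along, the usual knobs on the reaping side of this are Docker's `--init` flag (which the linked thread suggests may not be enough on its own) or baking an explicit init such as tini into the image. A rough sketch only, with a placeholder image name and workspace path:

```sh
# Option 1: let Docker inject its bundled init (tini) as PID 1 so orphaned
# children get reaped (the linked thread suggests this alone may not be enough).
docker run --init --rm some-build-image \
  bash -c 'cd /workspace && make -j "$(nproc)"'

# Option 2: bake an explicit init into the image instead (Dockerfile fragment):
#   RUN apt-get update && apt-get install -y tini
#   ENTRYPOINT ["/usr/bin/tini", "--"]
```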
Update on this is that I'm close to giving up. I can't get any level of parallelism in the compiles without hitting these hangs.
Have you considered using ninja, or is that too much of a detour?
ninja works fine and is run as part of the build set (https://github.com/rvagg/io.js/blob/rvagg/testing-containers/.ci.yml#L250-L259) but is only one part of it. The point of this is to test build configurations on various platforms.
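For context, a ninja-based configuration in Node typically looks something like the sketch below; this is an illustration of one possible "execute" block, not the actual entries in the linked .ci.yml:

```sh
# Roughly what a ninja-based entry could execute (illustrative, not the real .ci.yml contents):
./configure --ninja          # generate ninja build files instead of Makefiles
ninja -C out/Release         # build with ninja's own parallelism
python tools/test.py -J default
```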
Hello @rvagg, I might be able to spend some time on this issue, but I am not very familiar with the Node.js CI (I'm from the IBM V8 team). Would you be able to share some info on how this can be reproduced in a local environment? e.g. a Dockerfile? scripts?
Just tried this on Ubuntu 19.10 but it's the same thing; it had to be killed after a timeout: https://ci.nodejs.org/view/All/job/rv-test-commit-linux-docker/label=docker-host-x64,linux_x64_container_suite=fedora30/196/console

@john-yan replicating outside of Jenkins is a little awkward as all the scripting assumes being called from Jenkins. There may even be something specific about this being called from a Java process. I'll try and find some time to document a separate process to replicate it locally though, because it would be interesting to try and isolate it.
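In the meantime, a very rough approximation of what the CI does for anyone wanting to poke at it locally; the image tag, workspace path and commands here are guesses for illustration, not the actual Jenkins scripting:

```sh
# Placeholder invocation; the real CI wiring (execute.sh parsing .ci.yml) is more involved,
# and the image tag and mount path are assumptions, not the actual CI values.
git clone https://github.com/nodejs/node.git
cd node
docker run --rm -v "$PWD:/workspace" -w /workspace \
  rvagg/node-ci-containers:fedora30 \
  bash -c './configure && make -j "$(nproc)" && make test-ci'
```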
FYI I haven't totally dropped this (yet). My next plan was to try different host distros: Debian, maybe Arch, maybe CentOS. I fear the problem is more fundamental than the host distro, however; I'm just still surprised that this isn't something that others experience ("surely I'm missing something obvious!").
I've started again with this, trying Ubuntu 20.04 and the latest Docker with 🤞. So far I haven't experienced any timeouts, but I've only done a few runs. Everything but the linter fails on test-wasi on current nodejs/node. The closest thing I could find is nodejs/node#31230. Here's what the output looks like:
@cjihrig any chance I could get some help on this? We're not really doing anything more special than running the test suite in containers on a 20.04 host with the current Docker. The commit being run here is the head of nodejs/node#30057, which doesn't touch anything other than adding a file at the top level.
I don't think this is related. Just looking at the output, it's not clear why. I can take a look later tonight, but I'll need some way to reproduce it. Is there a CI job I can run that I can also point at my own fork?
Thanks @cjihrig, that would be appreciated. You can run it at https://ci.nodejs.org/view/All/job/rv-test-commit-linux-docker/ but you'll need to have the commits at the HEAD of nodejs/node#30057 in your branch too (you can squash those down if you like; there's a lot of fixup cruft in there, it's just a .ci.yml file and an entry in .gitignore).
I've tried to reproduce it on one of the 20.04 hosts, in a docker container on one of them, and then in a detached container.
I thought it was working so well on the new 20.04 hosts, but alas, the same thing just happened on an Alpine container, in the middle of a compile. Something in make just giving up?
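If it happens again, it might be worth capturing some state from the hung container before Jenkins kills it. A sketch of the sort of inspection that could help; the container id and pid are placeholders:

```sh
# Identify the stuck container and inspect its process tree from the host side
docker ps
docker top <container-id>                     # any defunct/zombie children under make?

# See what the stalled process is actually blocked on (PID taken from docker top)
docker exec <container-id> cat /proc/<pid>/status
docker exec <container-id> cat /proc/<pid>/wchan
```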
This PR is part of a complex set of changes that I'm proposing. I have enough of this in place that it runs and I can demonstrate it, but not so much that it's baked in. None of this touches existing Jenkins config and it would be straightforward to undo the changes I've made.
Grab a coffee, I'd like your thoughts @nodejs/build, this is much more than just node-test-commit-linux-containered.
Basic proposal
I'm proposing that we introduce something that's more like Travis than what we have now for our containerised builds: control over what is run, and how, is external to Jenkins and mostly embedded in the nodejs/node repo itself. This means that the build configurations are specific to the release line and don't need VersionSelectorScript or any other mechanism to get involved. What's more, it's very simple to extend to encompass more than what we do in node-test-commit-linux-containered: we can move anything Linux x64 that we want into a container, potentially eliminating a lot of our VMs where we are prepared to move from VM testing to container testing.
In addition to possible simplification of our infrastructure ("simplification" may be in the eye of the beholder, I'd like some objective opinions on this), we put configs in the hands of the collaborators. Issues filed in this repository to add some new test configuration (e.g. `--ninja` or `--debug --enable-asan`) become unnecessary in most cases, as it would just mean editing a YAML file to add the new build and test commands in a new entry. Adding new distro versions (e.g. Alpine, Fedora, Ubuntu non-LTS) becomes much easier and could mostly happen without Build WG involvement if enterprising collaborators notice the need and step up.

Technical summary
1. Each actively tested branch of nodejs/node gets a `.ci.yml`. You can see it here: build,test: add .ci.yml for containered tests (WIP proposal) node#30057. This file contains a list of tests that need to be run for that branch; each test has an "image" (mapping to a Docker container), a "label" (used for reporting status back to the PR) and an "execute" block (with the required Bash to execute the test, which could be as little as `make run-ci -j $JOBS`). A sketch of what an entry might look like is shown after this list.
2. A new Jenkins job drives the whole thing:
   a. It uses a YAML Axis plugin to parse .ci.yml and use `CONTAINER_TEST` / `linux_x64_container_suite` as a matrix axis, so the number of jobs run is controlled by .ci.yml.
   b. It uses `curl -sL https://raw.githubusercontent.com/rvagg/node-ci-containers/master/execute.sh | bash -` to execute the tests, which is also fairly simple and mainly just pulls out pieces of .ci.yml for each test run.
   c. It also uses the "label" property of each test to report back to the GitHub Status API, which makes it quite descriptive, see build,test: add .ci.yml for containered tests (WIP proposal) node#30057 for example - it might be a bit messy when you add back in the normal `node-test-commit-*` though.
3. A small set of Docker host machines runs the containers:
   a. docker-node-exec.sh in this PR does the "run Docker" work and is the only thing the iojs user is allowed to `sudo`. It's patterned off a docker-node-exec.sh that we use on the Raspberry Pis and on the Scaleway ARMv7 machines (but in this case it `docker run`s, not `docker exec`s).
   b. Their workspaces become fatter, and we may need to control the disk usage more carefully than I currently am (perhaps a `git clean -fdx` at the end of each build).
   c. Together, the 4 hosts (maybe we add more when we free up space by removing some other VMs) and their ~ncpu/2 workers create a pool that new builds are distributed across. The size of this pool will be dictated by the branch on nodejs/node with the most complicated .ci.yml.
You can see more documentation in .ci.yml, execute.sh, docker-node-exec.sh, and the node-ci-containers repo. I've been pretty liberal with docs so this is as clear as I can make it.
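As an aside on the sudo restriction mentioned in 3a above, the usual way to express "the iojs user may run exactly one script as root" is a narrow sudoers entry. A sketch only, with an assumed filename and install path (the real deployment may differ):

```sh
# /etc/sudoers.d/docker-node-exec  -- hypothetical filename and script path
# Lets the unprivileged CI user run exactly this one script as root and nothing else.
iojs ALL=(root) NOPASSWD: /usr/local/bin/docker-node-exec.sh
```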
Complexity reduction?
I'm not eliminating as much as I'd like in this, but I believe I'm lessening the maintenance burden on Build by quite a bit. Some thoughts:
- … the `--worker` test runner in there.