
Benchmark machine maintenance #2656

Closed
aduh95 opened this issue May 24, 2021 · 36 comments

aduh95 (Contributor) commented May 24, 2021

Is there something wrong with the benchmark machine? For some reason it keeps stalling/hanging or maybe the connection to Jenkins is silently being severed? I've tried restarting the job a couple times now. Normally the async_hooks benchmarks should only take about 50 minutes or less on a decent modern machine....

Originally posted by @mscdex in nodejs/node#38785 (comment)

The above comment references https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1027/, https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1028/, https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1029/, ... It seems that any benchmark using http.createServer gets disconnected.
Also, the machine appears to be running Ubuntu 16.04, so maybe the problem is due to an update. Let me know if there's something I can do to help with that.

Linkgoron (Member) commented May 24, 2021

This also reproduces on my local machine (macOS Catalina)

Linkgoron (Member) commented May 24, 2021

I assume it's an issue similar to nodejs/node#36871. A timeout error is being emitted on the request in _test-double-benchmarker. A simple workaround that worked locally for me was to add an error handler on the http request (there is currently none; only the response has one), matching what was added to the http2 path a few months ago. I thought there might be an issue with the server.close, but at least for me awaiting it didn't solve the problem (I also think the error was emitted before the server closed).

IMO there's probably a real underlying issue in HTTP, and this is just putting a band-aid on the problem, but it would at least allow benchmarking again.
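
For illustration, here is a minimal standalone sketch of that band-aid (not the actual contents of _test-double-benchmarker.js; the local server and URL are stand-ins so the example runs on its own). The key line attaches an 'error' handler to the ClientRequest returned by http.get, so a timed-out or reset request no longer throws in the child process:

'use strict';
const http = require('http');

// Stand-in server so the sketch is self-contained.
const server = http.createServer((req, res) => res.end('ok'));

server.listen(0, () => {
  const url = `http://localhost:${server.address().port}/`;
  // `agent: false` only keeps this example from holding the socket open.
  const req = http.get(url, { agent: false }, (res) => {
    res.resume();
    res.on('end', () => server.close());
  });
  // The workaround: without this handler, an 'error' event on the request
  // (e.g. a socket timeout or ECONNRESET) is unhandled and crashes the
  // child benchmarker process.
  req.on('error', (err) => {
    console.error('request error (ignored):', err.code);
    server.close();
  });
});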

rvagg (Member) commented May 25, 2021

I've done a big update and cleanup on both benchmark machines, including clearing out workspaces and temp files (although I have a bad feeling I might have been too liberal with my removals because these machines have some very specific workflows that may be putting things into unexpected places 🤞). They've been rebooted so let's see if they behave any differently now.

We could upgrade to 18.04, but that might take input from maintainers of the benchmarking work - is a jump in OS likely to have any meaningful impact on benchmark numbers? Does it matter?

aduh95 (Contributor, Author) commented May 25, 2021

We could upgrade to 18.04, but that might take input from maintainers of the benchmarking work - is a jump in OS likely to have any meaningful impact on benchmark numbers? Does it matter?

I might be wrong, but it seems the (only?) benchmark CI that is run on nodejs/node PRs is benchmark-node-micro-benchmarks, which reports the relative perf difference that a PR introduces. In that case, bumping the OS should not be a problem. Could we go straight to 20.04 maybe?
While we're on the topic of benchmark CI maintenance, do you know if the script that spawns the benchmark CI run is in this repo? I couldn't find it; I'd like to change it to do a shallow git clone instead of a full one.

They've been rebooted so let's see if they behave any differently now.

I've spawned https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1030/; let's see how it performs.

rvagg (Member) commented May 25, 2021

Options are in https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/configure if you have access to see what's going on.

It does a git clone https://github.com/nodejs/benchmarking.git - is that the one you want to be shallow? Beyond that it runs . benchmarking/experimental/benchmarks/community-benchmark/run.sh from that clone so any further git operations are executed from that script.

aduh95 (Contributor, Author) commented May 25, 2021

Thanks for the info; the script I was looking for is this one: https://github.com/nodejs/benchmarking/blob/master/experimental/benchmarks/community-benchmark/run.sh. The repo is read-only, so I've asked in nodejs/TSC#822 (comment) where we can move this script.

I've spawned https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1030/; let's see how it performs.

The job seems to be stuck again; it hasn't produced any output in the last hour...

rvagg (Member) commented May 25, 2021

root@test-nearform--intel-ubuntu1604-x64-2:~# ps auxww | grep ^iojs
iojs       1900  0.8  0.3 9239832 210320 ?      Ssl  04:14   4:08 /usr/bin/java -Xmx128m -jar /home/iojs/slave.jar -jnlpUrl https://ci.nodejs.org/computer/test-nearform_intel-ubuntu1604-x64-2/slave-agent.jnlp -secret 1c49efc3534392967acc0521aa0d82594c7ce9e54bdff5ff6df669e8337ce79d
iojs      11016  0.0  0.0  11356  3188 ?        S    09:17   0:00 bash -xe /tmp/jenkins7987259599734232559.sh
iojs      95339  0.0  0.0 596216 40300 ?        Sl   09:29   0:01 ./node-master benchmark/compare.js --old ./node-master --new ./node-pr -- async_hooks
iojs      95340  0.0  0.0   6012   664 ?        S    09:29   0:00 tee output250521-092939.csv
iojs     133265  0.0  0.0 331720 34016 ?        Sl   09:57   0:00 ./node-master /w/bnch-comp/node/benchmark/async_hooks/async-resource-vs-destroy.js
iojs     133288  0.1  0.0 670556 55736 ?        Sl   09:57   0:09 /w/bnch-comp/node/node-master /w/bnch-comp/node/benchmark/async_hooks/async-resource-vs-destroy.js n=1000000 duration=5 connections=500 path=/ asyncMethod=callbacks type=async-resource

sxa (Member) commented May 25, 2021

I had a look on the machine and it looked like your job had indeed "stalled" by some definition - the load on the machine was effectively zero. While it was going I attempted to initiate another run from another user account, and it looks like that caused a port conflict in your job, which made it end. This suggests that your job was, in fact, still progressing in some fashion, even though it wasn't visibly using much CPU or producing additional output, so it's possible it would have eventually run to completion.

12:15:26 node:events:371
12:15:26       throw er; // Unhandled 'error' event
12:15:26       ^
12:15:26 
12:15:26 Error: listen EADDRINUSE: address already in use :::12346
12:15:26     at Server.setupListenHandle [as _listen2] (node:net:1306:16)
12:15:26     at listenInCluster (node:net:1354:12)
12:15:26     at Server.listen (node:net:1441:7)
12:15:26     at main (/w/bnch-comp/node/benchmark/async_hooks/async-resource-vs-destroy.js:175:6)
12:15:26     at /w/bnch-comp/node/benchmark/common.js:42:9
12:15:26     at processTicksAndRejections (node:internal/process/task_queues:78:11)
12:15:26 Emitted 'error' event on Server instance at:
12:15:26     at emitErrorNT (node:net:1333:8)
12:15:26     at processTicksAndRejections (node:internal/process/task_queues:83:21) {
12:15:26   code: 'EADDRINUSE',
12:15:26   errno: -98,
12:15:26   syscall: 'listen',
12:15:26   address: '::',
12:15:26   port: 12346
12:15:26 }
12:15:26 ++ cat output250521-092939.csv
12:15:26 ++ Rscript benchmark/compare.R
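
For illustration, here is a tiny standalone reproducer of that failure mode (hypothetical code, not the benchmark itself): a second server binding the same fixed port the benchmark uses, where the second listen() fails with EADDRINUSE.

'use strict';
const http = require('http');
// Port taken from the log above; the real conflict was between two
// separate benchmark processes rather than one process as sketched here.
const PORT = 12346;

const first = http.createServer(() => {});
first.listen(PORT, () => {
  const second = http.createServer(() => {});
  // Without this handler, the EADDRINUSE is an unhandled 'error' event
  // and throws, which is what ended the job in the log above.
  second.on('error', (err) => {
    console.error(err.code); // EADDRINUSE
    first.close();
  });
  second.listen(PORT);
});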

I've re-initiated your job as https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1031/console which is now running on the -1 performance machine (since I marked -2 offline for now) and we'll see how that one progresses and whether it succumbs to the same sort of stalling. I'll continue with some experiments on the -2 machine for now ...

I would definitely be in favour of upgrading the machine to 20.04 in principle, although the two benchmarking machines are of a specific type so we'd need to be certain that they could be upgraded cleanly...

targos (Member) commented May 25, 2021

@aduh95 Have you tried to run the benchmarks locally? Maybe something is broken and they cannot end.

aduh95 (Contributor, Author) commented May 25, 2021

For some reason the benchmarks involving http cannot even start on my machine (probably some issue with my config), but I think @mcollina is able to run them on a personal server.

targos (Member) commented May 25, 2021

Linkgoron (Member) commented May 25, 2021

Just to clarify what I stated earlier: at least locally, I see similar issues with the http-server benchmark in the async_hooks benchmarks. There are two issues that I see: one, the benchmark doesn't wait for the server to close before starting the next benchmark; the other, we sometimes get timeouts on the request, which emit an 'error' event on the request in the child process and cause problems.
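
As a rough sketch of the first point (not the actual harness in benchmark/common.js; the helper name and promisified close are illustrative), the idea is to wait for the server's close callback before kicking off the next run:

'use strict';
const http = require('http');
const { once } = require('events');

async function runOneBenchmark() {
  const server = http.createServer((req, res) => res.end('ok'));
  server.listen(0);
  await once(server, 'listening');

  // ... drive the benchmark requests against the server here ...

  // Resolve only once the server has actually closed; otherwise the next
  // iteration can start while the previous server still holds its port.
  await new Promise((resolve, reject) => {
    server.close((err) => (err ? reject(err) : resolve()));
  });
}

runOneBenchmark();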

aduh95 (Contributor, Author) commented May 25, 2021

It's the async_hooks benchmarks that hang in https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1031/console

Only the async_hooks benchmarks that involve http hang (benchmark/async_hooks/async-resource-vs-destroy.js and benchmark/async_hooks/http-server.js). I've noticed the same behaviour with benchmark/http/cluster.js and benchmark/http/simple.js.

aduh95 (Contributor, Author) commented Jun 13, 2021

Is there something I can do to help this move forward? I think upgrading to 20.04 would be a good first step, even if it doesn't solve the stalling issue.

targos (Member) commented Jun 22, 2021

Also happy to help, if I can. #2681 is probably blocked on this.

rvagg (Member) commented Jun 23, 2021

OK, so upgrading to 20.04 is relatively straightforward to do, but we need to make allowance for mess-ups. I'm happy to do the work but need to be told when's a good day to be doing it. My midday is usually downtime for everyone else (1pm for me now, 3am UTC), but is there a good day of the week for this to happen? Or is any day good enough for this machine? I don't really know what its usage pattern is these days or how critical it might be.

mhdawson (Member) commented:

@rvagg I think there are 2 benchmark machines at NearForm: https://ci.nodejs.org/label/benchmark-ubuntu1604-intel-64/. I'm thinking we could do one at a time to limit the impact. Most of the jobs seem to run on the same -2 machine, likely because one machine can handle the typical load in terms of concurrent requests.

aduh95 (Contributor, Author) commented Jul 21, 2021

@rvagg would it be possible for you to do the upgrade after the next v14.x release (scheduled for 2021-07-27)? In case something goes wrong, that should give us more time to fix or roll back before the next release.

rvagg (Member) commented Jul 22, 2021

sure, no problems, you might have to remind me though @aduh95, I'd already forgotten about this

dany-on-demand commented:

Is this issue why https://benchmarking.nodejs.org/ hasn't seen an update since Dec 2020?

richardlau (Member) commented:

@dany-on-demand No, it isn't. https://benchmarking.nodejs.org/ hasn't been updated since the Benchmarking Working Group was wound down due to lack of participation (nodejs/TSC#822).

mhdawson (Member) commented:

I still have it on my list to remove benchmarking.nodejs.org; I'll try to take another look this week.

aduh95 (Contributor, Author) commented Aug 11, 2021

@rvagg now that the security releases are out, would it be a good time for you to take care of the update? (No release is planned until 2021-08-17 in the nodejs/release repo.)

aduh95 (Contributor, Author) commented May 22, 2022

Can we try to get back to that? I think we can upgrade straight to Ubuntu 22.04 now.

github-actions (bot) commented:

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Mar 19, 2023
@targos targos self-assigned this Apr 7, 2023
targos (Member) commented Apr 7, 2023

I just took https://ci.nodejs.org/manage/computer/test-nearform_intel-ubuntu1804-x64-2/ offline and will try to upgrade it to Ubuntu 22.04.

targos (Member) commented Apr 7, 2023

Here's what I did:

ssh test-nearform_intel-ubuntu1804-x64-2
apt update
apt upgrade
apt dist-upgrade
do-release-upgrade
# Follow the prompts to install Ubuntu 20.04, restart
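# (do-release-upgrade only moves one LTS at a time, hence the second pass to 22.04 below)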

ssh test-nearform_intel-ubuntu1804-x64-2
apt update
apt upgrade
apt autoremove

# later
ssh test-nearform_intel-ubuntu1804-x64-2
do-release-upgrade
# Follow the prompts to install Ubuntu 22.04, restart

# During update:
# WARNING: PV /dev/sda5 in VG nodebench02-vg is using an old PV header, modify the VG to update.
ssh test-nearform_intel-ubuntu1804-x64-2
pvs
vgck --updatemetadata nodebench02-vg
reboot

ssh test-nearform_intel-ubuntu1804-x64-2
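# work around the broken DNS resolution described below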
resolvectl dns eno1 1.1.1.1 8.8.8.8
apt update
apt upgrade
apt autoremove
exit

# Updated the node in the inventory and in Jenkins
ansible-playbook --limit test-nearform_intel-ubuntu2204-x64-2 ansible/playbooks/jenkins/worker/create.yml
ssh test-nearform_intel-ubuntu2204-x64-2
systemctl daemon-reload
systemctl restart jenkins

System is up and running, but there seems to be an issue with DNS resolution. I'm not sure what to do:

# curl https://github.com
curl: (6) Could not resolve host: github.com

Edit: Fixed it with resolvectl dns eno1 1.1.1.1 8.8.8.8

targos (Member) commented Apr 7, 2023

Trying node-test-commit-v8-linux

richardlau (Member) commented:

Trying node-test-commit-v8-linux

Suspect that will fail because of #3206.

targos (Member) commented Apr 7, 2023

Something aborted the build 🤔

targos (Member) commented Apr 7, 2023

richardlau (Member) commented:

Another one is running though: https://ci.nodejs.org/job/node-test-commit-v8-linux/nodes=benchmark-ubuntu2204-intel-64,v8test=v8test/5282/console

That's from me adding a request-ci label to nodejs/node#47239 (comment). That shouldn't normally cause existing builds to be aborted though.

targos (Member) commented Apr 7, 2023

BTW, while monitoring the Ubuntu updates, the download speed seemed quite slow, and that is also the case with git clones in these runs.

@github-actions github-actions bot removed the stale label Apr 8, 2023
targos added a commit to targos/nodejs-build that referenced this issue Apr 8, 2023
targos added a commit to targos/nodejs-build that referenced this issue Apr 8, 2023
targos (Member) commented Apr 10, 2023

Doing the other machine now.

targos (Member) commented Apr 10, 2023

Finished

@targos targos closed this as completed Apr 10, 2023