Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coverage stats have plummeted #37368

Closed
Trott opened this issue Feb 14, 2021 · 16 comments
Closed

coverage stats have plummeted #37368

Trott opened this issue Feb 14, 2021 · 16 comments

Comments

@Trott
Copy link
Member

Trott commented Feb 14, 2021

Not sure what happened but both coverage.nodejs.org and codecov.io show our code coverage plummeting recently.

image

Not sure if this is a problem in our code coverage task in Makefile/vcbuild.bat or if it's a build/infra problem or what.

@bcoe @nodejs/build

@richardlau
Copy link
Member

I've rerun https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/ against 6ea9af9906cd74ed and this has returned coverage numbers in line with the expected:
https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/785/nodes=benchmark/console

18:16:58 ++ grep -B1 Lines coverage/index.html
18:16:58 ++ head -n1
18:16:58 ++ grep -o '[0-9\.]*'
18:16:58 + JSCOVERAGE=96.83
18:16:58 ++ grep -A3 Lines coverage/cxxcoverage.html
18:16:58 ++ grep style
18:16:58 ++ grep -o '[0-9]\{1,3\}\.[0-9]\{1,2\}'
18:16:58 + CXXCOVERAGE=89.0

this compares closely to the original coverage build for 6ea9af9906cd74ed in https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/779/nodes=benchmark/console of

04:11:33 ++ grep -B1 Lines coverage/index.html
04:11:33 ++ head -n1
04:11:33 ++ grep -o '[0-9\.]*'
04:11:33 + JSCOVERAGE=96.86
04:11:33 ++ grep -A3 Lines coverage/cxxcoverage.html
04:11:33 ++ grep style
04:11:33 ++ grep -o '[0-9]\{1,3\}\.[0-9]\{1,2\}'
04:11:33 + CXXCOVERAGE=89.0

(I don't know why the numbers aren't exactly the same)

Starting with 88d9268d087eea0d we appear to be getting an error during compilation:
https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/780/nodes=benchmark/consoleFull

04:10:47 tools/v8_gypfiles/v8_base_without_compiler.target.mk:780: recipe for target '/home/iojs/build/workspace/node-test-commit-linux-coverage-daily/nodes/benchmark/out/Release/obj.target/tools/v8_gypfiles/libv8_base_without_compiler.a' failed
04:10:47 make[3]: *** [/home/iojs/build/workspace/node-test-commit-linux-coverage-daily/nodes/benchmark/out/Release/obj.target/tools/v8_gypfiles/libv8_base_without_compiler.a] Error 1
04:10:47 make[3]: *** Waiting for unfinished jobs....
04:10:47 rm 81bcec8e3ea789c39023d21af4c2ed088c7cb407.intermediate 188683238e11018dbaa82380de9ee7bd274c680c.intermediate 074f577f56908e1a7d53a6ab0be7697f5cd089e7.intermediate
04:10:47 Makefile:104: recipe for target 'node' failed
04:10:47 make[2]: *** [node] Error 2
04:10:47 Makefile:319: recipe for target 'test-cov' failed
04:10:47 make[1]: *** [test-cov] Error 2
04:10:47 Makefile:235: recipe for target 'coverage-test' failed
04:10:47 make: [coverage-test] Error 2 (ignored)
04:10:47 make coverage-report-js

Commits between 6ea9af9...88d9268

$ git log --oneline 6ea9af9906cd74ed...88d9268d087eea0d

88d9268 errors: align source-map stacks with spec
2272713 child_process: fix bad abort signal leak
5bf2737 tools: avoid pending deprecation in doc generator
521c08d module: make synthetic module evaluation steps return a Promise to support top level await
c65e492 doc: refactor fs docs structure
aecd5eb module: only set cache when finding module succeeds
f2c2615 doc: fix description of hasSubscribers
fa3997d test: mark test-return-on-exit as flaky
896ae96 test: mark WASI's test-return-on-exit as flaky
79da253 tools: update V8 gypfiles for 8.8
530ef91 deps: V8: cherry-pick 0c8b6e415c30
73e0245 deps: V8: cherry-pick fe191e8d05cc
577ff9f deps: V8: cherry-pick deb0813166f3
00e1c7e deps: V8: cherry-pick 9a6a22874c81
ee01d6b deps: V8: cherry-pick 2059ee813359
31a46f8 deps: V8: cherry-pick dfcdf7837e23
a74b769 deps: V8: backport 4bf051d536a1
2dad8d4 deps: V8: cherry-pick bde7ee5473d6
3046131 deps: V8: cherry-pick 9a712984025e
d178d07 deps: V8: cherry-pick 0b96e5b0bfb2
5c71ea1 deps: V8: cherry-pick fbb28902e049
5c4be11 deps: V8: cherry-pick 86991d0587a1
c8e15cd deps: V8: cherry-pick 821fb3883a8e
b0d6742 deps: workaround stod() limitations on SmartOS
c8a658a deps: fix V8 build issue with inline methods
153b8ce deps: patch V8 to run on Xcode 8
a785984 deps: V8: silence irrelevant warnings
246c9b8 deps: make v8.h compatible with VS2015
96a567f deps: V8: forward declaration of Rtl*FunctionTable
e74383c deps: V8: patch register-arm64.h
732847f deps: patch V8 to run on older XCode versions
70171d1 deps: V8: un-cherry-pick bd019bd
5259d17 src: update NODE_MODULE_VERSION to 91
8423898 build: reset embedder string to "-node.0"
c7b3292 deps: update V8 to 8.8.278.17

Perhaps it's the V8 8.8 update?

@richardlau
Copy link
Member

richardlau commented Feb 16, 2021

Build against fa3997d also shows the coverage drop:
https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/786/nodes=benchmark/console

19:52:18 ++ grep -B1 Lines coverage/index.html
19:52:18 ++ head -n1
19:52:18 ++ grep -o '[0-9\.]*'
19:52:18 + JSCOVERAGE=47.43
19:52:18 ++ grep -A3 Lines coverage/cxxcoverage.html
19:52:18 ++ grep style
19:52:18 ++ grep -o '[0-9]\{1,3\}\.[0-9]\{1,2\}'
19:52:18 + CXXCOVERAGE=38.4

I'm not sure it's easily possible to narrow down the commits further as the remaining range was the V8 8.8 update.

@bcoe
Copy link
Contributor

bcoe commented Feb 17, 2021

Looking at the GitHub Actions, they seem healthy:

https://github.com/nodejs/node/pull/37402/checks?check_run_id=1913015747
https://github.com/nodejs/node/pull/37406/checks?check_run_id=1913893666

Do we think the low coverage is due to the compilation issue, vs., a regression with our coverage collection?

@richardlau
Copy link
Member

Do we think the low coverage is due to the compilation issue, vs., a regression with our coverage collection?

It's the compilation issue. Specifically it appears to be failing to compile during the recursive make call to jstest

node/Makefile

Line 323 in 9b32762

CI_SKIP_TESTS=$(COV_SKIP_TESTS) $(MAKE) jstest
which means no JavaScript tests are run. It looks like we did manage to successfully build node earlier in the build and use it to build the addons (i.e.

node/Makefile

Lines 319 to 322 in 9b32762

$(MAKE) build-addons
$(MAKE) build-js-native-api-tests
$(MAKE) build-node-api-tests
$(MAKE) cctest
succeeded) which is probably why there's any coverage at all.

The GitHub Actions coverage workflow isn't using the coverage target

run: make build-ci -j2 V=1 CONFIG_FLAGS="--error-on-warn --coverage"
# TODO(bcoe): fix the couple tests that fail with the inspector enabled.
# The cause is most likely coverage's use of the inspector.
- name: Test
run: NODE_V8_COVERAGE=coverage/tmp make test-cov -j2 V=1 TEST_CI_ARGS="-p dots" || exit 0
which I think eliminates a layer or two of make recursiveness.

@targos
Copy link
Member

targos commented Feb 17, 2021

Is make coverage enough to reproduce it?

@aduh95
Copy link
Contributor

aduh95 commented Feb 17, 2021

Looking at the GitHub Actions, they seem healthy:

https://github.com/nodejs/node/pull/37402/checks?check_run_id=1913015747
https://github.com/nodejs/node/pull/37406/checks?check_run_id=1913893666

The fact that tests are now taking several hours to complete on master doesn't seem very healthy (4h on master, only 10 minutes on the v15.x branch). May this be related? Maybe some kind of timeout?

@richardlau
Copy link
Member

Is make coverage enough to reproduce it?

It's what https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/ runs (to be exact, it runs

NODE_TEST_DIR=${HOME}/node-tmp PYTHON=python make coverage -j $(getconf _NPROCESSORS_ONLN)

I tried throttling the number of jobs ($(getconf _NPROCESSORS_ONLN)) from 88 to 2 but that hasn't made any difference.

I've not been able to reproduce locally (Fedora 33, gcc/g++ 10).

@richardlau
Copy link
Member

I logged into test-nearform--intel-ubuntu1604-x64-1 and ran the coverage build steps in the workspace for https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/ manually and... didn't recreate the problem we're seeing (i.e. no compilation problem, the JS test suites ran and we ended up with

Javascript coverage %: 96.43%
C++ coverage %: 89.2%

🤔

I'll try restarting (/upgrading) the Jenkins agent on the machine.

@richardlau
Copy link
Member

I'll try restarting (/upgrading) the Jenkins agent on the machine.

I ran the playbooks/jenkins/worker/upgrade-jar.yml playbook against the machine to update and restart the Jenkins agent (from agent version 3.33 to 4.5) but that hasn't made a difference --
https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/793/nodes=benchmark/console still errors during the jstest target.

The playbooks/jenkins/worker/create.yml playbook fails to run:

$  ansible-playbook playbooks/jenkins/worker/create.yml -l test-nearform_intel-ubuntu1604-x64-1
[DEPRECATION WARNING]: DEFAULT_SQUASH_ACTIONS option, Loop squashing is deprecated and this configuration will no longer be used , use a list directly with
 the module argument instead. This feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
[DEPRECATION WARNING]: The use of 'static' has been deprecated. Use 'import_tasks' for static inclusion, or 'include_tasks' for dynamic inclusion. This
feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.

PLAY [test,release,infra-softlayer-ubuntu1404-x64-2,!*-win*] ***********************************************************************************************

TASK [check if secret is properly set] *********************************************************************************************************************
ok: [test-nearform_intel-ubuntu1604-x64-1]

TASK [run os-specific bootstrap] ***************************************************************************************************************************
included: /home/rlau/sandbox/github/build/ansible/roles/bootstrap/tasks/partials/ubuntu1604.yml for test-nearform_intel-ubuntu1604-x64-1

TASK [bootstrap : check for python] ************************************************************************************************************************
changed: [test-nearform_intel-ubuntu1604-x64-1]

TASK [bootstrap : check for aptitude] **********************************************************************************************************************
changed: [test-nearform_intel-ubuntu1604-x64-1]

TASK [bootstrap : install python and aptitude] *************************************************************************************************************
skipping: [test-nearform_intel-ubuntu1604-x64-1]

TASK [run raspberry pi bootstrap] **************************************************************************************************************************
skipping: [test-nearform_intel-ubuntu1604-x64-1]

TASK [package-upgrade : include package manager tasks] *****************************************************************************************************
included: /home/rlau/sandbox/github/build/ansible/roles/package-upgrade/tasks/partials/apt.yml for test-nearform_intel-ubuntu1604-x64-1

TASK [package-upgrade : upgrade installed packages] ********************************************************************************************************
[DEPRECATION WARNING]: Distribution Ubuntu 16.04 on host test-nearform_intel-ubuntu1604-x64-1 should use /usr/bin/python3, but is using /usr/bin/python for
 backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See
https://docs.ansible.com/ansible/2.9/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12.
Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
fatal: [test-nearform_intel-ubuntu1604-x64-1]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "msg": "Failed to update apt cache: "}

PLAY RECAP *************************************************************************************************************************************************
test-nearform_intel-ubuntu1604-x64-1 : ok=5    changed=2    unreachable=0    failed=1    skipped=2    rescued=0    ignored=0

$

@richardlau
Copy link
Member

The playbooks/jenkins/worker/create.yml playbook fails to run:

Failing to update packages via ansible on test-nearform_intel-ubuntu1604-x64-1 was already reported in nodejs/build#2531.

@mhdawson
Copy link
Member

There was an issue to switch to codecov.io (subject to figuring out some access issues). I think the only reason the job runs on a specific machine is so that we can generate/transfer the results to the machine were the data is pulled for coverage.nodejs.org. If we believe we are ready to move to codecov.io only, maybe we just let the job run on a different machine and stop using coverage.nodejs.org ?

@richardlau
Copy link
Member

Summary so far:

  • Problem occurs after the V8 8.8 update and is consistent on the CI on test-nearform_intel-ubuntu1604-x64-1 -- the node binary gets rebuilt several times and during the make jstest step it fails to rebuild V8.
  • It doesn't happen on test-nearform_intel-ubuntu1604-x64-1 when built manually. Nor on two local Fedora 33/g++ 10 installs.

I believe it's also related to #36139 (comment) and #37429 in that when the make jstest part of the build fails it is recompiling V8 again where it did not previous to the V8 8.8 update.

@bcoe
Copy link
Contributor

bcoe commented Feb 22, 2021

If we believe we are ready to move to codecov.io only, maybe we just let the job run on a different machine and stop using coverage.nodejs.org

I've been itching for us to get to this point, because it does mean we don't have to maintain the bespoke coverage machines.

Also, I think it's worth noting that it's much easier for folks to run coverage locally now.

Even though coverage takes several hours (40% slower than the normal test run right now) it still means we're getting regular metrics shipped to codecov.io.


I still think there are some hiccups to sort out on codecov.io before I'd fully endorse it mind you, e.g., lack of commit context:

Screen Shot 2021-02-22 at 9 22 15 AM

@mhdawson
Copy link
Member

It doesn't happen on test-nearform_intel-ubuntu1604-x64-1 when built manually. Nor on two local Fedora 33/g++ 10 installs.

What is the difference between running manually and running the the code coverage jenkins job? I would have thought they should be doing the same thing + some extra scripting to capture and provide the data.

@richardlau
Copy link
Member

It doesn't happen on test-nearform_intel-ubuntu1604-x64-1 when built manually. Nor on two local Fedora 33/g++ 10 installs.

What is the difference between running manually and running the the code coverage jenkins job? I would have thought they should be doing the same thing + some extra scripting to capture and provide the data.

Running under the Jenkins agent? The job doesn't get to the extra scripting as it's failing during make coverage (but doesn't when I run the same command on the host).

In any case #37502 avoids the unnecessary V8 rebuilds and the coverage job with that no longer fails to build during the jstest part of the build and we get the expected coverage figures:
e.g. https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/803/nodes=benchmark/console

14:48:37 ++ grep -B1 Lines coverage/index.html
14:48:37 ++ head -n1
14:48:37 ++ grep -o '[0-9\.]*'
14:48:37 + JSCOVERAGE=96.77
14:48:37 ++ grep -A3 Lines coverage/cxxcoverage.html
14:48:37 ++ grep style
14:48:37 ++ grep -o '[0-9]\{1,3\}\.[0-9]\{1,2\}'
14:48:37 + CXXCOVERAGE=89.2

@richardlau
Copy link
Member

richardlau commented Feb 24, 2021

In any case #37502 avoids the unnecessary V8 rebuilds and the coverage job with that no longer fails to build during the jstest part of the build and we get the expected coverage figures:

Or #37505, which cherry-picks a fix for torque from upstream V8:

https://ci.nodejs.org/job/node-test-commit-linux-coverage-daily/805/nodes=benchmark/console

19:06:14 ++ grep -B1 Lines coverage/index.html
19:06:14 ++ head -n1
19:06:14 ++ grep -o '[0-9\.]*'
19:06:14 + JSCOVERAGE=96.76
19:06:14 ++ grep -A3 Lines coverage/cxxcoverage.html
19:06:14 ++ grep style
19:06:14 ++ grep -o '[0-9]\{1,3\}\.[0-9]\{1,2\}'
19:06:14 + CXXCOVERAGE=89.2

@aduh95 aduh95 closed this as completed in 448de26 Feb 25, 2021
targos pushed a commit that referenced this issue Feb 28, 2021
Original commit message:

    [torque] Don't replace unmodified empty files

    To improve incremental builds.

    Bug: v8:7793
    Change-Id: I6990a97e058d22d34acd1f609167cd30ca7518ad
    Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2596789
    Reviewed-by: Nico Hartmann <nicohartmann@chromium.org>
    Commit-Queue: Seth Brenith <seth.brenith@microsoft.com>
    Cr-Commit-Position: refs/heads/master@{#72053}

Refs: v8/v8@373f4ae

PR-URL: #37505
Fixes: #37368
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Jiawen Geng <technicalcute@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants