
test: fix flaky test-worker-prof #37372

Merged 1 commit into nodejs:master from fix-test-worker-prof on Feb 20, 2021

Conversation

@Trott Trott commented Feb 14, 2021

Fixes: #26401

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Feb 14, 2021
@Trott Trott force-pushed the fix-test-worker-prof branch 2 times, most recently from 79cb50a to 311bfc1 on February 15, 2021 01:02

@targos targos added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 15, 2021
@Trott Trott removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 15, 2021

Trott commented Feb 15, 2021

I removed the author ready label because I want to run some stress tests to make sure this really fixes things.


Trott commented Feb 16, 2021

@gireeshpunathil @addaleax


Trott commented Feb 17, 2021

Stress tests show it happening less often, but it is still happening. I'm going to move this to draft while I investigate more.

@Trott Trott marked this pull request as draft February 17, 2021 03:21

Trott commented Feb 17, 2021

Extending the timeout is the wrong solution. It's not that the test needs more time. It's that there is a race condition so the child process never exits.

@gireeshpunathil (Member) commented:

@Trott - from my earlier observation (I will search and find out where and what it was), I guess it has to do with CPU starvation on systems with very few CPUs:

  • if the worker thread grabs the CPU, it uses the CPU slice in its entirety, so the parent does not get enough ticks
  • if the parent grabs the CPU, it does the same
  • so in either case, the profiler ticks for one of the threads were insufficient
  • I modified this by doing a ping-pong where both threads are forced to use the CPU to complete the work
  • however, from the TIMEOUT, it looks like the work becomes too heavy to handle within the timeout of the harness on certain systems

Do you know how many CPUs the failing system has?

I think

    fs.readFileSync(m.toString()).slice(0, 1024 * 1024)

is the key. We are reading the whole node executable file and converting it into a single char array, which may prove too slow at times. Should we either read a small part of it, or change process.execPath to __filename? WDYT?
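
(For context, a minimal, hypothetical sketch of the ping-pong shape being described: a parent thread and a worker bounce a file path back and forth, and each side does the same read-and-slice work per round so that both threads record profiler ticks. The names, the round count, and the use of __filename below are illustrative assumptions, not the actual test-worker-prof.js code.)

    // Hypothetical sketch, not the real test: parent and worker ping-pong a
    // file path; each side does the same CPU-bound read-and-slice per round,
    // so both threads accumulate V8 profiler ticks.
    'use strict';
    const fs = require('fs');
    const { Worker, isMainThread, parentPort } = require('worker_threads');

    const ROUNDS = 10; // the ping-pong count discussed later in this thread

    function doWork(file) {
      return fs.readFileSync(file).slice(0, 1024 * 1024);
    }

    if (isMainThread) {
      const worker = new Worker(__filename);
      let rounds = 0;
      worker.on('message', (file) => {
        doWork(file);
        if (++rounds < ROUNDS) worker.postMessage(__filename);
        else worker.terminate();
      });
      worker.postMessage(__filename); // kick off the first round
    } else {
      parentPort.on('message', (file) => {
        doWork(file);
        parentPort.postMessage(file); // bounce it back to the parent
      });
    }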

@gireeshpunathil (Member) commented:

ok, these are the relevant comments where I had observations on the CPU starvation patterns:
#26401 (comment)
#26401 (comment)


Trott commented Feb 17, 2021

> Should we either read a small part of it, or change the process.execPath to __filename? WDYT?

Let's try changing process.execPath to __filename. (If that doesn't work, I wonder if skipping the test when common.enoughTestCpu is false will work.)
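
(As an aside, a skip guard of the kind mentioned here would look roughly like the sketch below. It relies on the existing common.enoughTestCpu flag and common.skip() helper from test/common; it is an illustration of the fallback idea, not a change that is part of this PR.)

    // Sketch only: bail out early on machines the test harness considers too
    // slow or undersized for CPU-heavy tests, instead of letting them flake.
    const common = require('../common');

    if (!common.enoughTestCpu)
      common.skip('test is CPU-intensive and flaky on slow machines');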

@Trott Trott marked this pull request as ready for review February 17, 2021 04:27

Trott commented Feb 17, 2021

Stress test with this PR: https://ci.nodejs.org/job/node-stress-single-test/215/


Trott commented Feb 17, 2021

> Should we either read a small part of it, or change the process.execPath to __filename? WDYT?
>
> Let's try changing process.execPath to __filename. (If that doesn't work, I wonder if skipping the test when common.enoughTestCpu is false will work.)

Test failed on macOS in CI because there were only 8 ticks rather than more than 15.

    // Test that at least 15 ticks have been recorded for both parent and child
    // threads. When not tracking Worker threads, only 1 or 2 ticks would
    // have been recorded.
    // When running locally on x64 Linux, this number is usually at least 200
    // for both threads, so 15 seems like a very safe threshold.
    assert(ticks >= 15, `${ticks} >= 15`);
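
(For readers unfamiliar with where that ticks number comes from: the child runs under --prof, which makes V8 write isolate-*.log files with one tick entry per profiler sample. A rough way to count them, as a sketch rather than the test's exact parsing code:)

    // Sketch: count raw profiler samples in a V8 --prof isolate log.
    // The real test's log handling may differ; this only shows the idea.
    const fs = require('fs');

    function countTicks(logPath) {
      return fs.readFileSync(logPath, 'utf8')
        .split('\n')
        .filter((line) => line.startsWith('tick,'))
        .length;
    }

    // e.g. countTicks('isolate-0xnnnnnnnnnnnn-12345-v8.log');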

@gireeshpunathil (Member) commented:

> Test failed on macOS in CI because there were only 8 ticks rather than more than 15.

OK, so this time it was too lightweight to produce enough work. How about increasing the ping-pong count from 10 to, say, 100? Given that one unit of work is pretty small, increasing the loop iteration count should be OK, I guess.


Trott commented Feb 17, 2021

> OK, so this time it was too lightweight to produce enough work. How about increasing the ping-pong count from 10 to, say, 100? Given that one unit of work is pretty small, increasing the loop iteration count should be OK, I guess.

On my local machine, there was no noticeable difference in ticks between n=10 and n=100. But up around n=1000, we start to see the ticks go from around 80 to around 200. So, trying n=1024....
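
(In terms of the hypothetical ping-pong sketch earlier in the thread, the change being tried amounts to raising the round budget. Again an illustration only, not the actual diff; the makeDriver helper is made up for the example.)

    // Illustration only: the round budget is the knob that controls how long
    // both threads stay CPU-busy, and therefore how many ticks each records.
    const ROUNDS = 1024; // n=10 and n=100 gave ~80 ticks locally; n>=1000 gave ~200

    function makeDriver(worker, doWork) {
      let rounds = 0;
      return (file) => {
        doWork(file);
        if (++rounds < ROUNDS) worker.postMessage(file);
        else worker.terminate();
      };
    }
    // Hypothetical usage: worker.on('message', makeDriver(worker, doWork));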



Fixes: nodejs#26401
Co-authored-by: Gireesh Punathil <gpunathi@in.ibm.com>

PR-URL: nodejs#37372
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>

Trott commented Feb 20, 2021

Landed in 04fb597

@Trott Trott merged commit 04fb597 into nodejs:master Feb 20, 2021
@Trott Trott deleted the fix-test-worker-prof branch February 20, 2021 18:42
targos pushed a commit that referenced this pull request Feb 28, 2021
Fixes: #26401
Co-authored-by: Gireesh Punathil <gpunathi@in.ibm.com>

PR-URL: #37372
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>