-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: fix test-cluster-disconnect-leak.js for AIX #4674
Conversation
lowered the number of spawned workers as EAGAIN errors were being thrown.
Would you be able to confirm that the modified test you are proposing here still fails on AIX if the three lines added to If you have nvm or some other convenient way to switch between Node versions, you can simply check that the test fires an
|
Hi, it seems that the issue is unrelated to the fix.
Without my change set and increasing the maxuproc to 150 with
allows the test to pass without my changes. So I guess the question is does the number have to be that high for when we have more than 8 cores? Or should I say why is there a need for 16 worker forks per cpu? |
@iwuzhere it's just an empiric value: it was chosen so that the test wuold fail without the changes in |
Alternatively, maybe we could also try switching to an exponential ramp-up in the test so that we're not relying on calibrating the number like this. The magic number may change as hardware and operating systems evolve, so it might be slightly more future-proof to do it that way. |
This is also happening in test-cluster-disconnect-suicide-race.js. The test fails with EAGAIN on AIX for spawning 128 ( 64 * 2 ) workers. |
I think I've found a faster and more reliable/deterministic way to test this issue. PR coming in a bit... |
See alternative at #4736 |
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 cc @santigimeno @iwuzhere
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 cc @santigimeno @iwuzhere
Closing in favor of #4736. If that doesn't work out, we can reopen this one. |
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: #4674 PR-URL: #4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: #4674 PR-URL: #4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: #4739 Ref: #4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: #4674 PR-URL: #4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: #4739 Ref: #4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: #4674 PR-URL: #4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: #4739 Ref: #4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: #4674 PR-URL: #4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: #4739 Ref: #4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 PR-URL: nodejs#4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: nodejs#4739 Ref: nodejs#4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 PR-URL: nodejs#4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: nodejs#4739 Ref: nodejs#4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 PR-URL: nodejs#4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: nodejs#4739 Ref: nodejs#4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Previously, test-cluster-disconnect-leak had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test *seems* to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the `disconnect` event will fire reliably for a single worker. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test now works on Windows. The previous version skipped Windows. * The test should be reliable on any new platform regardless of CPU and RAM. Ref: nodejs#4674 PR-URL: nodejs#4736 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Previously, test-cluster-disconnect-suicide-race had two issues: * Magic numbers: How many times to spawn a worker was determined through empirical experimentation. This means that as new platforms and new CPU/RAM configurations are tested, the magic numbers require more and more refinement. This brings us to... * Non-determinism: The test seems to fail all the time when the bug it tests for is present, but it's really a judgment based on sampling. "Oh, with 8 workers per CPU, it fails about 80% of the time. Let's try 16..." This revised version of the test takes a different approach. The fix for the bug that the test was written for means that the disconnect event will fire on a subsequent tick. So we check for that and the test still fails when the fix is not in the code base and succeeds when it is. Advantages of this approach include: * The test runs much faster. * The test should be reliable on any new platform regardless of CPU and RAM. PR-URL: nodejs#4739 Ref: nodejs#4674 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
lowered the number of spawned workers as EAGAIN errors were
being thrown.