Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: add uv threadpool congestion regression test #23099

Merged

Conversation

gireeshpunathil
Copy link
Member

Validate that massive dns lookups do not block filesytem I/O (or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns lookup were sufficient to engage libuv threadpool workers in a blocking manner, throttling other work items that need the pool. this test acts as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Sep 26, 2018
@gireeshpunathil gireeshpunathil added dns Issues and PRs related to the dns subsystem. libuv Issues and PRs related to the libuv dependency or the uv binding. labels Sep 26, 2018
@Trott
Copy link
Member

Trott commented Sep 26, 2018

This test fails without an internet connection so it probably needs to go in test/internet rather than test/sequential.

Here's what I get if I disconnect from the network:

$ tools/test.py test/sequential/test-uv-threadpool-schedule.js 
=== release test-uv-threadpool-schedule ===                    
Path: sequential/test-uv-threadpool-schedule
assert.js:349
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 10, expected: 0.57
    at GetAddrInfoReqWrap.dns.lookup.common.mustCall (/Users/trott/io.js/test/sequential/test-uv-threadpool-schedule.js:45:14)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:349:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:58:17)
Command: out/Release/node /Users/trott/io.js/test/sequential/test-uv-threadpool-schedule.js
[00:00|% 100|+   0|-   1]: Done  
$

@cjihrig
Copy link
Contributor

cjihrig commented Sep 26, 2018

This also seems like it will have the potential to be flakey when run on the CI.

}));
}

fs.readFile(__filename, common.mustCall((e, d) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thefourtheye - removed, thanks.

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain()}.com`, {}, common.mustCall((e, a, f) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw in an assert.ifError maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, didn't follow you - can you please elaborate? all these lookups are destined to fail, we are latching on the time they take for un-optimized full blown lookups (so as to maximize the threadpool worker thread's engagement)

for (let i = 0; i < slowIOmax; i++) {
// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, using a base string and then keep incrementing an offset would guarantee that there will be no repetition in the domain names used, right? Because, the random numbers might repeat (even though odds of that happening is less)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, followed your suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thefourtheye - I just spotted a drawback of doing this: if you run the test back-to-back, it fails the second time onwards - for obvious reasons (the domain strings being static, they get resolved faster). I don't know if it affects our CI tests though.

function randomDomain() {
const d = Buffer.alloc(10);
for (let i = 0; i < 10; i++)
d[i] = 92 + (Math.round(Math.random() * 13247)) % 26;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to use 97 ('a') instead of 92 ('\')?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the method altogether. (yes, I meant a , don't know how I kept it at 92. ;) )

@gireeshpunathil
Copy link
Member Author

@Trott - moved this to internet folder. I didn't know the rules for placement; I see many tests that perform dns lookup on public websites in sequential folder, but looks like none of them are time sensitive.

@cjihrig - I acknowledge that the test is sensitive to response times. For that matter I took these precautions:

  • simplified the test by making only two most essential assertions (easy for spotting)
  • ideally the fastI/O should not take even 1/1000th of slowI/O (even those do not hold a dependency relation in any ways), but given the plurality of CI system types, I kept a conservative duration of 1/100th of that of slowI/O. If we relax it any further (either by increasing the dns lookup count or by reducing the expected ratio), I fear it violates the purpose of the test.
  • sprinkled comments abundantly so that if it fails one can easily reason out.

If you think any specific construct / logic is prone to flake, please let me know, I am happy to improve on it.

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain}${i}.com`, {}, common.mustCall(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the test still work if we use one of the reserved TLDs (e.g. .test)? https://tools.ietf.org/html/rfc2606

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-  dns.lookup(`${randomDomain}${i}.com`, {}, common.mustCall(() => {
+  dns.lookup(`${randomDomain}${i}.test`, {}, common.mustCall(() => {

@richardlau - I tested with this change (if that is what you meant) and it still works. But what would be an advantage of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reserved TLDs are specifically reserved to avoid conflicts with actual registered domain names, e.g. nonexistent0.com might exist in the future.

They can also avoid unnecessary load on the DNS servers, see https://tools.ietf.org/html/rfc6761 section 6.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardlau : ok - let me clarify: the purpose of the test is to engage (half of) the libuv threadpool workers as long as possible while making sure at least one thread is free and available to serve the file I/Os. So in that context, what we want is a list of domain names that are unique, and were unresolved earlier (resolved names are cached so retrieved faster than we wanted them to), existent or non-existent is not a consideration for the test. If the name is confusing, I can change it to unique0.com etc.

Am I missing anything?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If caching is an issue, we might need to generate domain names that are unique per-test-run, e.g. based on Date.now() or Math.random()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (my first commit 1d0fcefec8a42b71166e572de14b308e1dc97e58) had it based on Math.random(). I guess I will re-instate that, thanks.

@gireeshpunathil
Copy link
Member Author

@Trott
Copy link
Member

Trott commented Sep 30, 2018

I suppose pinging @nodejs/libuv to review may be appropriate here.

@gireeshpunathil
Copy link
Member Author

thanks @Trott . Also cc @nodejs/testing ?

@gireeshpunathil
Copy link
Member Author

re-instated the random domain function, else the test runs the risk of looking up previously resolved domain names and messing up with the time calculation.

new CI: https://ci.nodejs.org/job/node-test-pull-request/17620/

@gireeshpunathil
Copy link
Member Author

arm failure is unrelated:

03:20:28 not ok 1816 parallel/test-tls-generic-stream
03:20:28   ---
03:20:28   duration_ms: 0.949
03:20:28   severity: fail
03:20:28   exitcode: 1
03:20:28   stack: |-
03:20:28     events.js:167
03:20:28           throw er; // Unhandled 'error' event
03:20:28           ^
03:20:28     
03:20:28     Error: 281473541976064:error:140943FC:SSL routines:ssl3_read_bytes:sslv3 alert bad record mac:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1407:SSL alert number 20
03:20:28     
03:20:28         at DuplexSocket.ondata (internal/wrap_js_stream.js:64:22)
03:20:28         at DuplexSocket.emit (events.js:182:13)
03:20:28         at addChunk (_stream_readable.js:283:12)
03:20:28         at readableAddChunk (_stream_readable.js:264:11)
03:20:28         at DuplexSocket.Readable.push (_stream_readable.js:219:10)
03:20:28         at DuplexSocket._write (/home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/test/common/duplexpair.js:28:22)
03:20:28         at doWrite (_stream_writable.js:410:12)
03:20:28         at clearBuffer (_stream_writable.js:540:7)
03:20:28         at DuplexSocket.Writable.uncork (_stream_writable.js:314:7)
03:20:28         at JSStreamWrap.doWrite (internal/wrap_js_stream.js:153:17)
03:20:28     Emitted 'error' event at:
03:20:28         at TLSSocket._emitTLSError (_tls_wrap.js:601:10)
03:20:28         at TLSWrap.onerror (_tls_wrap.js:267:11)
03:20:28         at DuplexSocket.ondata (internal/wrap_js_stream.js:64:22)
03:20:28         [... lines matching original stack trace ...]
03:20:28         at clearBuffer (_stream_writable.js:540:7)
03:20:28   ...

'use strict';

// test to validate massive dns lookups do not block filesytem I/O
// (or any fast I/O).Prior to https://github.com/libuv/libuv/pull/1845
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Space after the dot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @thefourtheye - done.

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain()}.com`, {}, common.mustCall(() => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The callback function can be moved out and used with common.mustCall(..., slowIOmax).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the callback body out of the loop. However, issue with attaching expected call count to the common.mustCall is: the loop executes 100 times, and each time when we invoke common.mustCall with slowIOmax, the total expectation becomes 10K. So I left it as is (default:1). Hope this is fine with you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like

const onResolve = common.mustCall(..., slowIOmax);
...
dns.lookup(..., onResolve);

But your current change is intuitive. Let's go with that.

@Trott
Copy link
Member

Trott commented Oct 4, 2018

@Trott
Copy link
Member

Trott commented Oct 4, 2018

Full CI re-run post-fixup: https://ci.nodejs.org/job/node-test-pull-request/17627/

Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but we might need to be careful about landing this because the libuv patch may be at fault for a regression in our tests, see nodejs/reliability#18 (comment)

@gireeshpunathil
Copy link
Member Author

thanks @addaleax for the info. sure, let me hold this until the libuv refix lands in node.

@gireeshpunathil gireeshpunathil changed the title test: add uv threadpool congestion regression test [do not land] test: add uv threadpool congestion regression test Oct 5, 2018
@gireeshpunathil
Copy link
Member Author

@addaleax - also if you can describe a higher level statement on the flakes that the current libuv caused, I can see if that scenario can be added to this test. Unfortunately I wasn't able to follow root cause of the recent flakes (at the thread level)

@addaleax
Copy link
Member

addaleax commented Oct 5, 2018

@gireeshpunathil So … as far as I can tell, the issue was that slow_io_work_running was increased with one mutex locked, and decreased with another mutex locked; that left it vulnerable to race conditions between workers, and so writes to the variable could get lost. That could make it seem like we have exhausted our # of available threads for slow I/O work, when in reality they were idle; that in turn prevented new DNS requests from being executed, and so the tests linked in libuv/libuv#2021 stalled and eventually timed out.

I think any test that would perform a lot of fast DNS requests (e.g. for localhost) in a row would be hit by that bug, at least on some platforms?

@Trott
Copy link
Member

Trott commented Nov 21, 2018

@addaleax @gireeshpunathil Can this move forward? Or is it blocked on libuv for now? Or something else?

@gireeshpunathil
Copy link
Member Author

I believe the test makes meaningful assertions only in the presence of libuv/libuv@daf04e8 , that is slated to land in node through libuv v1.24.x

@Trott Trott added the blocked PRs that are blocked by other issues or PRs. label Nov 21, 2018
@gireeshpunathil
Copy link
Member Author

Resume Build CI: https://ci.nodejs.org/job/node-test-commit/24015/

@gireeshpunathil
Copy link
Member Author

not sure what to make out of the CI result. the graphical view says something failed, but going into the link I can't see any?

@gireeshpunathil
Copy link
Member Author

to be sure, re-run CI: https://ci.nodejs.org/job/node-test-commit/24081/

@gireeshpunathil
Copy link
Member Author

Failures in both Linux and AIX are the same, test-cli-syntax that is supposedly addressed through #24907

Fresh CI: https://ci.nodejs.org/job/node-test-pull-request/19350/

@gireeshpunathil
Copy link
Member Author

after many runs of CI in expectation of a green sign giving up for now!

test-cli-syntax needs to be fixed / bailed out of the suite.

I will attempt to debug that first to see what is happening.

@gireeshpunathil
Copy link
Member Author

all the known failures are either marked as flaky or resolved; so another round of CI - looks like I can't resume from the old build, it is expired:
https://ci.nodejs.org/job/node-test-pull-request/19854/

@gireeshpunathil
Copy link
Member Author

a test is timing out in arm. Resume Build CI: https://ci.nodejs.org/job/node-test-commit/24705/

@gireeshpunathil
Copy link
Member Author

with 2 separate runs on arm and windows

https://ci.nodejs.org/job/node-test-commit-arm-fanned/5471/
https://ci.nodejs.org/job/node-test-commit-windows-fanned/23644/

now the CI gets a full green.

@gireeshpunathil gireeshpunathil added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Dec 29, 2018
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: nodejs#8436

PR-URL: nodejs#23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@gireeshpunathil gireeshpunathil merged commit 54fa59c into nodejs:master Dec 30, 2018
targos pushed a commit that referenced this pull request Jan 1, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@Trott
Copy link
Member

Trott commented Jan 1, 2019

I'll open a separate issue, but this test failed on node-daily-master (the only place where internet tests are run) last night. Considering it was only added two days, that's probably cause for concern?

https://ci.nodejs.org/job/node-test-commit-custom-suites/810/default/console

test-rackspace-ubuntu1604-x64-1

00:02:11 not ok 25 internet/test-uv-threadpool-schedule
00:02:11   ---
00:02:11   duration_ms: 0.613
00:02:11   severity: fail
00:02:11   exitcode: 1
00:02:11   stack: |-
00:02:11     assert.js:351
00:02:11         throw err;
00:02:11         ^
00:02:11     
00:02:11     AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 17, expected: 3.63
00:02:11         at GetAddrInfoReqWrap.onResolve (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/internet/test-uv-threadpool-schedule.js:41:12)
00:02:11         at GetAddrInfoReqWrap.callback (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/common/index.js:376:15)
00:02:11         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
00:02:11   ...

@thefourtheye
Copy link
Contributor

Interesting. Looks like fast IO is really not 100 times faster than slow IO in our infra.

@gireeshpunathil Would it be okay to change the test to ensure that the fast IO takes lesser time (not 1/100th but simply fastIOtime < slowIOtime) than slow IO? Basically it will ensure that DNS lookups do not impact fs operations.

@gireeshpunathil
Copy link
Member Author

@thefourtheye - yes, and that is @Trott's #25358

refack pushed a commit to refack/node that referenced this pull request Jan 14, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: nodejs#8436

PR-URL: nodejs#23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@BridgeAR BridgeAR mentioned this pull request Jan 16, 2019
@MylesBorins MylesBorins mentioned this pull request Jan 24, 2019
BethGriggs pushed a commit that referenced this pull request Apr 17, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
BethGriggs pushed a commit that referenced this pull request Apr 28, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@BethGriggs BethGriggs mentioned this pull request May 1, 2019
BethGriggs pushed a commit that referenced this pull request May 10, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins pushed a commit that referenced this pull request May 16, 2019
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. dns Issues and PRs related to the dns subsystem. libuv Issues and PRs related to the libuv dependency or the uv binding. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants