Investigate flaky parallel/test-tls-sni-option #26910

mscdex · 2019-03-25T23:28:29Z

Version: master
Platform: macos
Subsystem: tls

On osx1011 (test-macstadium-macos10.11-x64-1):

18:54:26 not ok 1968 parallel/test-tls-sni-option
18:54:26   ---
18:54:26   duration_ms: 0.663
18:54:26   severity: fail
18:54:26   exitcode: 1
18:54:26   stack: |-
18:54:26     events.js:171
18:54:26           throw er; // Unhandled 'error' event
18:54:26           ^
18:54:26     
18:54:26     Error: read ECONNRESET
18:54:26         at TLSWrap.onStreamRead (internal/stream_base_commons.js:182:27)
18:54:26     Emitted 'error' event at:
18:54:26         at emitErrorNT (internal/streams/destroy.js:91:8)
18:54:26         at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
18:54:26         at processTicksAndRejections (internal/process/task_queues.js:81:17)

The text was updated successfully, but these errors were encountered:

MylesBorins · 2019-03-28T05:20:59Z

I've been seeing this locally a bunch

lpinca · 2019-04-01T08:45:39Z

The culprit seems to be this commit: 42dbaed.

Easily reproducible on my machine with:

./tools/test.py -J --repeat=1000 test/parallel/test-tls-sni-option.js

If --node-args=--tls-max-v1.2 is used I can no longer reproduce.

cc: @sam-github

sam-github · 2019-04-18T21:31:18Z

@lpinca What kind of machine do you have? I can't repro with your test.py invocation, or by running:

while ./out/Release/node test/parallel/test-tls-sni-option.js; do; done;

three times in parallel, on linux x64. That said, this test looks like it depends on the order of completion of the handshakes on client relative to server, I am prepping a PR.

sam-github · 2019-04-18T21:46:52Z

I read the top in more detail, it's on OS X, I''ll try on that.

lpinca · 2019-04-19T05:04:44Z

Yes I can't reproduce on Linux, only on macOS.

lpinca · 2019-04-19T05:14:46Z

The following patch fixes it but I'm not sure why.

diff --git a/test/parallel/test-tls-sni-option.js b/test/parallel/test-tls-sni-option.js
index 3a6a231b47..6ccf631169 100644
--- a/test/parallel/test-tls-sni-option.js
+++ b/test/parallel/test-tls-sni-option.js
@@ -115,7 +115,7 @@ let clientError;
 
 const server = tls.createServer(serverOptions, function(c) {
   serverResults.push({ sni: c.servername, authorized: c.authorized });
-  c.end();
+  // c.end();
 });
 
 server.on('tlsClientError', function(err) {
@@ -137,7 +137,8 @@ function startTest() {
         client.authorizationError &&
          (client.authorizationError === 'ERR_TLS_CERT_ALTNAME_INVALID'));
 
-      next();
+      client.on('close', next);
+      client.end();
     });
 
     client.on('error', function(err) {

PR-URL: nodejs#27225 Refs: nodejs#26910 Refs: nodejs#27219 Refs: nodejs#26938 Refs: nodejs#23089 Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yongsheng Zhang <zyszys98@gmail.com>

sam-github · 2019-04-23T01:51:47Z

Without injecting some tracing into the TLS code, I don't know exactly why that is happening, but its not surprising. TLS1.3 continues to exchange some packages after handshake, and handshake completes with different timings, so for some extra messages to arrive after the socket is closed, but for it to be racy, doesn't surprise me. With your change, the close goes from client, to server, then back to client, emits close, and then it's done, so the roundtrip occurs. I fixed bugs like this in other tests, but this one being only a bit racy I never noticed.

lpinca · 2019-04-23T05:29:12Z

Yes but isn't it strange that the race only occurs (or is exacerbated) on macOS?

sam-github · 2019-04-23T15:50:32Z

Systems are different. localhost TCP on Linux doesn't actually do a full TCP protocol, for example, it knows packet loss is impossible because its all in the local host's memory. Also, after a write() on TCP, which process gets scheduled depends on the OS. TLS connections that exchange no data and immediately do an ungraceful close before exchanging data are racy, and our unit tests do a lot of them, but real-world applications establish TLS connections to exchange data. The OpenSSL devs aren't happy that we are exchanging TLS data in the handshake complete callback, that's an issue, too, and perhaps related to this, because it causes a few more packets to be exchanged before the TLS close alert.

I could reintroduce my packet tracing to see exactly what packet is being sent, but that will take a bit to PR, and I'm not sure what the API should look like. I could wireshark, but since the link is encrypted and we lack the feature of dumping the master secret, I won't see the actual packet. I've been meaning to PR an improvement to that, too. Both are open feature requests.

Since I made a number of fixes like this, but with tests that failed more like 1/3 of the time, I'm OK with just doing a graceful TLS end. If we want to wait until TLS debugging features land, that's OK with me, too.

lpinca · 2019-04-23T19:05:37Z

I'm fine with anything as long as the test is not changed from its original intent.
Another possible way to "fix" it without closing from the client is to add a listener for the 'error' event on socket assuming that the error is sometimes expected as per discussion.

sam-github · 2019-04-23T19:56:36Z

The purpose of the test was to check the assertions related to SNI, which isn't affected by how the connection is destroyed.

sam-github · 2019-04-23T20:33:58Z

I reworked my refactor in #27300 so that it uses the same close sequence as original, and the test appears to be stable.

PR-URL: #27225 Refs: #26910 Refs: #27219 Refs: #26938 Refs: #23089 Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Yongsheng Zhang <zyszys98@gmail.com>

mscdex added tls Issues and PRs related to the tls subsystem. macos Issues and PRs related to the macOS platform / OSX. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Mar 25, 2019

lpinca mentioned this issue Apr 1, 2019

Investigate flaky parallel/test-tls-js-stream #26938

Closed

bnoordhuis mentioned this issue Apr 4, 2019

test: test vm.runInNewContext() filename option #27056

Closed

lpinca mentioned this issue Apr 14, 2019

Investigate flaky parallel/test-tls-sni-server-client #27219

Closed

refack mentioned this issue Apr 14, 2019

test: mark some known flakes #27225

Merged

4 tasks

sam-github mentioned this issue Apr 18, 2019

test: deflake test-tls-sni-option #27300

Closed

4 tasks

lpinca closed this as completed Apr 26, 2019

lpinca mentioned this issue May 5, 2019

test: fix flaky test-tls-multiple-cas-as-string #27569

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate flaky parallel/test-tls-sni-option #26910

Investigate flaky parallel/test-tls-sni-option #26910

mscdex commented Mar 25, 2019 •

edited

Loading

MylesBorins commented Mar 28, 2019

lpinca commented Apr 1, 2019

sam-github commented Apr 18, 2019

sam-github commented Apr 18, 2019

lpinca commented Apr 19, 2019

lpinca commented Apr 19, 2019

sam-github commented Apr 23, 2019

lpinca commented Apr 23, 2019

sam-github commented Apr 23, 2019

lpinca commented Apr 23, 2019

sam-github commented Apr 23, 2019

sam-github commented Apr 23, 2019

Investigate flaky parallel/test-tls-sni-option #26910

Investigate flaky parallel/test-tls-sni-option #26910

Comments

mscdex commented Mar 25, 2019 • edited Loading

MylesBorins commented Mar 28, 2019

lpinca commented Apr 1, 2019

sam-github commented Apr 18, 2019

sam-github commented Apr 18, 2019

lpinca commented Apr 19, 2019

lpinca commented Apr 19, 2019

sam-github commented Apr 23, 2019

lpinca commented Apr 23, 2019

sam-github commented Apr 23, 2019

lpinca commented Apr 23, 2019

sam-github commented Apr 23, 2019

sam-github commented Apr 23, 2019

mscdex commented Mar 25, 2019 •

edited

Loading