Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigate flaky test-http-client-race-2 on SmartOS #11026

Closed
Trott opened this issue Jan 26, 2017 · 9 comments
Closed

investigate flaky test-http-client-race-2 on SmartOS #11026

Trott opened this issue Jan 26, 2017 · 9 comments
Labels
http Issues or PRs related to the http subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests.

Comments

@Trott
Copy link
Member

Trott commented Jan 26, 2017

  • Version: v8.0.0-pre
  • Platform: SmartOS16-64
  • Subsystem: http

https://ci.nodejs.org/job/node-test-commit-smartos/6613/nodes=smartos16-64/console

not ok 718 parallel/test-http-client-race-2
  ---
  duration_ms: 60.142
  severity: fail
  stack: |-
    timeout
@Trott Trott added test Issues and PRs related to the tests. http Issues or PRs related to the http subsystem. smartos Issues and PRs related to the SmartOS platform. labels Jan 26, 2017
@Trott
Copy link
Member Author

Trott commented Jan 26, 2017

I wonder if we should open a single issue for this and all the other first-time SmartOS timeout test failures we've been seeing in the last week or three.

#11026
#11019
#11003
#10966
#10592

One thing they all seem to have in common: They are all http/tls tests. Totally guessing here, but maybe the test suite is triggering some sort of firewall/intrusion detection on the host causing it to silently drop a connection or something like that?

@nodejs/testing @nodejs/platform-smartos @nodejs/http

@misterdjules
Copy link

@Trott I'll try to take a look at this before the end of the week, if not early next week at the latest. Sorry for the delay :(

@misterdjules
Copy link

Looking at it now.

@misterdjules
Copy link

misterdjules commented Jan 31, 2017

So far, I haven't been able to reproduce the problem described by any of the issues listed above.

In order to be able to get more information and investigate future spurious failures, I submitted a PR that sends SIGABRT instead of SIGTERM to test processes that timeout. This will allow us to take a look at core files generated from these processes with tools such as llnode and mdb_v8, and will potentially help us root cause these issues.

In the meantime I'll continue trying to reproduce and investigate those issues, I'll keep you posted.

@Trott
Copy link
Member Author

Trott commented Jan 31, 2017

I haven't noticed any SmartOS failures lately that are similar to these. It's possible these issues have stopped happening on CI. In which case, they may be very difficult to reproduce at this point.
¯\(ツ)

misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Feb 2, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

Refs: nodejs#11026
jasnell pushed a commit that referenced this issue Feb 3, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@misterdjules
Copy link

#11086 was merged and nodejs/build#613 was created to make tests that time out generate a core file that could be inspected to help root cause these failures.

Thus, I'd suggest that we mark this test (and the other flaky ones mentioned in #11026 (comment)) as flaky on SmartOS.

Then we should make sure that when a build is marked "unstable" on SmartOS, we grab the core files that are generated and upload them somewhere (we can use Joyent's manta for that, as this is part of the resources donated by Joyent to the project) where they won't be cleaned up for further investigation.

How does that sound?

@misterdjules
Copy link

Thus, I'd suggest that we mark this test (and the other flaky ones mentioned in #11026 (comment)) as flaky on SmartOS.

Thinking about it more, marking these tests as flaky might not be the best way forward. It is possible that these timeouts are due to the environment more than due to the tests themselves or to a bug in node on SmartOS.

Making these tests flaky could potentially make actual future failures silent, whereas leaving them non-flaky would make actual future failures trigger a failed build.

Since these failures don't seem to be too frequent, it shouldn't have a significant impact on productivity, and we'll be able to use the new --abort-on-timeout command line option to gather more data on the problem when these tests fail again (see nodejs/build#613).

Ideally, in the meantime we could also gather more data on the system and the test processes when a test times out by adding more instrumentation to tools/test.py.

Thus, I think I'm leaning towards closing these issues and using the new instrumentation/adding more instrumentation to help us investigate future failures instead of marking these tests as flaky.

@Trott
Copy link
Member Author

Trott commented Feb 3, 2017

SGTM. I don't want to mark tests as flaky in general except as an absolutely last resort. In the past, there were so many unreliable tests that we had to mark tests flaky or else we risked people ignoring all CI failures. Things have improved a lot since then. Marking as flaky reduces the visibility of the test failures. I'd rather keep them visible.

italoacasas pushed a commit that referenced this issue Feb 4, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Feb 13, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: nodejs#11086
Ref: nodejs#11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
italoacasas pushed a commit to italoacasas/node that referenced this issue Feb 14, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: nodejs#11086
Ref: nodejs#11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
misterdjules pushed a commit that referenced this issue Feb 15, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
misterdjules pushed a commit that referenced this issue Feb 15, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins pushed a commit that referenced this issue Feb 21, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins pushed a commit that referenced this issue Feb 21, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: #11086
Ref: #11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
krydos pushed a commit to krydos/node that referenced this issue Feb 25, 2017
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: nodejs#11086
Ref: nodejs#11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@Trott
Copy link
Member Author

Trott commented Jul 16, 2017

Haven't seen this in a long time. Closing. Feel free to re-open or comment if it should be open.

@Trott Trott closed this as completed Jul 16, 2017
abhishekumar-tyagi pushed a commit to abhishekumar-tyagi/node that referenced this issue May 5, 2024
Currently, when a process times out, it is terminated by sending it the
SIGTERM signal. Sending SIGBART instead allows the operating system to
generate a core file that can be investigated later using post-mortem
debuggers such as llnode or mdb_v8.

This can be very useful when investigating flaky tests that time out,
since in that case the failure is difficult to reproduce, and being able
to look at a core file makes a big difference.

With these changes, passing the --abort-on-timeout command line option
to tools/test.py now sends SIGABRT to processes timing out on all
platforms but Windows.

PR-URL: nodejs/node#11086
Ref: nodejs/node#11026
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

2 participants