Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -showDebugAddress to enable debugging hanging parallel tests #461

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

onsi
Copy link
Owner

@onsi onsi commented Apr 25, 2018

The experience when parallel tests hang currently sucks. (See #323 and #206)

I had some time recently to try out a few ideas and landed on this PR that I'd like feedback on.

The basic premise is that when you now run ginkgo -p -showDebugAddress Ginkgo will emit a debug http address that delivers a JSON payload that represents:

  • the current running nodes
  • the current running test on each node
  • any stdout output emitted by that test so far
  • any GinkgoWriter writes emitted by that test so far
  • when the test started running

The hope is that this information can be useful to help debug the tests. The JSON output is a bit noisy (it's a serialization of some of the reporter types).

This would ultimately benefit from some sort of user-friendly client (e.g. ginkgo debug <DEBUG ADDRESS> could emit some pretty printed output; or one could imagine a simple web client to render the output nicely). There are also alternative ways to do this (e.g. send a signal to ginkgo to emit this JSON to stdout instead of hitting an HTTP endpoint).

Would appreciate any and all feedback and also contributions to improve the user experience.

@jvshahid
Copy link
Collaborator

If this information is available on the coordinating node why wouldn't a combination of running ginkgo with a new flag (sorry couldn't come up with a good name) & C-c be sufficient. If you interrupt ginkgo in this new mode it could print the following for any child node that is still running:

  1. send it sigabrt to capture the stack trace in the stderr
  2. wait for the node to exit
  3. print the current test that is running on that node and when it was started
  4. GingkgoWriter output
  5. Standrad out output
  6. Standard err output (which should include the stacktrace dump)

I would like to point out that the stacktrace is very important to me when debugging hanging tests.

@onsi
Copy link
Owner Author

onsi commented Apr 26, 2018

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

@onsi
Copy link
Owner Author

onsi commented Apr 26, 2018

Actually - I've received an alternative solution suggestion that would be easier to implement and reason about. Instead, what if:

ginkgo -p -debug (for example)

caused each node to stream its output to a file on disk. Then the user could examine those files to understand what each node is doing and whether it is stuck.

In this mode a signal could be interpreted by the coordinating process as a prompt to have the nodes emit their stack trace - perhaps to the same file.

Thoughts?

@blgm
Copy link
Collaborator

blgm commented Apr 26, 2018

A bit more on the 'files on disk' suggestion:

  • Initially it could be controlled by an environment variable, e.g. GINKGO_DEBUG_DIR
  • it would be good to specify the directory as you might want to put the output somewhere that it can be captured (for instance if running in Concourse)
  • the style of files could be similar to UNIX syslog - i.e. a high resolution timestamp followed by human-readable text
  • I don't think there's any need to do clever log rotation etc... for the first implementation
  • Another (even simpler) implementation would be to have each worker log to STDOUT. This could be controlled by a GINKGO_DEBUG environment variable, or with multiple uses of the -v flag - a bit like ssh which allows things like ssh -v -v -v.

@onsi
Copy link
Owner Author

onsi commented Apr 26, 2018

I think I lean towards a flag over an env-var as it is ginkgo's primary interaction model. Can you share why environment variables might be preferred?

@jvshahid
Copy link
Collaborator

My only concern with the files on disk is it could be harder to get those files if the tests are running on concourse for example. Collecting the information and printing it to stdout could help debugging hanging tests in CI if concourse sends a SIGTERM and allow enough time for ginkgo to print the information to stdout

@anEXPer
Copy link

anEXPer commented Apr 26, 2018

Files on disk is easy to look at when debugging failures in CI with hijack, etc, and easy to follow live locally with tail -f.

stdout isn't suitable, both because of redaction concerns in CI, and because of difficulties watching output from multiple nodes in a single stream when watching locally.

I prefer writing to a file as a solution over both an API and stdout.

@blgm
Copy link
Collaborator

blgm commented Apr 27, 2018

@onsi, I don’t have a strong view on whether it’s a flag or an environment variable. I would lean towards an environment variable if it’s a “pre-release” feature that I want feedback from specific users, and might change radically. And a command line flag is more appropriate for something that’s stable and is ready for used by anyone.

@onsi
Copy link
Owner Author

onsi commented May 6, 2018

I've pushed a file-based approach to parallel-debug-file. Would love folks to take a look and send feedback. @williammartin might need your help vendoring a couple of dependencies that needed to get pulled in if we choose to go down this road.

With this approach

ginkgo -p -debug

will emit output from each node to files named ginkgo-node-N.log in the test's directory.

@blgm
Copy link
Collaborator

blgm commented May 29, 2018

@onsi, after what must seem like a geological age, here are some very quick thoughts on the parallel-debug-file branch:

  • If you run ginko recursively, then you get a lot of files which are a bit tricky to clean up. You could add them to .gitignore obviously. I just wonder whether a better user experience would be to put them somewhere else. But that could be fixed in a later iteration.
  • It would be nice if the lines had ISO 8601 timestamps at the start, a bit like a traditional UNIX syslog. That's particularly useful when looking at potential hangs as you have some idea about when the last line was written.
  • I wouldn't worry too much about vendoring just yet. The branch works in Travis. Vendoring in Go is very much in flux, and there are different solutions for different versions of Go.

@onsi
Copy link
Owner Author

onsi commented May 30, 2018

thanks for taking a look @blgm

I'd love to find an interface that works well. I think the file approach has merit and agree that ending up with lots of files all over the place isn't great. Any thoughts on an alternative design?

As for timestamps. Unfortunately given the syscall hackery involved in rerouting stdout I don't think I can inject timestamps.

@blgm
Copy link
Collaborator

blgm commented May 30, 2018

Perhaps creating them all in the same directory with a name scheme like ginkgo-node-U-N.log where U is some kind of ID? But it would make it harder to correlate files to the code that they relate to.

I'd be in favour of merging this as is, with a caveat that it's an experimental feature subject to change. Most users would not add the --debug flag so would not see any difference. But users who observe occasional Ginkgo hangs would have a tool to help them out. We could then get feedback from these users, and base any future changes on that feedback.

@nodo
Copy link
Collaborator

nodo commented May 31, 2018

Just out of curiosity, what were the reasons to not follow the suggestion from @vito in #323 ?

I am not super familiar with the way Ginkgo handles parallelism, but each node could trap SIGQUIT and output the stack dump to a known file. Maybe we could also create a new process group with the coordinating node and the nodes running the test, so that a user could kill them all with one command and get the stack dumps.

The advantage of this approach is that there's no need to start Ginkgo with special flags. A disadvantage is that maybe the stack dump is harder to read than logs from the nodes.

@nodo
Copy link
Collaborator

nodo commented May 31, 2018

Oh I have missed the second comment on this thread.

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

^_ fair enough, but it assumes users are starting ginkgo with the --debug flag. Which is a bit strange... unless they are having issues :)

@onsi
Copy link
Owner Author

onsi commented Jul 10, 2018

Good point @nodo

We should update the docs here:
https://onsi.github.io/ginkgo/#ginkgo-and-continuous-integration

To recommend that folks add the debug flag in CI environments

@blgm
Copy link
Collaborator

blgm commented Jul 10, 2018

I've merged the file solution via #499. I'll leave this open to remind us to update the docs.

@bruce-ricard
Copy link
Contributor

bruce-ricard commented Apr 15, 2019

I'll leave this open to remind us to update the docs.

Is it working @blgm ? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants