This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

V8: avoid deadlock when profiling is active #25309

Closed
wants to merge 1 commit into from

Conversation

dmelikyan

A deadlock happens when the sampler, triggered by SIGPROF, tries to
acquire the process-wide mutex while the thread it interrupted
already holds that mutex. As a result, the other thread involved in
the sampling process hangs. The patch checks whether the mutex is
held before continuing the sampler operation.

The fix has been tested on a sample app under load with and without
profiling turned on.

Fixes issue #14576 and specifically the duplicate issue #25295
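
For readers unfamiliar with the pattern, the check can be sketched with a plain pthread mutex. This is an illustration only, not the V8 code: `process_wide_mutex`, `HandleProfilerSignal`, and the two counters are hypothetical names, and the early return on a failed try-lock is an assumption based on the "return if there is a thread lock" comment in the patch.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical stand-in for V8's process-wide mutex.
static pthread_mutex_t process_wide_mutex = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical counters, only here so the behavior can be observed.
static volatile sig_atomic_t samples_taken = 0;
static volatile sig_atomic_t samples_skipped = 0;

// Sketch of a SIGPROF handler that refuses to block on the mutex.
extern "C" void HandleProfilerSignal(int signo) {
  if (signo != SIGPROF) return;

  // Probe the mutex without blocking. If the interrupted thread already
  // holds it, a blocking lock here would never return.
  if (pthread_mutex_trylock(&process_wide_mutex) == 0) {
    // The mutex was free: release it again and carry on with the sample.
    pthread_mutex_unlock(&process_wide_mutex);
  } else {
    // The mutex is held: skip this sample instead of deadlocking.
    samples_skipped = samples_skipped + 1;
    return;
  }

  // ... the real sampler would record the sample here ...
  samples_taken = samples_taken + 1;
}
```

The trade-off is that a sample is dropped whenever the signal arrives while the mutex is held, which seems acceptable for a statistical profiler.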

@JacksonTian

This issue has been reported multiple times recently. cc @misterdjules

@mhdawson
Member

mhdawson commented Jun 4, 2015

FYI @tunniclm is validating this change against issues we've seen as well

@tunniclm

tunniclm commented Jun 5, 2015

Background
I ran into this hang as well while testing some code that uses the CpuProfiler API. My test would hang in about 1 in 60 runs. To keep testing that code, I made some rough changes to try to make my local V8 stable. I explored two approaches, both different from the one in this PR:

  1. Make the process_wide_mutex_ recursive -- this did not resolve the hang but did narrow the window significantly (it would hang in about 1 in 300 runs); the problem would now only happen when pthread_mutex_lock() was interrupted at a critical point
  2. Block the SIGPROF signal using pthread_sigmask() while acquiring and releasing the process_wide_mutex_ -- this worked (no hangs in 3000 runs)
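
Approach 2 could look roughly like the sketch below. It is not the code that was tested here: `process_wide_mutex`, `SigprofBlockedLock`, and `IsSigprofBlocked` are hypothetical names, and a plain pthread mutex stands in for V8's process_wide_mutex_.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical stand-in for V8's process_wide_mutex_.
static pthread_mutex_t process_wide_mutex = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical RAII guard: blocks SIGPROF on the calling thread before
// taking the mutex, and restores the previous signal mask after releasing it.
class SigprofBlockedLock {
 public:
  SigprofBlockedLock() {
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGPROF);
    // Save the old mask so that nested or pre-existing blocks are preserved.
    pthread_sigmask(SIG_BLOCK, &block, &old_mask_);
    pthread_mutex_lock(&process_wide_mutex);
  }

  ~SigprofBlockedLock() {
    pthread_mutex_unlock(&process_wide_mutex);
    pthread_sigmask(SIG_SETMASK, &old_mask_, nullptr);
  }

  SigprofBlockedLock(const SigprofBlockedLock&) = delete;
  SigprofBlockedLock& operator=(const SigprofBlockedLock&) = delete;

 private:
  sigset_t old_mask_;
};

// Helper for inspection: reports whether SIGPROF is currently blocked
// on the calling thread.
bool IsSigprofBlocked() {
  sigset_t current;
  // With a null set, pthread_sigmask only reports the current mask.
  pthread_sigmask(SIG_BLOCK, nullptr, &current);
  return sigismember(&current, SIGPROF) == 1;
}
```

Because SIGPROF stays blocked for the lifetime of the guard, a sampling signal aimed at this thread is held pending until the mask is restored, so the handler cannot run while this thread holds the mutex.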

While preparing to share this with the community, I found @dmelikyan's work. I wanted to check the fix against my tests in case they exposed a window similar to the one in (1).

Result
I have run my testcases against the fix in this pull request 3000 times on Linux x86 and Linux PPC64 and there were no hangs, so this fix looks good to me. :)

// ISSUE-14576: To avoid deadlock, return if there is a thread lock
if (Isolate::GetProcessWideMutex().Pointer()->TryLock()) {
  Isolate::GetProcessWideMutex().Pointer()->Unlock();
}

Style: I'm not familiar with V8's code base's style guidelines, but I think this line should be } else { (curly braces on the same line as the else keyword).

@misterdjules

@tunniclm Thank you for the test report!

@dmelikyan Besides my two comments, I think it's worth trying to come up with a regression test. Is this something you have time to work on? Thank you very much again for your work and your patience!

@misterdjules misterdjules added the V8 label Jun 6, 2015
@dmelikyan
Author

@misterdjules I've added a regression test. Not sure if "simple" is the right place for it though. Also fixed the style problem.

@dmelikyan dmelikyan force-pushed the fix-issue-14576 branch 2 times, most recently from 172f9c3 to b9a5b85 Compare June 8, 2015 10:21
@dmelikyan
Author

@misterdjules Switched to using a mutex pointer instead of the object, in case that is related to your comments.


var child = undefined;

var deadlockTimer = setTimeout(function() {

The test suite driver has a default timeout of 60 seconds, and we usually don't set up timeouts in the tests themselves, so I would suggest removing that mechanism. If the test times out, it will eventually be killed by the test suite driver and the failure will be reported as a timeout.

Author

The spawned, hanging node processes aren't killed by the test suite, because they do not die when the parent process dies (at least on Mac). They have to be killed explicitly. I've changed the way this is done so that it is compliant with the timeout option. The test script is also run multiple times now.

@dmelikyan That's great, thank you!

@misterdjules

@dmelikyan test/simple is the right place if the test runs very quickly (under a couple of seconds) and doesn't depend on any external dependency (which is the case for your test).

});
}

var testScript = "var i = 0; function r() { if(++i > 25) return; setTimeout(r, 1); }; r();";

Style: wrap at 80 columns by concatenating the string.

@misterdjules

@dmelikyan Thank you very much for making the changes according to my comments 👍 I think we're making some good progress to get this landed soon. I just have one remaining question and some styling comments.

@dmelikyan
Author

@misterdjules See the latest commit for the style changes. One other thing: running with --prof creates files, e.g. isolate-0x101004c00-v8.log, in the node directory. Should we worry about that? I'll try to find a way to remove them, or to disable them if that is possible at all.

@misterdjules

@dmelikyan That's a good question. I think it's worth investigating whether we can remove them when the test completes.

@misterdjules

@dmelikyan @tunniclm So I think this should be ready to land once we determine what can be done with the log files generated by the test. Thank you for the great work so far!

@dmelikyan
Author

@misterdjules There is a V8 option for disabling per-isolate log files. See my last commit. No log files are generated now.

@misterdjules

@dmelikyan Great! Now let's just squash the commits into one and we'll be ready to land it.

@dmelikyan
Author

@misterdjules squashed.

@misterdjules

LGTM.

misterdjules pushed a commit that referenced this pull request Jun 10, 2015
Reviewed-By: Julien Gilli <julien.gilli@joyent.com>
PR-URL: #25309
@misterdjules

Landed in b81a643, thank you all 👍

@misterdjules

Added a reference to this change to the list of floating patches on top of V8 in v0.12.
