Bogus data being marked for GC sweeps under Rubinius #1047
Comments
Extra info: Rubinius has two GC systems in place, Immix and Baker. Baker is used for the young generation, Immix for the mature generation (if I'm not mistaken). In a production application the above issue mainly occurs in the Baker system whereas in the script mentioned above the issue primarily occurs in the Immix system. |
http://hastebin.com/raw/rolenakale This pastie contains a sample of the backtraces of all threads when this issue would occur in a production application. The relevant bit is the following:
This shows that it occurs both in the young and mature generation. Corresponding Rubinius issue: rubinius/rubinius#2908 |
Backtraces of the repro script: http://hastebin.com/raw/pexacaladi |
This indirectly fixes one of the issues exposed in the following: sparklemotion/nokogiri#1047 #2844 This is not the root cause of the Nokogiri-related segfaults, which appear to be caused by Nokogiri (libxml2) releasing data for a node while a Data object wrapping it is still reachable and hence is still having mark() called on it.
I ran the repro script that @yorickpeterse provided, with one change: I upped the threads to 100. I ran this on an Ubuntu 12.04 vbox machine on rubinius/rubinius@56d2edb with rubinius/rubinius@eba5d40 and rubinius/rubinius@6da3885 applied. I built Rubinius and Nokogiri without optimizations and with debugging symbols enabled. I instrumented Nokogiri (https://gist.github.com/brixen/9138421#file-nokogiri_1047-diff) to track where libxml2 pointers were being wrapped and where xmlFree* was being called. I also added instrumentation around the malloc, realloc, and free functions that Nokogiri configures libxml2 with. I instrumented Rubinius (https://gist.github.com/brixen/9138421#file-rbx-diff) to track Data objects being finalized. I've run this multiple times with the same results:
This gist is the output of running ack on the log file for the pointer that causes the fault: https://gist.github.com/brixen/9106890. I built the instrumentation up over time, but the same pattern is visible. This run is the last: https://gist.github.com/brixen/9106890#file-gistfile7-txt The pointer causing the fault is a member of a NodeSet and is wrapped at this point in Nokogiri code: https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node_set.c#L268 I attempted to use xmlCopyNode, passing 1 for the extended parameter, but that function does not completely copy the node. In particular, the node->doc was 0x0 in the copy, as were parent, next, previous, IIRC. The fundamental issue appears to be that Nokogiri is assuming the lifetime of the libxml2 object that it is wrapping bounds the lifetime of the Ruby object in which the pointer is wrapped. This assumption appears to be invalid in this particular case, which causes me to wonder how many other cases are invalid as well, but by some happy coincidence, usually work. If this is the case, Nokogiri is fundamentally unsound. |
Also, I have all those logs if they are useful, but they are really big (each ~300mb gzip'd). |
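To make the malloc/free instrumentation described above more concrete, here is a minimal sketch of a logging allocator that also flags a free of an address with no live allocation. It is not the instrumentation from the linked gists; the names traced_malloc, traced_free, live, and double_free_count are hypothetical, and the wiring into libxml2 (which accepts custom allocator hooks through xmlMemSetup) is left out. A realloc hook would follow the same pattern.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_TRACKED 1024

/* Hypothetical side table of addresses currently believed to be live.
 * Not thread-safe: a multi-threaded repro would need a mutex around it. */
static void *live[MAX_TRACKED];
static size_t live_count = 0;
static unsigned long double_free_count = 0;

/* Returns the table index of p, or -1 if p is not recorded as live. */
static int find_live(void *p) {
    for (size_t i = 0; i < live_count; i++)
        if (live[i] == p) return (int)i;
    return -1;
}

/* Allocate, record the address, and log the call. */
void *traced_malloc(size_t size) {
    void *p = malloc(size);
    /* Sketch limitation: allocations made while the table is full are not
     * tracked and would later be reported as unknown frees. */
    if (p != NULL && live_count < MAX_TRACKED)
        live[live_count++] = p;
    fprintf(stderr, "malloc(%zu) = %p\n", size, p);
    return p;
}

/* Log the call; release memory only if the address is recorded as live. */
void traced_free(void *p) {
    if (p == NULL) return;
    int idx = find_live(p);
    if (idx < 0) {
        /* Never allocated through traced_malloc, or already freed. */
        double_free_count++;
        fprintf(stderr, "free(%p) with no live allocation, ignored\n", p);
        return;
    }
    live[idx] = live[--live_count];  /* swap-remove from the table */
    fprintf(stderr, "free(%p)\n", p);
    free(p);
}
```

Searching the resulting log for a single address then shows its malloc/free history, and a non-zero double_free_count points at frees with no intervening allocation.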
With this change, whatever the issue is in sparklemotion/nokogiri#1047 #2844 does not manifest during more than 1 hour of running the repro script from the Nokogiri ticket with 100 threads. If a bug never appears, does it exist?
There are two scenarios that seem plausible. Rubinius has a concurrent mark thread for the mature generation. It's possible that Nokogiri calls a libxml2 function that releases a tree, ultimately calling free on a node that the mark thread reaches after libxml2 calls free, but before Nokogiri finishes making the Ruby Data objects unreachable. The other scenario is that Nokogiri is not thread safe in some respect. Or it could be a combination of the two. Note that here, free is called 4 times on an address with no intervening allocs, and here, it appears alloc is returning the same address twice with no intervening free, then free is called twice. The latter could be the result of interleaving of output, but the former is hard to explain. Running Rubinius with the Ultimately, I added a C-API lock specifically for Nokogiri. This also appears to prevent the issue from manifesting in over 60 min of processing. |
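The idea of a lock around extension calls can be sketched as follows. This is only an illustration of the technique, not the Rubinius implementation: call_native_locked, capi_lock, bump, and run_workers are hypothetical names, and the "native call" is just an unsynchronized counter increment that would race without the lock.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global lock serializing every call into the extension. */
static pthread_mutex_t capi_lock = PTHREAD_MUTEX_INITIALIZER;

typedef void (*native_fn)(void *);

/* Run fn(arg) while holding the lock, so only one thread at a time
 * executes extension code. */
void call_native_locked(native_fn fn, void *arg) {
    pthread_mutex_lock(&capi_lock);
    fn(arg);
    pthread_mutex_unlock(&capi_lock);
}

/* Stand-in for thread-unsafe native code: a plain, non-atomic counter. */
static long shared_counter = 0;
static void bump(void *arg) { (void)arg; shared_counter++; }

#define CALLS_PER_THREAD 10000
#define MAX_THREADS 64

static void *worker(void *arg) {
    for (int i = 0; i < CALLS_PER_THREAD; i++)
        call_native_locked(bump, arg);
    return NULL;
}

/* Start up to MAX_THREADS workers, wait for them, return the final count. */
long run_workers(int nthreads) {
    pthread_t threads[MAX_THREADS];
    int started = 0;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[started], NULL, worker, NULL) != 0)
            break;  /* stop on failure and join only what was started */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return shared_counter;
}
```

The trade-off is the one noted later in the thread: correctness is bought by giving up all parallelism inside the extension.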
VMs like the JVM that have concurrently-executable garbage collectors generally do not allow those collectors to run concurrently with C calls that may be accessing managed objects. In fact, under most circumstances JVMs do not allow GC to run at all while unmanaged code is executing, unless you know the proper VM-specific incantations to allow that to happen. If Rubinius is actually running GC concurrent to unmanaged code, this will not be the only issue to come up. Part of the implicit contract of MRI's C API is that unless GC is explicitly invoked, objects in hand will live until execution returns to the VM. The JVM's JNI API makes the same guarantee for the same reasons: you can't GC objects that native code might still be using. I would recommend not allowing Rubinius's GC to run during the lifetime of a C extension downcall for all C extensions. |
Rubinius was not collecting an object. It was marking it, per the mark function set on the Data object. I'd argue it's a bug to wrap a structure that is being free'd independent of the available free function on the Data object. |
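The lifetime problem can be pictured with a small sketch. The structs below are stand-ins, not libxml2's xmlNode or a real Data object, and every name is hypothetical. The sketch assumes the native side can reach the wrapper through a back-pointer (similar in spirit to a private slot on the node) and detaches it before freeing, so that a later mark call skips the dead pointer instead of dereferencing it.

```c
#include <stdlib.h>
#include <stddef.h>

struct wrapper;

/* Stand-in for a native tree node; NOT libxml2's xmlNode. */
struct native_node {
    int type;
    struct wrapper *owner;  /* back-pointer to the wrapping object, if any */
};

/* Stand-in for a Data object that wraps a native pointer. */
struct wrapper {
    struct native_node *node;  /* NULL once the native side has freed it */
};

struct native_node *native_new_node(int type) {
    struct native_node *n = malloc(sizeof *n);
    if (n == NULL) return NULL;
    n->type = type;
    n->owner = NULL;
    return n;
}

/* Wrap a node and record the wrapper on the node itself. */
struct wrapper *wrap_node(struct native_node *n) {
    struct wrapper *w = malloc(sizeof *w);
    if (w == NULL) return NULL;
    w->node = n;
    n->owner = w;
    return w;
}

/* Native-side free: detach the wrapper before releasing the memory. */
void native_free_node(struct native_node *n) {
    if (n->owner != NULL) n->owner->node = NULL;
    free(n);
}

/* Mark function: returns 1 if it read native memory, 0 if it skipped
 * because the node was already released. */
int wrapper_mark(const struct wrapper *w) {
    if (w->node == NULL) return 0;
    int type = w->node->type;  /* the read that would fault on freed memory */
    (void)type;
    return 1;
}
```

Without the detach step in native_free_node, the second mark call would read freed memory, which is the failure described in this thread. The check alone does not help against a mark thread running concurrently with the free; that still needs synchronization.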
@brixen Perhaps I'm confused about your GC... isn't marking part of the GC cycle? Or is there some other "marking" going on that's unrelated to GC? |
Perhaps I can clarify things a bit. What I meant was that within a given C downcall, Rubinius should probably guarantee that no concurrent interaction of GC mechanisms (marking, sweeping, compacting) can occur while that C call is active. Further, the MRI C API guarantees (implicitly, I'll grant) that VM-level mechanisms like GC will not interact with any objects (excessive, perhaps, but illustrative when you are attempting to emulate that API) while a C extension is executing. I think Rubinius will continue to run into problems with C extensions as long as it allows VM-level processes to run concurrently with C extension downcalls. |
Was the Rubinius team able to sort out exactly what's causing this? I have heard vague claims that "libxml isn't thread-safe" or "nokogiri isn't thread-safe" but nobody seems to have recorded exactly what makes them unsafe. That would be very useful information to give to either the Nokogiri team or the libxml maintainers. If this is fixable in Nokogiri, it should be fixed, but we need more information on what to fix. If it's not fixable in Nokogiri but exposes a problem in libxml, the libxml maintainers should be notified and this bug should be closed (and libxml should be fixed, but that's somewhat out of our control). If this is not fixable in Nokogiri or libxml, then it would indicate something's wrong in Rubinius, and this bug should be closed. |
I would like some clarification about @brixen's "same result" comment above. Specifically:
What does this mean? A pointer is a memory address. Do you mean the object that wraps the pointer? And marked by whom?
What are these indications? Do you have evidence to support this claim?
Again I'm unclear what you mean by "mark the pointer". If the pointer is allocated by libxml, then marking does not apply; it's a pointer to opaque memory libxml2 controls. I think you mean that the pointer gets freed independently of the DATA-related free function associated with the object? And you believe it's libxml doing it? |
I've tried to consume as much of this bug as possible, and I'm still confused why you think that the problem is in nokogiri or libxml. In #1047 (comment) @brixen said that adding a lock around nokogiri native calls and turning off the concurrent GC fixes the problem. That could indicate a concurrency issue in Nokogiri or libxml, true. But it could also point at Rubinius's concurrent GC or a problem with the test (especially if the threads running in Rubinius are not using isolated objects from Nokogiri). What's the status? Do the changes that fixed the issue for @brixen fix it in general? |
I've been running the I've tried:
Any suggestions on how I can more reliably reproduce it? I've tried increasing and decreasing the number of threads; what else should I be trying? Possibly-notably, I don't seem to be able to saturate my CPUs. Not having much experience using Rubinius, I wonder if I'm missing something obvious? Any help or advice would be greatly appreciated. Thanks. -m |
@flavorjones I added a C-API lock specifically for Nokogiri, so you'll get no parallelization in Nokogiri code when running threads on Rubinius https://github.com/rubinius/rubinius/blob/a7a6cb0052e3a3c19a3559f43cb1eb973251bfb5/vm/shared_state.cpp#L432 Perhaps @yorickpeterse has a better repro. |
@brixen I tried running the script with and without the lock, neither crashed for me on Rbx master. I'll try with my actual apps in the coming week to see what happens there. |
@yorickpeterse when you say, "with and without the lock", what do you mean? Nokogiri gets its own artisanal, free-range, environmentally friendly special lock now. How are you running, "without the lock"? |
@brixen As in, I removed the |
@yorickpeterse ah, that is good to know! |
@yorickpeterse, any luck seeing this happen in the wild in the last 10 days? |
@flavorjones Sorry, I haven't gotten to it yet. We're right in the middle of a big database migration this week, I'll try to take a look at it this Friday/next week. |
@yorickpeterse - just bumping to keep this on your radar. Would love more info if you have it. |
I tried my app again and it did crash once with a backtrace similar to the ones discussed above. Sadly I didn't set up logging of crashes properly making copy-pasting the error a total nightmare. I'm currently running the application under GDB but of course now it refuses to crash. |
Running modern nokogiri under rbx-3, these crashes don't seem to occur (at least not in CI). I'd like to propose closing this, unless there's new and compelling information that I can use to investigate. |
Closing. If anyone comes across this later, rbx3 builds with both system and vendored libraries are here, and look green: |
Problem: in certain cases it seems that Nokogiri is marking (using the mark function found at https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L15) bogus data to be swept by the GC. Under MRI this seems to magically work but under Rubinius this triggers a segfault.
A script that reproduces this is as follows: https://gist.github.com/YorickPeterse/29bdbab31c0cabdc66b2. When running this make sure you're using Rbx 2.2.5 and Nokogiri from Git source.
Observations so far:
The mark() function seems to handle xmlElementType structures which are then marked at https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L23. This in some cases triggers a segfault.
Samples from gdb:
The corresponding backtrace:
I doubt this particular issue is caused by threading; it's more likely that this is an issue similar to the one described in #939. The use of threading simply increases the amount of GC activity, which in turn causes this issue to surface much quicker.