Occasional async readback deadlocks on GLX #177
https://gist.github.com/4066441#file_app_node.trace I lost the original app node log. The log above is from a different run.
I solved a couple of these for 1.4 already, will have a look at it.
It also happens with RTNeuron, but only with DB compounds. I've commented out the override of eq::Channel::frameAssemble in both and they don't segfault anymore.
The segfaults above are from an assertion which should be harmless in release mode. I'll fix this one in any case, hopefully tomorrow.
@hernando: Please verify fix and close bug
4 out of 10 tries deadlocked, another one aborted. For the aborted one I only have the trace:
Ok, that's a different one I didn't see. Investigating...
The crash seems harmless, it's just a sanity check for reference counting/deletion. The deadlock happens here, within my favourite driver:
This trace: https://gist.github.com/4066441#file_client.deadlock.trace
This is again async readback related. Can you reproduce it with async readback off?
On 14/11/12 14:20, Stefan Eilemann wrote:
Remove EQ_COMPRESSOR_USE_ASYNC_DOWNLOAD from compressorReadDrawPixels.cpp
Daniel and I have a hypothesis: that behavior is expected, since none of the contexts in the wglShareLists call may be current in a different thread than the one calling wglShareLists. That important piece of information is missing from the MSDN. [http://www.opengl.org/discussion_boards/showthread.php/152648-wglShareLists-failing] The GLX context sharing likely has a similar constraint.
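A minimal sketch of that constraint on the GLX side, assuming the same rule as for wglShareLists applies; the function and variable names are illustrative, not Equalizer API:

```cpp
#include <GL/glx.h>

// Hypothetical helper: create the async-transfer context that shares objects
// with the render context. The (undocumented) precondition discussed above is
// that 'shareCtx' must not be current in any other thread while this call
// runs, so the render thread has to release it first, e.g. with
// glXMakeCurrent( display, None, 0 ).
GLXContext createTransferContext( Display* display, GLXFBConfig config,
                                  GLXContext shareCtx )
{
    // Safe only once no other thread holds 'shareCtx' current:
    return glXCreateNewContext( display, config, GLX_RGBA_TYPE, shareCtx,
                                True /* direct rendering */ );
}
```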
After disabling async readback the test doesn't deadlock anymore. The original issue can be considered solved.
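For reference, a simplified stand-in for the compile-time switch mentioned above; the real code lives in compressorReadDrawPixels.cpp and differs in detail:

```cpp
#include <GL/gl.h>

// Simplified stand-in for the readback path gated by
// EQ_COMPRESSOR_USE_ASYNC_DOWNLOAD: with the define removed, the code falls
// back to a blocking glReadPixels instead of the asynchronous download.
void readbackColor( const int x, const int y, const int w, const int h,
                    void* pixels )
{
#ifdef EQ_COMPRESSOR_USE_ASYNC_DOWNLOAD
    // asynchronous path: start a PBO download here and map the buffer later
    // from the transfer thread (omitted in this sketch)
#else
    // synchronous path: blocks until the pixel data has been read back
    glReadPixels( x, y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels );
#endif
}
```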
I respectfully disagree, the async readback should work. Did you get a chance to test 39e45bf with async readbacks on?
I was just saying that the issue originally reported has been solved and this ticket has diverted into a different issue; shouldn't it be renamed then?
Now it's unlikely and slightly different:
#0 0x00007f497bd95e23 in __GI___poll (fds=, nfds=, timeout=) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00007f4977c6d4f2 in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#2 0x00007f4977c6eaaf in xcb_wait_for_reply () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#3 0x00007f497b76cb7d in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#4 0x00007f497d6287d8 in ?? () from /usr/lib/libGL.so.1
#5 0x00007f497d629bab in ?? () from /usr/lib/libGL.so.1
#6 0x00007f497daf9873 in eq::glx::Window::makeCurrent (this=0x7f49703cef40, cache=true) at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/glx/window.cpp:709
#7 0x00007f497dae40c7 in eq::Window::makeCurrentTransfer (this=0x7f497019bf90, useCache=true) at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/window.cpp:528
#8 0x00007f497da45e0f in eq::Channel::_cmdFinishReadback (this=0x7f4970337340, cmd=...) at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/channel.cpp:2164
The deadlock is:
This does not smell good. Right now the only option I can think of is to downgrade the drivers to 270. XInitThreads is called, so I don't see what we're doing wrong here.
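For context, XInitThreads has to be the very first Xlib call in the process; a minimal sketch of the required ordering (the application structure here is illustrative):

```cpp
#include <X11/Xlib.h>
#include <cstdlib>

int main( int, char** )
{
    // Must precede any other Xlib call, otherwise Xlib's internal locking is
    // not set up and concurrent GLX use from multiple threads can deadlock.
    if( !XInitThreads( ))
        return EXIT_FAILURE;

    Display* display = XOpenDisplay( 0 );
    if( !display )
        return EXIT_FAILURE;

    // ... start render and transfer threads, run the application ...

    XCloseDisplay( display );
    return EXIT_SUCCESS;
}
```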
Do you see similar issues in RTNeuron/eVolve with multiprocess mode?
This is indeed a multiprocess configuration: the rendering client and the application node are separate processes, and I've also seen the issue with RTNeuron. Another important detail is that the tests have been done on a single-GPU machine; I haven't tested multi-GPU.
Ah, the two 'GPU' threads are render and transfer. Bummer.
… thread instead of transfer thread
Conflicts:
  libs/eq/client/channel.cpp
  libs/eq/client/pipe.cpp
transfer window context temporarily current in draw thread
It looks like the driver realizes the context on the first makeCurrent (hmm, reverse engineering); we are still observing the deadlock occasionally, so this 'useless' makeCurrent hopefully solves the issue.
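A rough sketch of the workaround described above: touch the transfer context once from the draw thread so the driver realizes it before the transfer thread uses it. Names are illustrative, not the actual Equalizer code.

```cpp
#include <GL/glx.h>

// Hypothetical helper, called from the draw thread right after the shared
// transfer context has been created.
void realizeTransferContext( Display* display, GLXDrawable drawable,
                             GLXContext transferCtx, GLXContext drawCtx )
{
    // The 'useless' makeCurrent: forces the driver to fully realize the
    // transfer context in this thread ...
    glXMakeCurrent( display, drawable, transferCtx );
    // ... and release it again so the transfer thread can bind it later.
    glXMakeCurrent( display, None, 0 );

    // Restore the draw context and continue rendering as before.
    glXMakeCurrent( display, drawable, drawCtx );
}
```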
@hernando: Please test again whenever it's convenient.
Tried again (around 25 times):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007fe62d628200 in _L_lock_928 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe62d628099 in __pthread_mutex_lock (mutex=0x7fe62e6cc860) at pthread_mutex_lock.c:82
#3 0x00007fe62e46a042 in ?? () from /usr/lib/libGL.so.1
#4 0x00007fe62e443217 in ?? () from /usr/lib/libGL.so.1
#5 0x00007fe62e439013 in glXSwapBuffers () from /usr/lib/libGL.so.1
#6 0x00007fe62e916b07 in eq::glx::Window::swapBuffers (this=0x11c8b40) at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/glx/window.cpp:720
#7 0x00007fe62e9014db in eq::Window::swapBuffers (this=0x13864c0) at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/window.cpp:591
#8 0x000000000045fcbd in eVolve::Window::swapBuffers (this=0x13864c0) at /home/jhernando/bbp/Buildyard/src/Equalizer/examples/eVolve/window.cpp:144
And one seemingly harmless abort at exit (I don't have full backtraces for this one):
32056 Main /home/jhernando/bbp/Buildyard/src/Collage/co/commandQueue.cpp:52 5149 Flushing non-empty command queue
32056 Main /home/jhernando/bbp/Buildyard/src/Collage/co/localNode.cpp:448 5149 Assert: connection->getRefCount()==2 || connection->getDescription()->type >= co::CONNECTIONTYPE_MULTICAST [3: Connection 0x7f71fc01d3c0 type N2co16SocketConnectionE state closed description TCPIP#102400#localhost##45168#default#], in:
lunchbox::abort()
co::LocalNode::removeListeners(std::vector, std::allocator > > const&)
eq::Config::notifyDetach()
co::ObjectStore::unmapObject(co::Object*)
co::LocalNode::unmapObject(co::Object*)
eq::fabric::Server, eq::fabric::ElementVisitor >, eq::fabric::ElementVisitor >, eq::fabric::ElementVisitor > > > > > >::_cmdDestroyConfig(co::ICommand&)
co::CommandFunc::operator()(co::ICommand&)
co::ICommand::operator()()
eq::fabric::Client::processCommand(unsigned int)
eq::Server::releaseConfig(eq::Config*)
eVolve::EVolve::run()
../../Equalizer/bin/eVolve(main+0x1c0) [0x456e70]
__libc_start_main
The remaining deadlock looks like a driver issue. Will move to 1.6 milestone. |
…ale#177; create shared context from render thread instead of async fetch thread
@hernando: While testing the 270/310 drivers, can you also report if this issue is reproducible?
I've tested eVolve more than 30 times with 310.32 and I couldn't reproduce it. |
Closing, seems like a driver issue. |
Part of the backtrace of the aborting thread in the app:
...
#5 0x00007fa411f8c101 in eq::Image::~Image (this=0x7fa4040324e0, __in_chrg=)
#6 0x00007fa411f8c144 in eq::Image::~Image (this=0x7fa4040324e0, __in_chrg=)
#7 0x00007fa411f82e1a in eq::FrameData::~FrameData (this=0x14b9eb0, __in_chrg=)
#8 0x00007fa411f8305e in eq::FrameData::~FrameData (this=0x14b9eb0, __in_chrg=)
...
#18 0x00007fa411fd8790 in eq::Window::_cmdDestroyChannel (this=0x7fa40402cd00, cmd=...)
...
The rendering client closes its pipe, but stays alive after the crash.