Occasional async readback deadlocks on GLX #177

Closed
hernando opened this issue Nov 13, 2012 · 25 comments

@hernando

Part of the backtrace of the aborting thread in the app:
...
#5 0x00007fa411f8c101 in eq::Image::~Image (this=0x7fa4040324e0, __in_chrg=)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/image.cpp:228
#6 0x00007fa411f8c144 in eq::Image::~Image (this=0x7fa4040324e0, __in_chrg=)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/image.cpp:229
#7 0x00007fa411f82e1a in eq::FrameData::~FrameData (this=0x14b9eb0, __in_chrg=)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/frameData.cpp:67
#8 0x00007fa411f8305e in eq::FrameData::~FrameData (this=0x14b9eb0, __in_chrg=)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/frameData.cpp:73
...
#18 0x00007fa411fd8790 in eq::Window::_cmdDestroyChannel (this=0x7fa40402cd00, cmd=...)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/window.cpp:791
...

The rendering client closes its pipe, but stays alive after the crash.

@ghost ghost assigned eile Nov 13, 2012
@eile
Member

eile commented Nov 13, 2012

I solved a couple of these for 1.4 already, will have a look at it.

@hernando
Author

It also happens with RTNeuron, but only with DB compounds. I've commented out the override of eq::Channel::frameAssemble in both, and then they no longer segfault.

@eile
Member

eile commented Nov 13, 2012

The segfaults above are from an assertion which should be harmless in release mode. I'll fix this one in any case, hopefully tomorrow.

@eile
Member

eile commented Nov 14, 2012

@hernando: Please verify fix and close bug

@eile
Member

eile commented Nov 14, 2012

Ok, that's a different one I didn't see. Investigating...

@eile
Member

eile commented Nov 14, 2012

The crash seems harmless, it's just a sanity check for reference counting/deletion.

The deadlock happens here, within my favourite driver:

Thread 2 (Thread 0x7f37d78ff700 (LWP 10088)):
#0  0x00007f37e3a6ae23 in __GI___poll (fds=<optimized out>, 
    nfds=<optimized out>, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f37df9424f2 in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#2  0x00007f37df943aaf in xcb_wait_for_reply ()
   from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#3  0x00007f37e3441b7d in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#4  0x00007f37e532b3b1 in ?? () from /usr/lib/libGL.so.1
#5  0x00007f37e52fdc38 in ?? () from /usr/lib/libGL.so.1
#6  0x00007f37e52f2bfc in glXCreateNewContext () from /usr/lib/libGL.so.1
#7  0x00007f37e57cd36a in eq::glx::Window::createGLXContext (this=0x1182c50, 

This trace: https://gist.github.com/4066441#file_client.deadlock.trace

@eile
Member

eile commented Nov 14, 2012

This is again async readback related. Can you reproduce it with async rb off?

@hernando
Author

On 14/11/12 14:20, Stefan Eilemann wrote:

This is again async readback related. Can you reproduce it with async rb off?

How do I do it?
I've skimmed over the compilation options with ccmake and taken a look at loader.l, but I haven't found how to do it.

@eile
Member

eile commented Nov 14, 2012

Remove EQ_COMPRESSOR_USE_ASYNC_DOWNLOAD from compressorReadDrawPixels.cpp

Daniel and I have a hypothesis:

That behavior is expected, none of the contexts in the wglShareLists call can be current in a different thread to the one you are calling wglShareLists from. That important piece of information is missing from the MSDN. [http://www.opengl.org/discussion_boards/showthread.php/152648-wglShareLists-failing]

GLX context sharing likely has a similar constraint.
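
The hypothesized constraint can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyContext`, `canCreateSharedContext` and `canCreateSharedContextFromOtherThread` are hypothetical names invented for illustration, no GLX or WGL function is called, and the rule it encodes (a context cannot serve as the share context while it is current in a different thread) is the hypothesis quoted above, not documented driver behavior.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Toy stand-in for a GL context (hypothetical, NOT the GLX API). It only
// records which thread, if any, currently has the context current.
class ToyContext
{
public:
    void makeCurrent()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _owner = std::this_thread::get_id();
    }

    void doneCurrent()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _owner = std::thread::id(); // default id means "no thread"
    }

    // True if a thread other than the caller has this context current.
    bool isCurrentElsewhere() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _owner != std::thread::id() &&
               _owner != std::this_thread::get_id();
    }

private:
    mutable std::mutex _mutex;
    std::thread::id _owner; // default-constructed: not current anywhere
};

// Assumed rule from the hypothesis: a context may only be used as the share
// context if it is not current in a different thread.
bool canCreateSharedContext(const ToyContext& shareContext)
{
    return !shareContext.isCurrentElsewhere();
}

// Runs the same check from a separate "transfer" thread and returns its result.
bool canCreateSharedContextFromOtherThread(const ToyContext& shareContext)
{
    bool allowed = false;
    std::thread transfer([&] { allowed = canCreateSharedContext(shareContext); });
    transfer.join();
    return allowed;
}
```

In this model, while the render context is current in the calling thread, the check passes in that thread and fails in the second thread; once `doneCurrent()` has been called it passes from either thread. If the hypothesis holds, that would argue for creating the shared context in the thread that owns the render context.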

eile pushed a commit that referenced this issue Nov 14, 2012
@hernando
Author

After disabling async readback the test doesn't deadlock anymore. The original issue can be considered solved.

@eile
Member

eile commented Nov 14, 2012

I respectfully disagree; the async readback should work.

Did you get a chance to test 39e45bf with async readbacks on?

@eile eile reopened this Nov 14, 2012
@hernando
Author

I was just saying that the issue originally reported has been solved and this ticket has diverted into a different issue. Shouldn't it be renamed then?

@hernando
Author

Now it's unlikely and slightly different:
I ran it more than a dozen times, and a couple of times I got this:
https://gist.github.com/4066441#file_app_node.deadlock.trace.2
https://gist.github.com/4066441#file_client.deadlock.trace.2

#0 0x00007f497bd95e23 in __GI___poll (fds=,
    nfds=, timeout=)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00007f4977c6d4f2 in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#2 0x00007f4977c6eaaf in xcb_wait_for_reply ()
   from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#3 0x00007f497b76cb7d in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#4 0x00007f497d6287d8 in ?? () from /usr/lib/libGL.so.1
#5 0x00007f497d629bab in ?? () from /usr/lib/libGL.so.1
#6 0x00007f497daf9873 in eq::glx::Window::makeCurrent (this=0x7f49703cef40,
    cache=true)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/glx/window.cpp:709
#7 0x00007f497dae40c7 in eq::Window::makeCurrentTransfer (
    this=0x7f497019bf90, useCache=true)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/window.cpp:528
#8 0x00007f497da45e0f in eq::Channel::_cmdFinishReadback (
    this=0x7f4970337340, cmd=...)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/channel.cpp:2164

@eile
Member

eile commented Nov 14, 2012

The deadlock is:

Thread A: glXSwapBuffers ... __lll_lock_wait
Thread B: eq::glx::Window::makeCurrent ... _XReply ... __GI___poll

This doesn't smell good. Right now the only option I can think of is to downgrade the drivers to 270. XInitThreads is called, so I don't see what we're doing wrong here.

@eile
Member

eile commented Nov 14, 2012

Do you see similar issues in RTNeuron/eVolve with multiprocess mode?

@hernando
Author

This is indeed a multiprocess configuration, the rendering client and application node are separate processes, and I've also seen the issue with RTNeuron.

Another important detail is that the tests were done on a single-GPU machine; I haven't tested on multi-GPU.

@eile
Member

eile commented Nov 14, 2012

Ah, the two 'GPU' threads are render and transfer. Bummer.

eile pushed a commit that referenced this issue Nov 14, 2012
… thread instead of transfer thread

Conflicts:
	libs/eq/client/channel.cpp
	libs/eq/client/pipe.cpp
eile pushed a commit that referenced this issue Nov 16, 2012
transfer window context temporarily current in draw thread

It looks like the driver realizes the context on the first makeCurrent
(hmm, reverse engineering), still observing the deadlock
occasionally. This 'useless' makecurrent hopefully solves this issue.
@eile
Member

eile commented Nov 16, 2012

@hernando: Please test again whenever it's convenient.

@hernando
Author

Tried again (around 25 times).
Once I got a different deadlock. This seems to be an unrelated driver issue.

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007fe62d628200 in _L_lock_928 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fe62d628099 in __pthread_mutex_lock (mutex=0x7fe62e6cc860) at pthread_mutex_lock.c:82
#3  0x00007fe62e46a042 in ?? () from /usr/lib/libGL.so.1
#4  0x00007fe62e443217 in ?? () from /usr/lib/libGL.so.1
#5  0x00007fe62e439013 in glXSwapBuffers () from /usr/lib/libGL.so.1
#6  0x00007fe62e916b07 in eq::glx::Window::swapBuffers (this=0x11c8b40)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/glx/window.cpp:720
#7  0x00007fe62e9014db in eq::Window::swapBuffers (this=0x13864c0)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/libs/eq/client/window.cpp:591
#8  0x000000000045fcbd in eVolve::Window::swapBuffers (this=0x13864c0)
    at /home/jhernando/bbp/Buildyard/src/Equalizer/examples/eVolve/window.cpp:144

And one seemingly harmless abort at exit (I don't have full backtraces for this one):

32056 Main /home/jhernando/bbp/Buildyard/src/Collage/co/commandQueue.cpp:52 5149 Flushing non-empty command queue
32056 Main /home/jhernando/bbp/Buildyard/src/Collage/co/localNode.cpp:448 5149 Assert: connection->getRefCount()==2 || connection->getDescription()->type >= co::CONNECTIONTYPE_MULTICAST [3: Connection 0x7f71fc01d3c0 type N2co16SocketConnectionE state closed description TCPIP#102400#localhost##45168#default#] , in: 
    lunchbox::abort()
    co::LocalNode::removeListeners(std::vector, std::allocator > > const&)
    eq::Config::notifyDetach()
    co::ObjectStore::unmapObject(co::Object*)
    co::LocalNode::unmapObject(co::Object*)
    eq::fabric::Server, eq::fabric::ElementVisitor >, eq::fabric::ElementVisitor >, eq::fabric::ElementVisitor > > > > > >::_cmdDestroyConfig(co::ICommand&)
    co::CommandFunc::operator()(co::ICommand&)
    co::ICommand::operator()()
    eq::fabric::Client::processCommand(unsigned int)
    eq::Server::releaseConfig(eq::Config*)
    eVolve::EVolve::run()
    ../../Equalizer/bin/eVolve(main+0x1c0) [0x456e70]
    __libc_start_main

@eile
Member

eile commented Nov 26, 2012

The remaining deadlock looks like a driver issue. Will move to 1.6 milestone.

tribal-tec added a commit to tribal-tec/Equalizer that referenced this issue Nov 26, 2012
…ale#177; create shared context from render thread instead of async fetch thread
@eile
Member

eile commented Jan 22, 2013

@hernando: While testing the 270/310 drivers, can you also report if this issue is reproducible?

@hernando hernando reopened this Jan 23, 2013
@hernando
Author

I've tested eVolve more than 30 times with 310.32 and I couldn't reproduce it.
By the way, what happened to the rendering client log files?

@eile
Member

eile commented Jan 25, 2013

Closing, seems like a driver issue.
