Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dashboard occasionally throws "Task exception was never retrieved" errors on Windows #1216

Open
cjbe opened this issue Dec 12, 2018 · 34 comments

Comments

@cjbe
Copy link
Contributor

cjbe commented Dec 12, 2018

Bug Report

One-Line Summary

The dashboard occasionally throws "Task exception was never retrieved" errors on Windows, even with no user activity.

Issue Details

These errors seem to occur occasionally (~10 per day). They occur even without any user activity. For example, on a dashboard which has some applets open plotting actively changing datasets, but without any user button presses, or any applets being opened or closed.

Steps to Reproduce

No known reproduction

Actual (undesired) Behavior

The traceback is always like this:

quamash.QEventLoop:Task exception was never retrieved
future: <Task finished coro=<AsyncioParentComm._autoclose() done, defined at C:\Users\ballance\AppData\Local\Continuum\anaconda3\envs\artiq\lib\site-packages\artiq\protocols\pipe_ipc.py:118> exception=OSError(9, 'The handle is invalid', None, 6, None)>
Traceback (most recent call last):
  File "C:\Users\ballance\AppData\Local\Continuum\anaconda3\envs\artiq\lib\asyncio\tasks.py", line 240, in _step
    result = coro.send(None)
  File "C:\Users\ballance\AppData\Local\Continuum\anaconda3\envs\artiq\lib\site-packages\artiq\protocols\pipe_ipc.py", line 120, in _autoclose
    self.server[0].close()
  File "C:\Users\ballance\AppData\Local\Continuum\anaconda3\envs\artiq\lib\asyncio\windows_events.py", line 283, in close
    pipe.close()
  File "C:\Users\ballance\AppData\Local\Continuum\anaconda3\envs\artiq\lib\asyncio\windows_utils.py", line 157, in close
    CloseHandle(self._handle)
OSError: [WinError 6] The handle is invalid

Your System

Using Artiq from master

@jordens
Copy link
Member

jordens commented Dec 13, 2018

I haven't seen this. Also on windows.

@hartytp
Copy link
Collaborator

hartytp commented May 10, 2019

Sometimes after an experiment terminates with an exception I see

asyncio:Task exception was never retrieved
future: <Task finished coro=<DatasetNamespaces.finish_rid.<locals>.gc() done, defined at /home/ion/artiq/artiq/artiq/master/databases.py:125> exception=KeyError('datasets_rid_11065',)>
Traceback (most recent call last):
  File "/home/ion/anaconda3/envs/artiq-main/lib/python3.5/asyncio/tasks.py", line 240, in _step
    result = coro.send(None)
  File "/home/ion/artiq/artiq/artiq/master/databases.py", line 127, in gc
    self._publisher.remove_notifier(_rid_namespace_name(rid))
  File "/home/ion/artiq/artiq/artiq/protocols/sync_struct.py", line 306, in remove_notifier
    notifier = self.notifiers[name]
KeyError: 'datasets_rid_11065

@jordens
Copy link
Member

jordens commented May 10, 2019

Different issue? Windows vs Linux, pipe_rpc vs sync_struct, different errors, different code area, different tools and components.

@hartytp
Copy link
Collaborator

hartytp commented May 10, 2019

Yes, it is (and I'll break it out in a second). Just posting here as a reminder that the frequency of the occurance of these events is due to the fact that it has more than one cause...

@dnadlinger
Copy link
Collaborator

The latter is due to a feature not merged upstream (yet), although I thought I had fixed that particular issue a while back.

@hartytp
Copy link
Collaborator

hartytp commented May 10, 2019

I think this is a race condition inside asyncio....

I hacked my local copy of asyncio as follows

class PipeHandle:
    """Wrapper for an overlapped pipe handle which is vaguely file-object like.

    The IOCP event loop can use these instead of socket objects.
    """
    def __init__(self, handle):
        print("new pipe handle: {}".format(handle))
        self._handle = handle
        self._closed = False

    def __repr__(self):
        if self._handle is not None:
            handle = 'handle=%r' % self._handle
        else:
            handle = 'closed'
        return '<%s %s>' % (self.__class__.__name__, handle)

    @property
    def handle(self):
        return self._handle

    def fileno(self):
        if self._handle is None:
            raise ValueError("I/O operatioon on closed pipe")
        return self._handle

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            print("close pipe handle: {} {}".format(self._handle, self._closed))
            try:
                print(CloseHandle)
                CloseHandle(self._handle)
            except Exception as e:
                print("wtf {} {} {}".format(e, self._handle, self._closed))
                raise
            print("closed")
            self._closed = True
            self._handle = None

    def __del__(self):
        if self._handle is not None:
            warnings.warn("unclosed %r" % self, ResourceWarning)
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, t, v, tb):
        self.close()

and I see...

opened server pipe 2476
new pipe handle: 2476
new pipe handle: 2484
new pipe handle: 2488
opened server pipe 2536
new pipe handle: 2536
opened server pipe 2544
new pipe handle: 2544
INFO:applet(422 frequency):print:new pipe handle: 1176
QWindowsWindow::setGeometry: Unable to set geometry 200x130+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 200x130+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 200x130+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 200x130+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
INFO:applet(422 frequency):print:close pipe handle: 1176 False
INFO:applet(422 frequency):print:<built-in function CloseHandle>
INFO:applet(422 frequency):print:closed
close pipe handle: 2484 False
<built-in function CloseHandle>
closed
close pipe handle: 2488 False
<built-in function CloseHandle>
closed
close pipe handle: 2544 False
<built-in function CloseHandle>
close pipe handle: 2536 False
<built-in function CloseHandle>
closed
closed
close pipe handle: 2536 False
<built-in function CloseHandle>
wtf CloseHandle() argument must be int, not None None True
ERROR:dashboard:quamash._QEventLoop:Task exception was never retrieved
future: <Task finished coro=<AsyncioParentComm._autoclose() done, defined at C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py:118> exception=TypeError('CloseHandle() argument must be int, not None',)>
Traceback (most recent call last):
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\tasks.py", line 240, in _step
    result = coro.send(None)
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py", line 120, in _autoclose
    self.server[0].close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_events.py", line 284, in close
    pipe.close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_utils.py", line 162, in close
    CloseHandle(self._handle)
T

I guess this is a bug to do with python's lazy evaluation or something like that? It looks like by the time self._handle is captured for the call to CloseHandle it's already been set to None.

@cjbe can you try patching your local asyncio here https://github.com/python/cpython/blob/a2fedd8c910cb5f5b9bd568d6fd44d63f8f5cfa5/Lib/asyncio/windows_utils.py#L105-L108

to be something like:

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            handle = self._handle
            self._handle = None
            CloseHandle(handle)

@hartytp
Copy link
Collaborator

hartytp commented May 10, 2019

@cjbe @dnadlinger can you apply that patch locally and confirm that it fixes the error? If so, I'll close this issue and move the discussion to asyncio.

@dnadlinger
Copy link
Collaborator

@hartytp: Fixed the RID namespacing-related issue in our local fork.

As for the handle closing issue, doesn't this look more like a race condition between (OS) threads? If that is the case, your patch would only paper over the actual problem. (There shouldn't be any lazy evaluation of that kind in Python.)

@dnadlinger
Copy link
Collaborator

Did you stumble over a good way of reproducing this? Repeatedly opening/closing a bunch of applets maybe?

@dnadlinger
Copy link
Collaborator

(Maybe it's actually our, i.e. dashboard, code calling close() from different threads after all – an awkwardly straightforward way of debugging this would be to, in close(), save backtraces into a thread-safe global buffer, and fetch/print all of them for a given when the above issue occurs.)

@hartytp
Copy link
Collaborator

hartytp commented May 14, 2019

Did you stumble over a good way of reproducing this? Repeatedly opening/closing a bunch of applets maybe?

Yes, that basically does it.

As for the handle closing issue, doesn't this look more like a race condition between (OS) threads? If that is the case, your patch would only paper over the actual problem. (There shouldn't be any lazy evaluation of that kind in Python.)

hmmm...I must have misread the log I posted. The original thing I was looking for was a race where a PipeHandle is closed twice, which is what the log clearly shows is occurring. Somehow I convinced myself that wasn't the case.

So it is a race between threads.

@dnadlinger
Copy link
Collaborator

dnadlinger commented May 14, 2019

Which two threads would that be, though? Or does CloseHandle itself cause some callbacks to be invoked? (Or e.g. the object to be deleted indirectly?) It's been too long since I've had to deal with IOCP…

@hartytp
Copy link
Collaborator

hartytp commented May 14, 2019

That's a good question. I haven't figured it out.

AFAICT the handle is only closed by the AsyncioParentComm._auto_close task.

async def _autoclose(self):
await self.process.wait()
self.server[0].close()
del self.server
if self.ready.is_set():
self.writer.close()
del self.reader
del self.writer
That task is created in AsyncioParentComm._auto_close here
async def create_subprocess(self, *args, **kwargs):
loop = asyncio.get_event_loop()
def factory():
reader = asyncio.StreamReader(loop=loop, limit=100*1024*1024)
protocol = asyncio.StreamReaderProtocol(reader,
self._child_connected,
loop=loop)
return protocol
self.server = await loop.start_serving_pipe(
factory, self.address)
self.process = await asyncio.create_subprocess_exec(
*args, **kwargs)
asyncio.ensure_future(self._autoclose())

So, unless there is some other code path I'm not aware of (when I have time, I'll hack in a stack trace each time we close a handle to rule this out), it looks like create_subprocess is being called multiple times on the same AsyncioParentComm object or something like that. If that happened, I can see how a race could occur.

@dnadlinger
Copy link
Collaborator

The handle is also close()d by its __del__, but that shouldn't be invoked while pipe_ipc still has a reference…

@hartytp
Copy link
Collaborator

hartytp commented May 14, 2019

Looking at the trace, I think this race is probably in a single thread (which seems most likely given the original asyncio code). I'd bet a beer that it's a race in the applet code that's scheduling two autoclose events...

@dnadlinger
Copy link
Collaborator

But then (i.e. if it's a single thread) how would self._handle change between time of check and time of use?

@hartytp
Copy link
Collaborator

hartytp commented May 14, 2019

sigh...sorry, you're right. It's late.

Well, then I'm back to not understanding this at all. The next step I can think of would be to add a stack trace print to the PipeHandle.close to verify whether _auto_close is indeed being called twice, or if it's actually some other code path I haven't spotted yet. I'll do that when I have time.

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

Improved the diagnostics to show which thread the calls are coming from:

close pipe handle 2872
close pipe handle 2872
closed handle 2872 in thread 6172
exception closing handle 2872 in thread 9636
[WinError 6] The handle is invalid
Exception ignored in: <bound method PipeHandle.__del__ of <PipeHandle closed>>
Traceback (most recent call last):
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_utils.py", line 174, in __del__
close pipe handle 2792
closed handle 2792 in thread 6172
    self.close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_utils.py", line 161, in close
    CloseHandle(self._handle)
OSError: [WinError 6] The handle is invalid
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
INFO:applet(Monitor highna_apd0-PMT ratio):print:close pipe handle 1132
INFO:applet(Monitor highna_apd0-PMT ratio):print:closed handle 1132 in thread 10300
close pipe handle 2584
closed handle 2584 in thread 6172
close pipe handle 2592
closed handle 2592 in thread 6172
close pipe handle 2848
closed handle 2848 in thread 6172
close pipe handle 2652
closed handle 2652 in thread 6172
close pipe handle 2580
closed handle 2580 in thread 6172
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
INFO:applet(Monitor highna_apd1):print:close pipe handle 1136
INFO:applet(Monitor highna_apd1):print:closed handle 1136 in thread 11440
close pipe handle 2620
closed handle 2620 in thread 6172
close pipe handle 2608
closed handle 2608 in thread 6172
close pipe handle 2800
closed handle 2800 in thread 6172
close pipe handle 2520
closed handle 2520 in thread 6172
close pipe handle 2588
closed handle 2588 in thread 6172
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
INFO:applet(Monitor highna_apd3-PMT ratio):print:close pipe handle 1196
INFO:applet(Monitor highna_apd3-PMT ratio):print:closed handle 1196 in thread 604
close pipe handle 2736
closed handle 2736 in thread 6172
close pipe handle 2808
closed handle 2808 in thread 6172
close pipe handle 2876
closed handle 2876 in thread 6172
close pipe handle 2896
closed handle 2896 in thread 6172
close pipe handle 2820
closed handle 2820 in thread 6172
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x175+2169+1081 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
INFO:applet(Monitor highna_apd2-PMT ratio):print:close pipe handle 1200
INFO:applet(Monitor highna_apd2-PMT ratio):print:closed handle 1200 in thread 2684
close pipe handle 2748
closed handle 2748 in thread 6172
close pipe handle 2756
closed handle 2756 in thread 6172
close pipe handle 2888
close pipe handle 2740
closed handle 2888 in thread 6172
close pipe handle 2740
closed handle 2740 in thread 9636
exception closing handle 2740 in thread 6172
[WinError 6] The handle is invalid
ERROR:dashboard:quamash._QEventLoop:Task exception was never retrieved
future: <Task finished coro=<AsyncioParentComm._autoclose() done, defined at C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py:118> exception=OSError(9, 'The handle is invalid', None, 6, None)>
Traceback (most recent call last):
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\tasks.py", line 240, in _step
    result = coro.send(None)
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py", line 120, in _autoclose
    self.server[0].close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_events.py", line 284, in close
    pipe.close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_utils.py", line 161, in close
    CloseHandle(self._handle)
OSError: [WinError 6] The handle is invalid

So, we definitely see the PipeHandles being closed from two threads...

class PipeHandle:
    """Wrapper for an overlapped pipe handle which is vaguely file-object like.

    The IOCP event loop can use these instead of socket objects.
    """
    def __init__(self, handle):
        print("new pipe handle: {}".format(handle))
        self._handle = handle

    def __repr__(self):
        if self._handle is not None:
            handle = 'handle=%r' % self._handle
        else:
            handle = 'closed'
        return '<%s %s>' % (self.__class__.__name__, handle)

    @property
    def handle(self):
        return self._handle

    def fileno(self):
        if self._handle is None:
            raise ValueError("I/O operatioon on closed pipe")
        return self._handle

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            print("close pipe handle {}".format(self._handle))
            try:
                CloseHandle(self._handle)
            except Exception as e:
                print("exception closing handle {} in thread {}"
                      .format(self._handle, threading.current_thread().ident))
                print(e)
                raise
            print("closed handle {} in thread {}, trace: {}"
                  .format(self._handle, threading.current_thread().ident,
                          traceback.format_stack()))
            self._handle = None

    def __del__(self):
        if self._handle is not None:
            warnings.warn("unclosed %r" % self, ResourceWarning)
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, t, v, tb):
        self.close()

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

With stack traces:

INFO:dashboard:root:ARTIQ dashboard 5.0.dev+1259.ge86e2bbe connected to Alice (10.255.6.191)
opened server pipe 2472
new pipe handle: 2472
new pipe handle: 2480
new pipe handle: 2484
opened server pipe 2520
new pipe handle: 2520
new pipe handle: 2516
new pipe handle: 2536
opened server pipe 2560
new pipe handle: 2560
opened server pipe 2556
new pipe handle: 2556
close pipe handle 2480
closed handle 2480 in thread 3132, trace: ['  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 245, in <module>\n    main()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 242, in main\n    loop.run_until_complete(main_window.exit_request.wait())\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 281, in run_until_complete\n    self.run_forever()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 266, in run_forever\n    rslt = self.__app.exec_()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 207, in timerEvent\n    handle._run()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\events.py", line 127, in _run\n    self._callback(*self._args)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\proactor_events.py", line 136, in _call_connection_lost\n    self._sock.close()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 170, in close\n    traceback.format_stack()))\n']
close pipe handle 2560
close pipe handle 2472
closed handle 2560 in thread 3132, trace: ['  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 245, in <module>\n    main()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 242, in main\n    loop.run_until_complete(main_window.exit_request.wait())\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 281, in run_until_complete\n    self.run_forever()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 266, in run_forever\n    rslt = self.__app.exec_()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 207, in timerEvent\n    handle._run()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\events.py", line 127, in _run\n    self._callback(*self._args)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\tasks.py", line 240, in _step\n    result = coro.send(None)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\protocols\\pipe_ipc.py", line 120, in _autoclose\n    self.server[0].close()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_events.py", line 284, in close\n    pipe.close()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 170, in close\n    traceback.format_stack()))\n']
closed handle 2472 in thread 10720, trace: ['  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\_windows.py", line 159, in run\n    events = self.__proactor.select(0.01)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\_windows.py", line 65, in select\n    self._poll(timeout)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 176, in __del__\n    self.close()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 170, in close\n    traceback.format_stack()))\n']
close pipe handle 2472
exception closing handle None in thread 3132
CloseHandle() argument must be int, not None
ERROR:dashboard:quamash._QEventLoop:Task exception was never retrieved
future: <Task finished coro=<AsyncioParentComm._autoclose() done, defined at C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py:118> exception=TypeError('CloseHandle() argument must be int, not None',)>
Traceback (most recent call last):
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\tasks.py", line 240, in _step
    result = coro.send(None)
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\site-packages\artiq\protocols\pipe_ipc.py", line 120, in _autoclose
    self.server[0].close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_events.py", line 284, in close
    pipe.close()
  File "C:\Users\LabUser\Miniconda3\envs\artiq-env\lib\asyncio\windows_utils.py", line 162, in close
    CloseHandle(self._handle)
TypeError: CloseHandle() argument must be int, not None
INFO:applet(Camera):print:C:\Users\LabUser\Miniconda3\envs\artiq-env\python.exe: No module named oxart.applets.camera_viewer
close pipe handle 2484
closed handle 2484 in thread 3132, trace: ['  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 245, in <module>\n    main()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\artiq\\frontend\\artiq_dashboard.py", line 242, in main\n    loop.run_until_complete(main_window.exit_request.wait())\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 281, in run_until_complete\n    self.run_forever()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 266, in run_forever\n    rslt = self.__app.exec_()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\__init__.py", line 207, in timerEvent\n    handle._run()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\events.py", line 127, in _run\n    self._callback(*self._args)\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\proactor_events.py", line 136, in _call_connection_lost\n    self._sock.close()\n', '  File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 170, in close\n    traceback.format_stack()))\n']
opened server pipe 1880
new pipe handle: 1880
INFO:applet(Monitor highna_apd2):print:new pipe handle: 1120
QWindowsWindow::setGeometry: Unable to set geometry 391x70+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x70+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x70+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
QWindowsWindow::setGeometry: Unable to set geometry 391x70+0+0 on QWindow/''. Resulting geometry:  640x480+960+465 (frame: 0, 0, 0, 0, custom margin: 0, 0, 0, 0, minimum size: 0x0, maximum size: 16777215x16777215).
class PipeHandle:
    """Wrapper for an overlapped pipe handle which is vaguely file-object like.

    The IOCP event loop can use these instead of socket objects.
    """
    def __init__(self, handle):
        print("new pipe handle: {}".format(handle))
        self._handle = handle

    def __repr__(self):
        if self._handle is not None:
            handle = 'handle=%r' % self._handle
        else:
            handle = 'closed'
        return '<%s %s>' % (self.__class__.__name__, handle)

    @property
    def handle(self):
        return self._handle

    def fileno(self):
        if self._handle is None:
            raise ValueError("I/O operatioon on closed pipe")
        return self._handle

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            print("close pipe handle {}".format(self._handle))
            try:
                CloseHandle(self._handle)
            except Exception as e:
                print("exception closing handle {} in thread {}"
                      .format(self._handle, threading.current_thread().ident))
                print(e)
                raise
            print("closed handle {} in thread {}, trace: {}"
                  .format(self._handle, threading.current_thread().ident,
                          traceback.format_stack()))
            self._handle = None

    def __del__(self):
        if self._handle is not None:
            warnings.warn("unclosed %r" % self, ResourceWarning)
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, t, v, tb):
        self.close()

So, the issue is that the pipe is being closed from proactor_events._call_connection_lost which is called in a different thread to the main thread of execution.

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

I suspect the issue is that _force_close is being called, which schedules a close event in a separate thread.

https://github.com/python/asyncio/blob/a791d884b9110b59bada18c2dddd399e6adc40fd/asyncio/proactor_events.py#L111-L124

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

__del__ being called outside of the main thread of execution:

closed handle 2928 in thread 2516, trace: 
- File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\_windows.py", line 159, in run\n    events = self.__proactor.select(0.01)\n
- File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\site-packages\\Quamash-0.6.1-py3.5.egg\\quamash\\_windows.py", line 65, in select\n    self._poll(timeout)\n
- File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 176, in __del__\n    self.close()\n
- File "C:\\Users\\LabUser\\Miniconda3\\envs\\artiq-env\\lib\\asyncio\\windows_utils.py", line 170, in close\n    traceback.format_stack()))\n

This seems to be the most common race path.

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

So, it looks like a race where __del__ is in the process of executing in a different thread (python gc?) while our _auto_close method runs. I don't fully understand python weak refs, but I would have expected the use of a weakref here to prevent this

    def close(self):
        if self._accept_pipe_future is not None:
            self._accept_pipe_future.cancel()
            self._accept_pipe_future = None
        # Close all instances which have not been connected to by a client.
        if self._address is not None:
            for pipe in self._free_instances:
                pipe.close()
            self._pipe = None
            self._address = None
            self._free_instances.clear()

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

@dnadlinger I'm out of time to look into this. I can reproduce it pretty reliably if I restart the dashboard with some applets already open from the last time I ran it, then close the applets promptly after startup.

I still don't understand exactly where __del__ is being called, and why that doesn't invalidate the weak ref. I suspect it's probably something to do with quamash, and maybe this code isn't as threadsafe as it claims...

https://github.com/OxfordIonTrapGroup/quamash/blob/b006c9a163f55aba044a9ad8532c65c13f35121f/quamash/_windows.py#L84

@dnadlinger
Copy link
Collaborator

@sbourdeauducq: Any ideas/memories from when you wrote the ARTIQ side of things originally?

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

I guess a reference to the pipe is being held in some other thread, and released at some point, triggering the garbage collection race. But, I don't know enough about python gc to debug that.

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

Does anyone actually understand python garbage collection and weak references? When the last (strong) reference to an object goes out of scope, when are the weak references to it invalidated? When the gc calls __del__? If so, which thread does that occur in? and, how is one supposed to avoid races?

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

The handle is also close()d by its del, but that shouldn't be invoked while pipe_ipc still has a reference

@dnadlinger FWIW, __del__ is invoced regularly when self._handle is not None. Asyncio has some warning code there, but it doesn't make it to the console. If you add a one-line print you'll see lots of cases of that, so no surprise we see occasional exceptions due to a race.

The traceback from __del__ always seems to look like #1216 (comment) however, I don't understand that execution flow. Is calling self__.poll triggering the python gc? If so, how does one go about debugging this kind of issue? I can't see any decent way of tracking the references to the pipe and figuring out what the actual problem is.

@dnadlinger
Copy link
Collaborator

Regarding the warning not being shown, ResourceWarning are filtered out by default (https://docs.python.org/3/library/warnings.html#default-warning-filter).

@hartytp
Copy link
Collaborator

hartytp commented May 15, 2019

@dnadlinger yes.

FWIW, if this is a gc issue then it would have to be a cyclic reference since it's not being cleaned promptly. So, we might be able to find it using something like: http://code.activestate.com/recipes/523004-find-cyclical-references/ If it's not a cyclic reference thing then I'm at a loss to explain the stack trace I posted where we jump from the poll to the PipeHandle destructor.

Edit: or simpler, turn garbage collection off (gc.set_debug(gc.DEBUG_SAVEALL))and each time we close pipe handles check whether the pipe is in the garbage pile

gc.collect()
for item in gc.garbage:
    if id(item) == id(self):
        raise Exception("Circular reference to PipeHandle")

@hartytp
Copy link
Collaborator

hartytp commented May 16, 2019

Nope, not garbage collection at all. What's happening is this: quamash processes events in a separate thread. Those events can store a reference to the handle. When they are released, for example here https://github.com/OxfordIonTrapGroup/quamash/blob/b006c9a163f55aba044a9ad8532c65c13f35121f/quamash/__init__.py#L210 __del__ gets called from that thread, leading to a possible race with the main program thread.

@hartytp
Copy link
Collaborator

hartytp commented May 16, 2019

@sbourdeauducq summarising this then:

  1. There issue is definitely a race in PipeHandle closure, as a result of __del__ being called from outside of the main python thread that our code runs in.
  2. More specifically, it seems to be due to references to the PipeHandle being released by the quamash event loop after the weakref validity has been checked in PipeServer.close, but before PipeHandle.close has completed.
  3. For example, I've seen this by adding diagnostic prints to quamash SimpleTimer class here and seen that after the timer releases its reference to the handle, the __del__ method is called from the same thread as the simple timer (which is a different thread to our code).
  4. This is a pain to debug since quamash doesn't use python threads so I can't get decent stack traces
  5. I've filed an issue with quamash PipeServer closure race harvimt/quamash#115

So, all in all, it seems pretty clear that this is a quamash issue and not an ARTIQ bug. Since these errors aren't harmful (just a Pipe being double closed), I think the best bet is to catch the exception in __auto_close and ignore it for the time being. If you're happy with that, I'll put in a PR.

@hartytp
Copy link
Collaborator

hartytp commented May 16, 2019

Actually, one thing that might fix this is modifying _autoclose to lock the qt thread, something like:

loop = asyncio.get_event_loop()
if not loop.is_running():
    self.server[0].close()
else:
    with QtCore.QMutexLocker(loop._proactor._lock):
        self.server[0].close()

Next time I can reproduce this issue, I'll try that...

@sbourdeauducq
Copy link
Member

Is this still an issue with the new Python 3.7 packages?

@sbourdeauducq
Copy link
Member

Assuming it isn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants