Dashboard occasionally throws "Task exception was never retrieved" errors on Windows #1216
Comments
I haven't seen this. Also on windows. |
Sometimes after an experiment terminates with an exception I see
|
Different issue? Windows vs Linux, pipe_rpc vs sync_struct, different errors, different code area, different tools and components. |
Yes, it is (and I'll break it out in a second). Just posting here as a reminder that these events occur as often as they do because they have more than one cause... |
The latter is due to a feature not merged upstream (yet), although I thought I had fixed that particular issue a while back. |
I think this is a race condition inside asyncio... I hacked my local copy of asyncio as follows:

class PipeHandle:
    """Wrapper for an overlapped pipe handle which is vaguely file-object like.

    The IOCP event loop can use these instead of socket objects.
    """
    def __init__(self, handle):
        print("new pipe handle: {}".format(handle))
        self._handle = handle
        self._closed = False

    def __repr__(self):
        if self._handle is not None:
            handle = 'handle=%r' % self._handle
        else:
            handle = 'closed'
        return '<%s %s>' % (self.__class__.__name__, handle)

    @property
    def handle(self):
        return self._handle

    def fileno(self):
        if self._handle is None:
            raise ValueError("I/O operatioon on closed pipe")
        return self._handle

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            print("close pipe handle: {} {}".format(self._handle, self._closed))
            try:
                print(CloseHandle)
                CloseHandle(self._handle)
            except Exception as e:
                print("wtf {} {} {}".format(e, self._handle, self._closed))
                raise
            print("closed")
            self._closed = True
            self._handle = None

    def __del__(self):
        if self._handle is not None:
            warnings.warn("unclosed %r" % self, ResourceWarning)
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, t, v, tb):
        self.close()

and I see...
I guess this is a bug to do with Python's lazy evaluation or something like that? It looks like by the time
@cjbe can you try patching your local asyncio here https://github.com/python/cpython/blob/a2fedd8c910cb5f5b9bd568d6fd44d63f8f5cfa5/Lib/asyncio/windows_utils.py#L105-L108 to be something like:

def close(self, *, CloseHandle=_winapi.CloseHandle):
    if self._handle is not None:
        handle = self._handle
        self._handle = None
        CloseHandle(handle)
|
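To make the difference between the two orderings concrete, here is a minimal sketch that runs on any platform. Everything in it is hypothetical scaffolding: `FakeOS` stands in for the OS handle table, and its `interrupt` hook emulates a second caller reaching `close()` before the first call has returned. It does not show where that second caller comes from in the dashboard, and with real threads the clear-before-close ordering only narrows the window unless the swap is also protected by a lock.

```python
class FakeOS:
    """Hypothetical stand-in for the OS handle table (not _winapi)."""

    def __init__(self):
        self.open_handles = set()
        self.interrupt = None  # optional callable run in the middle of a close

    def open(self, handle):
        self.open_handles.add(handle)
        return handle

    def close_handle(self, handle):
        if handle not in self.open_handles:
            raise OSError("the handle is invalid")
        self.open_handles.remove(handle)
        # Emulate a second caller arriving before the first close() returns.
        if self.interrupt is not None:
            hook, self.interrupt = self.interrupt, None
            hook()


class CheckThenClose:
    """Ordering used by the unpatched close(): close first, clear afterwards."""

    def __init__(self, fake_os, handle):
        self._os = fake_os
        self._handle = handle

    def close(self):
        if self._handle is not None:
            self._os.close_handle(self._handle)
            self._handle = None


class ClearThenClose:
    """Ordering used by the proposed patch: clear first, close afterwards."""

    def __init__(self, fake_os, handle):
        self._os = fake_os
        self._handle = handle

    def close(self):
        handle, self._handle = self._handle, None
        if handle is not None:
            self._os.close_handle(handle)


def double_close_fails(wrapper_cls):
    """Return True if an overlapping second close() raises OSError."""
    fake_os = FakeOS()
    pipe = wrapper_cls(fake_os, fake_os.open(42))
    fake_os.interrupt = pipe.close
    try:
        pipe.close()
    except OSError:
        return True
    return False
```

With the unpatched ordering the overlapping call still sees a non-`None` handle and closes it a second time; with the patched ordering it sees `None` and does nothing.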
@cjbe @dnadlinger can you apply that patch locally and confirm that it fixes the error? If so, I'll close this issue and move the discussion to asyncio. |
@hartytp: Fixed the RID namespacing-related issue in our local fork. As for the handle closing issue, doesn't this look more like a race condition between (OS) threads? If that is the case, your patch would only paper over the actual problem. (There shouldn't be any lazy evaluation of that kind in Python.) |
Did you stumble over a good way of reproducing this? Repeatedly opening/closing a bunch of applets maybe? |
(Maybe it's actually our, i.e. dashboard, code calling |
Yes, that basically does it.
hmmm...I must have misread the log I posted. The original thing I was looking for was a race where a PipeHandle is closed twice, which is what the log clearly shows is occurring. Somehow I convinced myself that wasn't the case. So it is a race between threads. |
Which two threads would that be, though? Or does |
That's a good question. I haven't figured it out. AFAICT the handle is only closed by the artiq/artiq/protocols/pipe_ipc.py Lines 118 to 125 in 84b91ee
AsyncioParentComm._auto_close here artiq/artiq/protocols/pipe_ipc.py Lines 127 to 141 in 84b91ee
So, unless there is some other code path I'm not aware of (when I have time, I'll hack in a stack trace each time we close a handle to rule this out), it looks like |
The handle is also |
Looking at the trace, I think this race is probably in a single thread (which seems most likely given the original |
But then (i.e. if it's a single thread) how would self._handle change between time of check and time of use? |
sigh...sorry, you're right. It's late. Well, then I'm back to not understanding this at all. The next step I can think of would be to add a stack trace print to the |
Improved the diagnostics to show which thread the calls are coming from:
So, we definitely see the PipeHandles being closed from two threads...

import threading  # added for the diagnostics below
import traceback  # added for the diagnostics below

class PipeHandle:
    """Wrapper for an overlapped pipe handle which is vaguely file-object like.

    The IOCP event loop can use these instead of socket objects.
    """
    def __init__(self, handle):
        print("new pipe handle: {}".format(handle))
        self._handle = handle

    def __repr__(self):
        if self._handle is not None:
            handle = 'handle=%r' % self._handle
        else:
            handle = 'closed'
        return '<%s %s>' % (self.__class__.__name__, handle)

    @property
    def handle(self):
        return self._handle

    def fileno(self):
        if self._handle is None:
            raise ValueError("I/O operatioon on closed pipe")
        return self._handle

    def close(self, *, CloseHandle=_winapi.CloseHandle):
        if self._handle is not None:
            print("close pipe handle {}".format(self._handle))
            try:
                CloseHandle(self._handle)
            except Exception as e:
                print("exception closing handle {} in thread {}"
                      .format(self._handle, threading.current_thread().ident))
                print(e)
                raise
            print("closed handle {} in thread {}, trace: {}"
                  .format(self._handle, threading.current_thread().ident,
                          traceback.format_stack()))
            self._handle = None

    def __del__(self):
        if self._handle is not None:
            warnings.warn("unclosed %r" % self, ResourceWarning)
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, t, v, tb):
        self.close()
|
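The same diagnostic idea can be tried outside asyncio. The sketch below is a standalone illustration, not the code used in this thread: `describe_caller`, `TracedHandle` and `demo` are hypothetical names, and the handle is just an integer. It records the thread ident and a short stack for every `close()` call so that calls from different threads can be told apart afterwards.

```python
import threading
import traceback


def describe_caller(limit=5):
    """Return (thread ident, stack text) for the code that called this helper."""
    # Ask for one extra frame, then drop the last entry (this helper's frame).
    frames = traceback.format_stack(limit=limit + 1)[:-1]
    return threading.get_ident(), "".join(frames)


class TracedHandle:
    """Hypothetical handle wrapper that remembers who called close()."""

    def __init__(self, handle):
        self._handle = handle
        self.close_calls = []  # one (thread ident, stack text) tuple per call

    def close(self):
        self.close_calls.append(describe_caller())
        self._handle = None


def demo():
    """Close the same handle from the main thread and from a worker thread."""
    pipe = TracedHandle(1)
    pipe.close()
    worker = threading.Thread(target=pipe.close)
    worker.start()
    worker.join()
    return pipe.close_calls
```

Printing the tuples returned by `demo()` shows two different thread idents, each with the stack that led to `close()`.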
With stack traces:
So, the issue is that the pipe is being closed from |
I suspect the issue is that |
This seems to be the most common race path. |
So, it looks like a race where

def close(self):
    if self._accept_pipe_future is not None:
        self._accept_pipe_future.cancel()
        self._accept_pipe_future = None
    # Close all instances which have not been connected to by a client.
    if self._address is not None:
        for pipe in self._free_instances:
            pipe.close()
        self._pipe = None
        self._address = None
        self._free_instances.clear()
|
@dnadlinger I'm out of time to look into this. I can reproduce it pretty reliably if I restart the dashboard with some applets already open from the last time I ran it, then close the applets promptly after startup. I still don't understand exactly where |
@sbourdeauducq: Any ideas/memories from when you wrote the ARTIQ side of things originally? |
I guess a reference to the pipe is being held in some other thread, and released at some point, triggering the garbage collection race. But, I don't know enough about python gc to debug that. |
Does anyone actually understand python garbage collection and weak references? When the last (strong) reference to an object goes out of scope, when are the weak references to it invalidated? When the gc calls |
@dnadlinger FWIW, The traceback from |
Regarding the warning not being shown, |
@dnadlinger yes. FWIW, if this is a gc issue then it would have to be a cyclic reference since it's not being cleaned promptly. So, we might be able to find it using something like: http://code.activestate.com/recipes/523004-find-cyclical-references/ If it's not a cyclic reference thing then I'm at a loss to explain the stack trace I posted where we jump from the poll to the PipeHandle destructor.
Edit: or simpler, turn garbage collection off (

gc.collect()
for item in gc.garbage:
    if id(item) == id(self):
        raise Exception("Circular reference to PipeHandle")
|
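One way to test the cyclic-reference theory in isolation is sketched below, assuming only the standard `gc` module. By default `gc.garbage` holds only uncollectable objects, so the sketch sets `gc.DEBUG_SAVEALL`, which makes the collector append every unreachable object it finds to `gc.garbage` instead of freeing it. `Handle`, `Owner` and the helper functions are hypothetical stand-ins, not ARTIQ or asyncio classes.

```python
import gc


class Handle:
    """Hypothetical stand-in for PipeHandle."""

    def __init__(self, ident):
        self.ident = ident
        self.owner = None


class Owner:
    """Hypothetical object that forms a reference cycle with its handle."""

    def __init__(self, handle):
        self.handle = handle
        handle.owner = self  # back-reference closes the cycle


def handles_freed_only_by_cycle_gc(make_garbage):
    """Run make_garbage() and list the idents of Handle objects that were
    left for the cycle collector rather than freed by reference counting."""
    old_flags = gc.get_debug()
    was_enabled = gc.isenabled()
    gc.disable()
    gc.collect()  # flush garbage that predates the experiment
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        make_garbage()
        gc.collect()
        return sorted(o.ident for o in gc.garbage if isinstance(o, Handle))
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()
        if was_enabled:
            gc.enable()


def with_cycle():
    Owner(Handle(1))  # unreachable after return, but kept alive by the cycle


def without_cycle():
    Handle(2)  # freed immediately by reference counting
```

A handle that shows up in the result was part of a cycle; one that does not was released promptly when its last reference went away, which would rule out the garbage-collection explanation for that object.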
Nope, not garbage collection at all. What's happening is this: quamash processes events in a separate thread. Those events can store a reference to the handle. When they are released, for example here https://github.com/OxfordIonTrapGroup/quamash/blob/b006c9a163f55aba044a9ad8532c65c13f35121f/quamash/__init__.py#L210 |
@sbourdeauducq summarising this then:
So, all in all, it seems pretty clear that this is a quamash issue and not an ARTIQ bug. Since these errors aren't harmful (just a Pipe being double closed), I think the best bet is to catch the exception in |
Actually, one thing that might fix this is modifying

loop = asyncio.get_event_loop()
if not loop.is_running():
    self.server[0].close()
else:
    with QtCore.QMutexLocker(loop._proactor._lock):
        self.server[0].close()

Next time I can reproduce this issue, I'll try that... |
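The general idea of serialising the close behind a lock can be sketched without Qt or quamash. The code below is only an illustration under that assumption: it uses a plain `threading.Lock` instead of the proactor's `QMutex`, and `LockedHandle` and `count_os_closes` are hypothetical names. The lock protects the swap of `_handle` to `None`, so exactly one caller ends up owning the raw handle and closing it.

```python
import threading


class LockedHandle:
    """Hypothetical wrapper whose close() is safe to call from several threads."""

    def __init__(self, handle, close_handle):
        self._handle = handle
        self._close_handle = close_handle  # stand-in for the OS-level close call
        self._lock = threading.Lock()

    def close(self):
        # Only the swap needs the lock; whoever gets the handle closes it.
        with self._lock:
            handle, self._handle = self._handle, None
        if handle is not None:
            self._close_handle(handle)


def count_os_closes(n_threads=8):
    """Call close() from n_threads at once and return the OS-level close calls."""
    calls = []
    pipe = LockedHandle(42, calls.append)
    start = threading.Barrier(n_threads)

    def worker():
        start.wait()  # release all threads together to provoke contention
        pipe.close()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return calls
```

However many threads race, the list returned by `count_os_closes()` contains the handle once.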
Is this still an issue with the new Python 3.7 packages? |
Assuming it isn't. |
Bug Report
One-Line Summary
The dashboard occasionally throws "Task exception was never retrieved" errors on Windows, even with no user activity.
Issue Details
These errors seem to occur occasionally (~10 per day), even without any user activity: for example, on a dashboard that has some applets open plotting actively changing datasets, with no button presses and no applets being opened or closed.
Steps to Reproduce
No known reproduction
Actual (undesired) Behavior
The traceback is always like this:
Your System
Using ARTIQ from master