intermittent hanging #565
Comments
Obviously, debugging async-in-another-thread is hard :| Would you mind running some tests? I don't seem to see these hangs, but a number of factors might be in play.

Test 1:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -53,8 +53,9 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
             if callback_timeout is not None:
                 future = asyncio.wait_for(future, callback_timeout)
             result[0] = await future
-        except Exception:
-            error[0] = sys.exc_info()
+        except Exception as ex:
+            error[0] = str(ex)
+            del ex
         finally:
             thread_state.asynchronous = False
             e.set()
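For context on what Test 1 changes: `error[0] = sys.exc_info()` stores the full (type, value, traceback) tuple, while the patch keeps only the exception's message string and deletes the exception object. A minimal sketch (not fsspec code) of why that distinction matters:

```python
# Minimal sketch (not fsspec code) of what Test 1 is probing: storing
# sys.exc_info() keeps the traceback alive, and the traceback keeps every
# frame and its locals reachable until the tuple is dropped.
import sys

def fails():
    big_local = bytearray(10 ** 6)  # stays reachable through the traceback
    raise ValueError("boom")

try:
    fails()
except Exception:
    info = sys.exc_info()  # (type, value, traceback)
    frame_locals = info[2].tb_next.tb_frame.f_locals
    print("big_local" in frame_locals)  # True: the failing frame is still alive
# Test 1 keeps only the message string and deletes the exception,
# which breaks this reference chain.
```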
Another possibly useful diagnostic:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -64,8 +65,10 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
         if not e.wait(callback_timeout):
             raise TimeoutError("timed out after %s s." % (callback_timeout,))
     else:
-        while not e.is_set():
-            e.wait(10)
+        try:
+            e.wait()
+        except KeyboardInterrupt:
+            print(asyncio.tasks.all_tasks(loop))
     if error[0]:
         typ, exc, tb = error[0]
         raise exc.with_traceback(tb)
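The same information can also be pulled out interactively, without patching fsspec. A rough sketch, assuming the async filesystem instance whose call is hanging exposes its dedicated event loop as `.loop` (as fsspec's AsyncFileSystem does):

```python
# Hedged sketch: list the coroutines still pending on the filesystem's IO loop.
# Assumes `fs.loop` is the event loop that fsspec's sync() submits work to.
import asyncio
import fsspec

fs = fsspec.filesystem("gs")                # the instance whose call is hanging
pending = asyncio.tasks.all_tasks(fs.loop)  # tasks still alive on that loop
for task in pending:
    print(task)  # the repr names the coroutine and shows where it is suspended
```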
cc @zflamig
Is that a netCDF file, @rabernat? So far I've only observed this when reading them. But it may also be related to graph size. I noticed that I could sometimes get completions with small enough subsets of netCDF data, but otherwise it was hanging 95% of the time. When I tried against Zarr datasets I didn't observe any hangs at all.
@zflamig, is there a similar number and cadence of Zarr reads compared to netCDF?
No. It's just a raw binary read on a big (500 MB) file. |
I am also encountering this currently when serially reading several thousand files (~5 MB maximum):
I see exactly the same hanging behavior at random points in this for loop, where a different file fails to open each time, but it always hangs somewhere in the loop. Background:
@Flinz: by hanging, do you mean "pause", or really "lockup"? In the thread above I describe some debug steps that you could take, but none of the other reporters ever did. As an aside, if you always want the whole file, you can use cat/cat_file on the filesystem instance for this.
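A short sketch of the cat/cat_file suggestion (the bucket and object names here are made up): fetch whole objects in one call instead of open()/read():

```python
# Sketch of the cat/cat_file suggestion; bucket and object names are hypothetical.
import fsspec

fs = fsspec.filesystem("gs")
data = fs.cat_file("example-bucket/data/file-0001.bin")    # bytes of one object
many = fs.cat(["example-bucket/data/file-0001.bin",
               "example-bucket/data/file-0002.bin"])        # dict of {path: bytes}
```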
@martindurant I really mean locked up, indefinitely, as far as I can tell.
To be more precise (I didn't want to go into details): I am actually opening the files through pandas, which then goes to fsspec/gcsfs. Sorry for the obfuscation. Here is the stack trace of the call:
I'll see whether I can include some of the debug steps.
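A hedged sketch of the access pattern being described (hypothetical bucket, file names, and format; not the reporter's actual code): pandas hands gs:// URLs to fsspec, which dispatches to gcsfs, so each read ends up in the same sync() machinery shown in the diffs above:

```python
# Hypothetical reconstruction of the reported pattern, not the original code:
# thousands of small files read one by one through pandas -> fsspec -> gcsfs.
import pandas as pd

paths = [f"gs://example-bucket/data/part-{i:05d}.csv" for i in range(5000)]
frames = []
for path in paths:
    # pandas resolves the gs:// URL via fsspec/gcsfs; each call goes through
    # fsspec.asyn.sync(), which is where the intermittent hang is observed
    frames.append(pd.read_csv(path))
```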
No worries; there are of course situations where you have no control over how code is called. Please do report back if you manage to get any more information about which specific coroutine is failing, and how. We can try assigning timeouts in various places, but it is very tricky to figure out what reasonable values might be.
@martindurant I was able to inspect the threads (using py-spy), and this is the only other thread I see related to fsspec/gcsfs:
Is it consistently within refresh_credentials? That could be telling and give us an excellent place to make sure we have a timeout. I wonder if it might be taking very long rather than forever...
In all processes that I've seen (I've checked around 5 instances now) it's always the same place. I'll monitor a little more and let you know if I see anything else. Regarding time: I've let jobs run for more than 24 h stuck in this place, so it's at least approximately infinite :)
OK, so there is unfortunate recursion here. I think the following should fix things:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -180,7 +180,7 @@ class GoogleCredentials:
                 return  # repeat to avoid race (but don't want lock in common case)
             logger.debug("GCS refresh")
             self.credentials.refresh(req)
-            self.apply(self.heads)
+            self.credentials.apply(self.heads)
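The one-line fix suggests the old code called back into a method that re-enters the refresh path. A generic illustration of how that kind of recursion through a non-reentrant lock hangs forever (not gcsfs's actual code; the class and method names here are made up), which would be consistent with jobs stuck indefinitely inside refresh_credentials:

```python
# Minimal illustration (not gcsfs code): a refresh path that calls back into
# apply(), which re-checks freshness under the same non-reentrant lock, blocks
# itself forever on the second acquisition.
import threading

class Creds:
    def __init__(self):
        self.lock = threading.Lock()   # a plain Lock: re-acquisition deadlocks

    def apply(self, headers):
        self.maybe_refresh()           # apply() re-checks freshness first
        headers["Authorization"] = "Bearer ..."

    def maybe_refresh(self):
        with self.lock:
            # ... refresh the token, then (buggy) call back into apply():
            self.apply({})             # recursion: blocks forever on self.lock

# Creds().maybe_refresh()             # uncommenting this never returns
```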
No, this is not a thread about my social life during COVID... 🤣
I have noticed that fsspec intermittently hangs on some operations. The only way around it is to interrupt.
Example (not reproducible because it is intermittent)
Whenever this happens, regardless of the context, I am always at the same place (waiter.acquire) when I interrupt.
fsspec version '0.8.5', from Pangeo Cloud.
cc @rsignell-usgs, who mentioned this at yesterday's Pangeo meeting
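For reference, the waiter.acquire frame is simply where threading.Event.wait() blocks: fsspec's sync() waits on an Event until the coroutine on the IO loop sets it, and Event.wait() bottoms out in Condition.wait() acquiring a waiter lock. A small sketch of that (not the reporter's actual traceback):

```python
# Sketch: interrupting a thread blocked in threading.Event.wait() lands in
# Condition.wait()'s waiter.acquire(), which is the frame reported above.
# fsspec's sync() blocks on exactly such an Event while the coroutine runs.
import threading

e = threading.Event()
# e.wait()   # uncommenting this blocks here; Ctrl-C would show waiter.acquire()
```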