intermittent hanging #565

Closed
rabernat opened this issue Mar 11, 2021 · 15 comments · Fixed by #572

Comments

@rabernat (Contributor)

No, this is not a thread about my social life during COVID... 🤣

I have noticed that fsspec intermittently hangs on some operations. The only way around it is to interrupt.

Example (not reproducible, since the hang is intermittent):

url = 'https://mghp.osn.xsede.org/cnh-bucket-1/llc4320_tests/10dayhourly/Eta.0000787968.data.shrunk'
with fsspec.open(url, 'rb') as fp:
    _ = fp.read()
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<timed exec> in <module>

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    434             )  # all fits in one block anyway
    435         ):
--> 436             self._fetch_all()
    437         if self.size is None:
    438             if length < 0:

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
    119     def wrapper(*args, **kwargs):
    120         self = obj or args[0]
--> 121         return maybe_sync(func, self, *args, **kwargs)
    122 
    123     return wrapper

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     98         if inspect.iscoroutinefunction(func):
     99             # run the awaitable on the loop
--> 100             return sync(loop, func, *args, **kwargs)
    101         else:
    102             # just call the blocking function

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     66     else:
     67         while not e.is_set():
---> 68             e.wait(10)
     69     if error[0]:
     70         typ, exc, tb = error[0]

/srv/conda/envs/notebook/lib/python3.8/threading.py in wait(self, timeout)
    556             signaled = self._flag
    557             if not signaled:
--> 558                 signaled = self._cond.wait(timeout)
    559             return signaled
    560 

/srv/conda/envs/notebook/lib/python3.8/threading.py in wait(self, timeout)
    304             else:
    305                 if timeout > 0:
--> 306                     gotit = waiter.acquire(True, timeout)
    307                 else:
    308                     gotit = waiter.acquire(False)

KeyboardInterrupt: 

Whenever this happens, regardless of the context, the interrupt always lands in the same place: waiter.acquire.
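
For context, the calling thread here is blocked in fsspec's sync(), which submits the coroutine to an event loop running on a separate thread and then waits on a threading.Event. A minimal sketch of that pattern (illustrative names, not fsspec's actual internals):

import asyncio
import threading

# Sketch of the sync-over-async pattern: coroutines run on an event loop in a
# background thread, and the caller blocks on a threading.Event.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def run_on_loop(coro, poll_interval=10):
    done = threading.Event()
    result, error = [None], [None]

    async def runner():
        try:
            result[0] = await coro
        except Exception as exc:
            error[0] = exc
        finally:
            done.set()

    asyncio.run_coroutine_threadsafe(runner(), loop)
    # If the coroutine never completes, the caller waits here forever; this
    # Event.wait is where the KeyboardInterrupt tracebacks above end up.
    while not done.is_set():
        done.wait(poll_interval)
    if error[0]:
        raise error[0]
    return result[0]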

fsspec version 0.8.5, from the Pangeo Cloud environment

cc @rsignell-usgs, who mentioned this at yesterday's Pangeo meeting

@martindurant (Member)

Obviously, debugging async-in-another-thread is hard :|

Would you mind running some tests? I don't see these hangs myself, but a number of factors might be in play.

Test 1:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -53,8 +53,9 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
             if callback_timeout is not None:
                 future = asyncio.wait_for(future, callback_timeout)
             result[0] = await future
-        except Exception:
-            error[0] = sys.exc_info()
+        except Exception as exc:
+            error[0] = str(exc)
+            del exc
         finally:
             thread_state.asynchronous = False
             e.set()

@martindurant (Member)

Another possibly useful diagnostic:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -64,8 +65,10 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
         if not e.wait(callback_timeout):
             raise TimeoutError("timed out after %s s." % (callback_timeout,))
     else:
-        while not e.is_set():
-            e.wait(10)
+        try:
+            e.wait()
+        except KeyboardInterrupt:
+            print(asyncio.tasks.all_tasks(loop))
     if error[0]:
         typ, exc, tb = error[0]
         raise exc.with_traceback(tb)

@rsignell-usgs (Contributor)

cc @zflamig

@zflamig commented Mar 12, 2021

Is that a netCDF file, @rabernat? So far I've only observed this when reading netCDF. But it may also be related to graph size: I noticed that I could sometimes get completions with small enough subsets of netCDF data, but otherwise it was hanging 95% of the time. When I tried against Zarr datasets I didn't observe any hangs at all.

@martindurant (Member)

@zflamig, is there a similar number and cadence of Zarr reads compared to netCDF?

@rabernat (Contributor, Author)

Is that a netCDF file, @rabernat?

No. It's just a raw binary read on a big (500 MB) file.

@seeholza commented Oct 3, 2024

I am also currently encountering this when serially reading several thousand files (~5 MB maximum each):

for fn in files:
    with fsspec.open(fn) as f:
        content = f.read()
    # do something with content that takes 10-20 seconds

I see exactly the same hanging behavior at random points in this loop: each time it is a different file that fails to open, but every run hangs somewhere in the loop.

Background:

  • opening Google Cloud Storage files
  • running on k8s
  • fsspec/gcsfs 2024.6.1

@martindurant (Member)

@Flinz: by hanging, do you mean "pause", or really "lock up"? In the thread above I describe some debug steps you could take, but none of the other reporters ever did.

As an aside, if you always want the whole file, you can use cat/cat_file on the filesystem instance for this.
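
For example, something like this (a minimal sketch; the bucket and object paths are placeholders):

import fsspec

# Sketch only: the bucket and paths below are placeholders. cat_file() returns
# the bytes of a single object; cat() with a list returns a dict {path: bytes}.
fs = fsspec.filesystem("gcs")
data = fs.cat_file("my-bucket/path/to/one-file.bin")
many = fs.cat(["my-bucket/a.bin", "my-bucket/b.bin"])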

@seeholza commented Oct 3, 2024

@martindurant I really mean locked up, indefinitely, as far as I can tell.

use cat/cat_file

To be more precise (I didn't want to go into details), I am actually opening the files through pandas, which then goes through fsspec/gcsfs. Sorry for the obfuscation. Here is the stack trace of the call:

Traceback (most recent call first):
  File "/usr/lib/python3.11/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/lib/python3.11/threading.py", line 622, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/local/lib/python3.11/dist-packages/fsspec/asyn.py", line 91, in sync
  File "/usr/local/lib/python3.11/dist-packages/fsspec/asyn.py", line 118, in wrapper
  File "/usr/local/lib/python3.11/dist-packages/gcsfs/core.py", line 1924, in _fetch_range
  File "/usr/local/lib/python3.11/dist-packages/fsspec/caching.py", line 234, in _fetch
  File "/usr/local/lib/python3.11/dist-packages/fsspec/spec.py", line 1897, in read
  <built-in method read of _io.TextIOWrapper object at remote 0x786a62123030>
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1620, in __init__
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 620, in _read
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1026, in read_csv

I'll see whether I can include some of the debug steps.
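
(For anyone else hitting this, one way to pull stacks out of an already-hung process, sketched with placeholder values:)

# From outside the process (py-spy must be installed; the PID is a placeholder):
#     py-spy dump --pid 12345
#
# Or, from inside the program, set this up before the hang so the stacks can be
# requested later with `kill -USR1 <pid>`:
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)  # dumps every thread's stack to stderr on SIGUSR1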

@martindurant (Member)

Sorry for the obfuscation.

No worries; there are of course situations where you have no control over how code is called.

Please do report back if you manage to get any more information about which specific coroutine is failing and how. We can of course try assigning timeouts in various places, but it is very tricky to figure out what reasonable values might be.

@seeholza commented Oct 4, 2024

@martindurant I was able to inspect the threads (using py-spy), and this is the only other thread I see related to fsspec/gcsfs:

Thread 43 (active): "fsspecIO"
    maybe_refresh (gcsfs/credentials.py:178)
    apply (gcsfs/credentials.py:187)
    maybe_refresh (gcsfs/credentials.py:183)
    apply (gcsfs/credentials.py:187)
    _get_headers (gcsfs/core.py:400)
    _request (gcsfs/core.py:421)
    retry_request (gcsfs/retry.py:123)
    fun (decorator.py:221)
    _call (gcsfs/core.py:437)
    _get_object (gcsfs/core.py:516)
    _info (gcsfs/core.py:959)
    _exists (fsspec/asyn.py:677)
    _runner (fsspec/asyn.py:56)
    _run (asyncio/events.py:84)
    _run_once (asyncio/base_events.py:1936)
    run_forever (asyncio/base_events.py:608)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)

@martindurant (Member)

Is it consistently within the credential refresh? That could be telling, and would give us an excellent place to make sure we have a timeout. I wonder if it might be taking a very long time rather than forever...

@seeholza commented Oct 4, 2024

In all processes that I've seen (I've checked around 5 instances now) it's always the same place. I'll monitor a little more and let you know if I see anything else.

Regarding time: I've let jobs run for more than 24 hours stuck in this place, so it's at least approximately infinite :)

@martindurant (Member)

OK, so there is unfortunate recursion here.

I think the following should fix things:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -180,7 +180,7 @@ class GoogleCredentials:
                     return  # repeat to avoid race (but don't want lock in common case)
                 logger.debug("GCS refresh")
                 self.credentials.refresh(req)
-                self.apply(self.heads)
+                self.credentials.apply(self.heads)
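
For reference, the call pattern visible in the py-spy dump can be sketched roughly like this (illustrative class and attribute names, not the actual gcsfs code): the wrapper's apply calls maybe_refresh, which after a refresh calls back into the wrapper's own apply instead of the underlying google-auth credentials object's apply.

# Simplified sketch of the recursion (illustrative, not the real gcsfs code)
class CredentialsWrapperSketch:
    def __init__(self, credentials, request):
        self.credentials = credentials   # e.g. a google-auth credentials object
        self.req = request
        self.heads = {}

    def apply(self, headers):
        self.maybe_refresh()             # every request: apply -> maybe_refresh
        headers.update(self.heads)

    def maybe_refresh(self):
        if self.credentials.expired:
            self.credentials.refresh(self.req)
            # Bug: this re-enters the wrapper's apply(), which calls maybe_refresh()
            # again, matching the repeated apply/maybe_refresh frames in the dump.
            # The fix above calls self.credentials.apply(self.heads) instead, i.e.
            # the underlying credentials object's apply, which only sets headers.
            self.apply(self.heads)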

@reineking

I'm seeing a very similar issue, where a process consistently starts to hang after 60 minutes when trying to write a Parquet file using fsspec/gcsfs. The issue indeed seems to be related to a credential refresh, as can be seen from this profile:
[profile screenshot attached]
