intermittent hanging #565

Closed
rabernat opened this issue Mar 11, 2021 · 15 comments · Fixed by #572

Comments

@rabernat (Contributor)

No, this is not a thread about my social life during COVID... 🤣

I have noticed that fsspec intermittently hangs on some operations. The only way around it is to interrupt.

Example (not reproducible, since the hang is intermittent):

url = 'https://mghp.osn.xsede.org/cnh-bucket-1/llc4320_tests/10dayhourly/Eta.0000787968.data.shrunk'
with fsspec.open(url, 'rb') as fp:
    _ = fp.read()
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<timed exec> in <module>

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    434             )  # all fits in one block anyway
    435         ):
--> 436             self._fetch_all()
    437         if self.size is None:
    438             if length < 0:

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
    119     def wrapper(*args, **kwargs):
    120         self = obj or args[0]
--> 121         return maybe_sync(func, self, *args, **kwargs)
    122 
    123     return wrapper

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     98         if inspect.iscoroutinefunction(func):
     99             # run the awaitable on the loop
--> 100             return sync(loop, func, *args, **kwargs)
    101         else:
    102             # just call the blocking function

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     66     else:
     67         while not e.is_set():
---> 68             e.wait(10)
     69     if error[0]:
     70         typ, exc, tb = error[0]

/srv/conda/envs/notebook/lib/python3.8/threading.py in wait(self, timeout)
    556             signaled = self._flag
    557             if not signaled:
--> 558                 signaled = self._cond.wait(timeout)
    559             return signaled
    560 

/srv/conda/envs/notebook/lib/python3.8/threading.py in wait(self, timeout)
    304             else:
    305                 if timeout > 0:
--> 306                     gotit = waiter.acquire(True, timeout)
    307                 else:
    308                     gotit = waiter.acquire(False)

KeyboardInterrupt: 

Whenever this happens, regardless of the context, the interrupt always lands in the same place: waiter.acquire.
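
For context, the calling thread here is blocked in fsspec's sync(), which submits the coroutine to an event loop running on a separate thread and then waits on a threading.Event. A minimal sketch of that pattern (illustrative names, not fsspec's actual internals):

import asyncio
import threading

# Sketch of the sync-over-async pattern: coroutines run on an event loop in a
# background thread, and the caller blocks on a threading.Event.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def run_on_loop(coro, poll_interval=10):
    done = threading.Event()
    result, error = [None], [None]

    async def runner():
        try:
            result[0] = await coro
        except Exception as exc:
            error[0] = exc
        finally:
            done.set()

    asyncio.run_coroutine_threadsafe(runner(), loop)
    # If the coroutine never completes, the caller waits here forever; this
    # Event.wait is where the KeyboardInterrupt tracebacks above end up.
    while not done.is_set():
        done.wait(poll_interval)
    if error[0]:
        raise error[0]
    return result[0]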

fsspec version 0.8.5, from the Pangeo Cloud environment

cc @rsignell-usgs, who mentioned this at yesterday's Pangeo meeting

@martindurant (Member)

Obviously, debugging async-in-another-thread is hard :|

Would you mind running some tests? I don't see these hangs myself, but a number of factors might be in play.

Test 1:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -53,8 +53,9 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
             if callback_timeout is not None:
                 future = asyncio.wait_for(future, callback_timeout)
             result[0] = await future
-        except Exception:
-            error[0] = sys.exc_info()
+        except Exception as exc:
+            error[0] = str(exc)
+            del exc
         finally:
             thread_state.asynchronous = False
             e.set()

@martindurant (Member)

Another possibly useful diagnostic:

--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -64,8 +65,10 @@ def sync(loop, func, *args, callback_timeout=None, **kwargs):
         if not e.wait(callback_timeout):
             raise TimeoutError("timed out after %s s." % (callback_timeout,))
     else:
-        while not e.is_set():
-            e.wait(10)
+        try:
+            e.wait()
+        except KeyboardInterrupt:
+            print(asyncio.tasks.all_tasks(loop))
     if error[0]:
         typ, exc, tb = error[0]
         raise exc.with_traceback(tb)

@rsignell-usgs (Contributor)

cc @zflamig

@zflamig commented Mar 12, 2021

Is that a netCDF file, @rabernat? So far I've only observed this when reading netCDF. But it may also be related to graph size: I noticed that I could sometimes get completions with small enough subsets of netCDF data, but otherwise it was hanging 95% of the time. When I tried against Zarr datasets I didn't observe any hangs at all.

@martindurant (Member)

@zflamig, is there a similar number and cadence of Zarr reads compared to netCDF?

@rabernat (Contributor, Author)

Is that a netCDF file, @rabernat?

No. It's just a raw binary read on a big (500 MB) file.

@seeholza commented Oct 3, 2024

I am also currently encountering this when serially reading several thousand files (~5 MB maximum each):

for fn in files:
    with fsspec.open(fn) as f:
        content = f.read()
    # do something with content that takes 10-20 seconds

I see exactly the same hanging behavior at random points in this loop: each time it is a different file that fails to open, but every run hangs somewhere in the loop.

Background:

  • opening Google Cloud Storage files
  • running on k8s
  • fsspec/gcsfs 2024.6.1

@martindurant (Member)

@Flinz: by hanging, do you mean "pause", or really "lock up"? In the thread above I describe some debug steps you could take, but none of the other reporters ever did.

As an aside, if you always want the whole file, you can use cat/cat_file on the filesystem instance for this.
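
For example, something like this (a minimal sketch; the bucket and object paths are placeholders):

import fsspec

# Sketch only: the bucket and paths below are placeholders. cat_file() returns
# the bytes of a single object; cat() with a list returns a dict {path: bytes}.
fs = fsspec.filesystem("gcs")
data = fs.cat_file("my-bucket/path/to/one-file.bin")
many = fs.cat(["my-bucket/a.bin", "my-bucket/b.bin"])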

@seeholza commented Oct 3, 2024

@martindurant I really mean locked up, indefinitely, as far as I can tell.

use cat/cat_file

To be more precise (I didn't want to go into details), I am actually opening the files through pandas, which then goes through fsspec/gcsfs. Sorry for the obfuscation. Here is the stack trace of the call:

Traceback (most recent call first):
  File "/usr/lib/python3.11/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/lib/python3.11/threading.py", line 622, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/local/lib/python3.11/dist-packages/fsspec/asyn.py", line 91, in sync
  File "/usr/local/lib/python3.11/dist-packages/fsspec/asyn.py", line 118, in wrapper
  File "/usr/local/lib/python3.11/dist-packages/gcsfs/core.py", line 1924, in _fetch_range
  File "/usr/local/lib/python3.11/dist-packages/fsspec/caching.py", line 234, in _fetch
  File "/usr/local/lib/python3.11/dist-packages/fsspec/spec.py", line 1897, in read
  <built-in method read of _io.TextIOWrapper object at remote 0x786a62123030>
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1620, in __init__
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 620, in _read
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1026, in read_csv

I'll see whether I can include some of the debug steps.
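
(For anyone else hitting this, one way to pull stacks out of an already-hung process, sketched with placeholder values:)

# From outside the process (py-spy must be installed; the PID is a placeholder):
#     py-spy dump --pid 12345
#
# Or, from inside the program, set this up before the hang so the stacks can be
# requested later with `kill -USR1 <pid>`:
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)  # dumps every thread's stack to stderr on SIGUSR1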

@martindurant (Member)

Sorry for the obfuscation.

No worries; there are of course situations where you have no control over how code is called.

Please do report back if you manage to get any more information about which specific coroutine is failing and how. We can of course try assigning timeouts in various places, but it is very tricky to figure out what reasonable values might be.

@seeholza commented Oct 4, 2024

@martindurant I was able to inspect the threads (using py-spy), and this is the only other thread I see related to fsspec/gcsfs:

Thread 43 (active): "fsspecIO"
    maybe_refresh (gcsfs/credentials.py:178)
    apply (gcsfs/credentials.py:187)
    maybe_refresh (gcsfs/credentials.py:183)
    apply (gcsfs/credentials.py:187)
    _get_headers (gcsfs/core.py:400)
    _request (gcsfs/core.py:421)
    retry_request (gcsfs/retry.py:123)
    fun (decorator.py:221)
    _call (gcsfs/core.py:437)
    _get_object (gcsfs/core.py:516)
    _info (gcsfs/core.py:959)
    _exists (fsspec/asyn.py:677)
    _runner (fsspec/asyn.py:56)
    _run (asyncio/events.py:84)
    _run_once (asyncio/base_events.py:1936)
    run_forever (asyncio/base_events.py:608)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)

@martindurant (Member)

Is it consistently within the credential refresh? That could be telling, and would give us an excellent place to make sure we have a timeout. I wonder if it might be taking a very long time rather than forever...

@seeholza commented Oct 4, 2024

In all processes that I've seen (I've checked around 5 instances now) it's always the same place. I'll monitor a little more and let you know if I see anything else.

Regarding time: I've let jobs run for more than 24 hours stuck in this place, so it's at least approximately infinite :)

@martindurant (Member)

OK, so there is unfortunate recursion here.

I think the following should fix things:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -180,7 +180,7 @@ class GoogleCredentials:
                     return  # repeat to avoid race (but don't want lock in common case)
                 logger.debug("GCS refresh")
                 self.credentials.refresh(req)
-                self.apply(self.heads)
+                self.credentials.apply(self.heads)
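
For reference, the call pattern visible in the py-spy dump can be sketched roughly like this (illustrative class and attribute names, not the actual gcsfs code): the wrapper's apply calls maybe_refresh, which after a refresh calls back into the wrapper's own apply instead of the underlying google-auth credentials object's apply.

# Simplified sketch of the recursion (illustrative, not the real gcsfs code)
class CredentialsWrapperSketch:
    def __init__(self, credentials, request):
        self.credentials = credentials   # e.g. a google-auth credentials object
        self.req = request
        self.heads = {}

    def apply(self, headers):
        self.maybe_refresh()             # every request: apply -> maybe_refresh
        headers.update(self.heads)

    def maybe_refresh(self):
        if self.credentials.expired:
            self.credentials.refresh(self.req)
            # Bug: this re-enters the wrapper's apply(), which calls maybe_refresh()
            # again, matching the repeated apply/maybe_refresh frames in the dump.
            # The fix above calls self.credentials.apply(self.heads) instead, i.e.
            # the underlying credentials object's apply, which only sets headers.
            self.apply(self.heads)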

@reineking

I'm seeing a very similar issue, where a process consistently starts to hang after 60 minutes when trying to write a Parquet file using fsspec/gcsfs. The issue indeed seems to be related to a credential refresh, as can be seen from this profile:
[profile screenshot attached]
