
seqrepo exceptions when handling concurrent requests #12

Closed
theferrit32 opened this issue May 31, 2023 · 12 comments

@theferrit32
Contributor

Description

The same file exceptions tracked in clingen-data-model/architecture#548 (related to seqrepo thread safety) also occur when running seqrepo-rest-service. Under concurrent requests, the global seqrepo object in the Flask worker raises exceptions, and once a request hits this exception, every subsequent request that tries to read from that file fails as well.

Stopping and starting the seqrepo-rest-service process resets it and temporarily resolves the problem, so the issue lies in process state; there is no issue with the actual files on the filesystem.

The temporary workaround for the GitHub issue above, which avoids modifying the seqrepo codebase, was to add mutual exclusion around the object that contains the SeqRepo object. This lets the application run concurrent threads except while they are executing code under that object, which is suboptimal because that is a fairly large critical section.
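A minimal sketch of that kind of wrapper, assuming a hypothetical LockedSeqRepo class (this is not the actual ClinGen code):

import threading

from biocommons.seqrepo import SeqRepo

class LockedSeqRepo:
    """Serialize all access to a shared SeqRepo instance behind one lock."""

    def __init__(self, root_dir):
        self._lock = threading.Lock()
        self._sr = SeqRepo(root_dir)

    def fetch(self, seq_id, start=None, end=None):
        # Only one thread at a time may touch the underlying SeqRepo,
        # so every fetch runs inside one large critical section.
        with self._lock:
            return self._sr.sequences.fetch(seq_id, start, end)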

Steps to reproduce

Run the server from one shell:

$ seqrepo-rest-service /path/to/seqrepo/2021-01-29

Send requests from a second shell:

$ bash -c 'for i in $(seq 1 100); do curl "http://127.0.0.1:5001/seqrepo/1/sequence/refseq%3ANM_000551.3" & done'
@theferrit32
Contributor Author

Ideally this will be resolved by this:
biocommons/biocommons.seqrepo#112

Alternatively, seqrepo-rest-service can add mutual exclusion around the entire SeqRepo object.

@theferrit32
Contributor Author

Unfortunately biocommons/biocommons.seqrepo#117 doesn't resolve this. Something stateful is happening when a FastaFile object is constructed and destructed, outside the scope of the in-memory struct itself. Opening a number of FastaFile objects on the same file and closing them, with these operations interleaved in some random order, can fail with OS/filesystem errors even though the FastaFile was never modified.

One workaround is to change the scope of the SeqRepo object from per-request to per-process. Right now the SeqRepo object is constructed and its reference is added to Flask's request-scoped g object, so every request constructs a new SeqRepo object that is later destructed.

Removing the flask.g import and adding this snippet to the top of the file changes the SeqRepo object's scope to per-process, so it is constructed only once per Flask worker and shared across all requests that worker handles.

class G(object):
    """
    This class contains nothing; it is just a place where attributes can be
    stored in process memory, across request thread contexts.
    """

g = G()

This reduces the amount of SeqRepo construction and destruction, which (as far as I can tell) eliminates the filesystem errors.
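A minimal sketch of how that per-process g might be used, assuming a hypothetical _get_seqrepo helper (not the actual code in this repo):

import threading

from biocommons.seqrepo import SeqRepo

_init_lock = threading.Lock()

def _get_seqrepo(root_dir):
    """Construct SeqRepo at most once per process and reuse it for every request."""
    with _init_lock:
        if not hasattr(g, "seqrepo"):  # g is the module-level G() instance above
            g.seqrepo = SeqRepo(root_dir)
    return g.seqrepo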

@reece
Member

reece commented Sep 7, 2023

The above reproduction doesn't work for me. I believe you that you're seeing this error. It will be easier to figure out what's happening and to prove that we fixed it if we can reproduce it.

@reece
Member

reece commented Sep 7, 2023

See #14 (now closed) for additional observations and context.

@reece
Member

reece commented Sep 7, 2023

@theferrit32 @wlymanambry @davmlaw : I'd appreciate some help thinking through the strategy for solving this issue.

We have two seqrepo-rest-service issues (this issue and the duplicate #14), a biocommons.seqrepo issue (biocommons/biocommons.seqrepo#112), and corresponding PRs (#15 and biocommons/biocommons.seqrepo#117). I think all of this discussion is really around a single topic: concurrency issues with the seqrepo stack.

The observation is that seqrepo-rest-service fails after a period of use. @wlymanambry noticed the large number of files left open, so the current conjecture is that we're exceeding the maximum number of open file descriptors per process (often 1024). The use of lru_cache is relevant. My thinking at the time was that lru_cache is limited to 128 entries, so it would be exhausted before hitting the fd limit. Unfortunately, I wasn't thinking about threading, where every thread contributes to the same process limit. This sounds pretty likely to me: I'd expect 2 sqlite fds per SeqRepo instance, plus at least 2 fds per sequence file read, so we could hit that limit with as few as ~500 requests, and only with multiple threads.

Importantly, the issue is not thread safety (i.e., corruption or blocking between threads), but rather concurrency in the sense of exceeding the number of available fds. I don't yet see any evidence of a lack of thread safety. The files are all opened read-only, so I would not expect any fs-level concurrency issues.

@theferrit32 submitted two PRs, one for seqrepo-rest-service and one for biocommons.seqrepo. My instinct is that the issue is squarely with seqrepo and that if we solve that well, the s-r-s issue will disappear.

I'd like to get to a point where we can demonstrate the issue reliably, and then show that we can reliably solve it. Here are some experiments we can try:

  • Bring up seqrepo with limited fds, then issue a barrage of queries against different sequences (we need circumstances that open different sequence files). Something like (ulimit -n 50; seqrepo-rest-service /path/to/) should do it.
  • Remove lru_cache in fastadir. Without that caching, I don't see how we would be keeping files open.
  • Add a function to seqrepo to return a few stats like cache utilization and # of open fds. (For the latter, len(glob("/proc/self/fd/*")) is probably the easiest way I know; see the sketch below.)
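A minimal sketch of such a stats helper; fd_count and cache_stats are hypothetical names, and /proc/self/fd only exists on Linux:

from glob import glob

def fd_count():
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(glob("/proc/self/fd/*"))

def cache_stats(cached_func):
    """Return hits/misses/maxsize/currsize for a functools.lru_cache-wrapped function."""
    return cached_func.cache_info()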

If lru_cache is the culprit, which seems likely, then it might be that we're tripping over my premature optimization. If that's true, we should just drop it. I'm kinda skeptical that it will make a huge difference relative to other costs.

I'd appreciate comments and volunteers to try the above experiments.

@theferrit32
Contributor Author

theferrit32 commented Sep 7, 2023

@reece I finally figured out what this is. It's not the same thing as clingen-data-model/architecture#548; that issue is related to thread safety of the lru_cache in FastaDir.

This one is caused by the open file limit (ulimit -n) combined with CPython not always immediately freeing an object once it is no longer referenced (I can replicate this sometimes with a made-up example). The exception message is not informative, which is why I thought it was related to issue 548. The open call that fails is inside the htslib C code, not Python, and it doesn't check errno to include a description of why it failed the way Python does.

The error message from the C library is:

[E::fai_load3_core] Failed to open FASTA file /Users/kferrite/dev/biocommons.seqrepo/seqrepo/2021-01-29/sequences/2016/0824/050304/1472014984.5124342.fa.bgz

And in the Python/Cython layer:

Traceback (most recent call last):
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 2552, in __call__
    return self.wsgi_app(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 2532, in wsgi_app
    response = self.handle_exception(e)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
               ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
               ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
           ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/produces.py", line 41, in wrapper
    response = function(request)
               ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/response.py", line 112, in wrapper
    response = function(request)
               ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/venv/lib/python3.11/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/seqrepo-rest-service/src/seqrepo_rest_service/seqrepo/routes/sequence.py", line 24, in get
    return sr.sequences.fetch(seq_id, start, end), 200
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fastadir.py", line 148, in fetch
    with self._open_for_reading(path) as fabgz:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fastadir.py", line 243, in _open_for_reading
    return LockableFabgzReader(path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fastadir.py", line 41, in __init__
    self.fabgz_reader = FabgzReader(path)
                        ^^^^^^^^^^^^^^^^^
  File "/Users/kferrite/dev/biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fabgz.py", line 64, in __init__
    self._fh = FastaFile(filename)
               ^^^^^^^^^^^^^^^^^^^
  File "pysam/libcfaidx.pyx", line 121, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 181, in pysam.libcfaidx.FastaFile._open
OSError: error when opening file `/Users/kferrite/dev/biocommons.seqrepo/seqrepo/2021-01-29/sequences/2016/0824/050304/1472014984.5124342.fa.bgz`

The log message comes from:
https://github.com/pysam-developers/pysam/blob/f386240783a8515cf8334324136bc1f03ddcf36b/htslib/faidx.c#L579

The open call fails when it hits the max file descriptor limit (errno EMFILE):
https://github.com/pysam-developers/pysam/blob/f386240783a8515cf8334324136bc1f03ddcf36b/htslib/hfile.c#L635

@theferrit32
Contributor Author

Here's a snippet that consistently (for me) replicates CPython not immediately freeing objects when the reference count reaches zero. If you remove the lru_cache from the snippet, it still replicates sometimes, but not every time. Whether it replicates may have something to do with how complicated the graph of references is; this is just a small example.

https://gist.github.com/theferrit32/7080f884d27e84701239aacb0f54ad82
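For illustration only (this is not the linked gist), a simplified sketch of how a cached reference can keep a file descriptor open after the caller drops its own reference:

import functools
import tempfile

class Holder:
    def __init__(self, path):
        self.fh = open(path, "rb")  # acquires a file descriptor

    def __del__(self):
        self.fh.close()

@functools.lru_cache(maxsize=128)
def open_holder(path):
    return Holder(path)

path = tempfile.NamedTemporaryFile(delete=False).name
for _ in range(10):
    h = open_holder(path)  # the cache returns the same Holder object each time
    del h                  # refcount stays above zero: the cache still holds a reference
# The descriptor stays open until the cache entry is evicted or
# open_holder.cache_clear() is called.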

@davmlaw

davmlaw commented Sep 9, 2023

Is there a way to run the server so requests are handled by processes rather than threads? That would add overhead, but it could sidestep these problems if the processes are allowed to die after a while.

It looks like your caching relies on being in the same thread; you could potentially fix that by moving the cache to Redis.

I haven't looked at how you are doing things, but running out of resources due to DB connections reminds me of 'connection pooling': can you reuse the same ones? If you stick with threads and multithreaded sharing isn't allowed, you could pool per thread, perhaps using threading.local.
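A minimal sketch of per-thread pooling with threading.local; the helper name and seqrepo path are assumptions:

import threading

from biocommons.seqrepo import SeqRepo

_local = threading.local()

def get_seqrepo(root_dir="/usr/local/share/seqrepo/2021-01-29"):
    """Return a SeqRepo instance created at most once per thread."""
    sr = getattr(_local, "seqrepo", None)
    if sr is None:
        sr = SeqRepo(root_dir)
        _local.seqrepo = sr
    return sr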

@reece
Member

reece commented Sep 12, 2023

Is there a way to run the server so requests are handled by processes rather than threads? That would add overhead, but it could sidestep these problems if the processes are allowed to die after a while.

Yes. It depends on the WSGI server one chooses; gunicorn, for example, supports process workers.
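For example, a process-worker invocation might look like this; the WSGI module path here is a placeholder, not the actual seqrepo-rest-service entry point:

$ gunicorn --workers 4 --threads 1 "your_wsgi_module:app"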

However, I think the real problem is still in seqrepo. It is reasonable to want to use threading with seqrepo, so another app that has nothing to do with the rest service could trigger this bug in principle.

@reece
Member

reece commented Sep 12, 2023

Seems to me that the issue is entirely with seqrepo. See biocommons/biocommons.seqrepo#112 (comment).

Can we close this issue and #15?

@reece
Member

reece commented Sep 15, 2023

See conversation in #15. Closing.

@reece reece closed this as completed Sep 15, 2023
@theferrit32
Contributor Author

@reece this issue still arises, but I think we've narrowed down the cause. I will open another ticket here because I think this still needs some resolution in the rest service itself, probably along the lines of PR #15, which moves the SeqRepo object reference from request-local to process-local in order to reduce the number of objects constructed.

reece added a commit that referenced this issue Nov 13, 2023
Limit number of SeqRepo instances