seqrepo exceptions when handling concurrent requests #12
Comments
Ideally this will be resolved by biocommons/biocommons.seqrepo#117. Alternatively, seqrepo-rest-service can add mutual exclusion around the entire SeqRepo object.
Unfortunately biocommons/biocommons.seqrepo#117 doesn't resolve this. There is something stateful happening when a FastaFile object is constructed and destructed, outside the scope of the in-memory struct itself. Opening a bunch of FastaFile objects on a file and closing them, with these operations interleaved in some random order, can fail with os/filesystem errors even if the FastaFile was not modified.

One workaround is to widen the scope of the SeqRepo object by storing it in a process-wide holder:

```python
class G(object):
    """
    This class contains nothing and is just a place that attributes can be stored in
    process memory, across request thread contexts
    """

g = G()
```

This reduces the amount of constructing and destructing of SeqRepo objects, which eliminates (as far as I can tell) the filesystem errors.
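A minimal sketch of this process-scope pattern follows; the `get_shared` helper and the factory argument are illustrative, not part of seqrepo-rest-service:

```python
import threading

class G(object):
    """Process-wide attribute holder, shared across request thread contexts."""

g = G()
_g_lock = threading.Lock()

def get_shared(name, factory):
    """Return a cached object stored on g, constructing it at most once per process."""
    with _g_lock:
        obj = getattr(g, name, None)
        if obj is None:
            obj = factory()
            setattr(g, name, obj)
        return obj

# Every request thread receives the same instance, so the underlying
# files are opened once per process instead of once per request.
first = get_shared("seqrepo", lambda: object())   # factory is a stand-in
second = get_shared("seqrepo", lambda: object())
print(first is second)  # True
```

The lock only guards construction, so the critical section here is tiny compared to locking every read.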
The above reproduction doesn't work for me. I believe that you're seeing this error, but it will be easier to figure out what's happening, and to prove that we fixed it, if we can reproduce it.
See #14 (now closed) for additional observations and context.
@theferrit32 @wlymanambry @davmlaw: I'd appreciate some help thinking through the strategy for solving this issue. We have two seqrepo-rest-service issues (this issue and the duplicate #14), a biocommons.seqrepo issue (biocommons/biocommons.seqrepo#112), and corresponding PRs (#15 and biocommons/biocommons.seqrepo#117). I think all of this discussion is really around a single topic: concurrency issues with the seqrepo stack.

The observation is that seqrepo-rest-service fails after a period of use. @wlymanambry noticed the large number of files left open, so the current conjecture is that we're exceeding the maximum number of open file descriptors per process (often 1024).

Importantly, the issue is not thread safety (i.e., corruption or blocking between threads), but rather concurrency due to exceeding the number of fds. I don't yet see any evidence of a lack of thread safety. The files are all opened read-only, so I would not expect any fs-level concurrency issues.

@theferrit32 submitted two PRs, one for seqrepo-rest-service and one for biocommons.seqrepo. My instinct is that the issue is squarely with seqrepo, and that if we solve that well, the s-r-s issue will disappear. I'd like to get to a point where we can demonstrate the issue reliably, and then show that we can reliably solve it. Here are some experiments we can try:

I'd appreciate comments and volunteers to try the above experiments.
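One low-effort experiment along these lines is simply counting the process's open descriptors while the service runs, to confirm the fd-exhaustion conjecture. A sketch (the Linux `/proc` layout is assumed; the fallback probes descriptor numbers directly):

```python
import os

def open_fd_count():
    """Count file descriptors currently open in this process."""
    try:
        # Linux: each open descriptor appears as an entry under /proc/self/fd.
        return len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        # Portable fallback: probe a fixed range of descriptor numbers.
        count = 0
        for fd in range(4096):
            try:
                os.fstat(fd)
                count += 1
            except OSError:
                pass
        return count

before = open_fd_count()
handles = [open(os.devnull) for _ in range(10)]
print(open_fd_count() - before)  # 10
for h in handles:
    h.close()
```

Watching this number climb toward the ulimit during a load test would confirm the conjecture without touching seqrepo internals.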
@reece I finally figured out what this is. It's not the same thing that was in clingen-data-model/architecture#548; that issue is related to thread safety. This one is the open file limit (ulimit -n) combined with cpython not always immediately freeing an object when it is no longer referenced (I can replicate this sometimes with a made-up example). The exception message is not informative, so I initially thought it was related to issue 548. The error message from the C library, and the corresponding log message in the python/cython layer, show the open call failing when it hits the max file descriptors (errno EMFILE).
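For illustration, the EMFILE condition can be reproduced in isolation by lowering the soft descriptor limit (a Unix-only sketch; the `resource` module is unavailable on Windows):

```python
import errno
import os
import resource

# Lower the soft fd limit so the failure is quick to reach (Unix-only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

opened = []
try:
    while True:
        opened.append(open(os.devnull))
except OSError as exc:
    # The same condition the service hits: "Too many open files".
    print(exc.errno == errno.EMFILE)  # True
finally:
    for f in opened:
        f.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

Any code path that opens files without bounding how many stay open will eventually land here, regardless of thread safety.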
Here's a snippet that consistently (for me) replicates cpython not immediately freeing objects when the reference count reaches 0: https://gist.github.com/theferrit32/7080f884d27e84701239aacb0f54ad82
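The usual mechanism behind such delayed frees is a reference cycle: refcounting alone cannot reclaim it, so the object (and any file descriptor it holds) lives until the cycle collector runs. A small self-contained illustration (not the gist itself):

```python
import gc
import weakref

class Holder:
    pass

def make_cycle():
    a = Holder()
    b = Holder()
    a.partner = b
    b.partner = a  # reference cycle: neither refcount ever drops to 0
    return weakref.ref(a)

ref = make_cycle()
# The cycle is unreachable, but CPython's refcounting alone can't free it...
print(ref() is not None)  # True: object still alive
gc.collect()              # ...until the cycle collector runs.
print(ref() is None)      # True: now freed
```

If a FastaFile ends up in a cycle like this, its descriptor stays open for an unpredictable interval after the last reference is dropped.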
Is there a way to run the server so requests are handled by processes rather than threads? That would add overhead, but would sidestep the problem if the processes are allowed to die after a while. It looks like your caching relies on being in the same thread; you could potentially fix that by moving the cache to Redis. I haven't looked at how you are doing things, but running out of resources due to DB connections reminds me of connection pooling: can you reuse the same ones? If you stick with threads and multithreaded sharing isn't allowed, you could pool per thread.
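A per-thread pool along these lines can be sketched with `threading.local`; the `factory` argument here is a stand-in for whatever opens the real resource (a FastaFile, a DB connection, etc.):

```python
import threading

_tlocal = threading.local()

def get_handle(factory=lambda: object()):
    """Return a per-thread handle, creating one the first time each thread asks."""
    handle = getattr(_tlocal, "handle", None)
    if handle is None:
        handle = factory()
        _tlocal.handle = handle
    return handle

# Each thread sees its own handle; repeated calls in one thread reuse it.
results = []

def worker():
    results.append((get_handle(), get_handle()))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

same_within = all(a is b for a, b in results)
distinct_between = results[0][0] is not results[1][0]
print(same_within, distinct_between)  # True True
```

This avoids cross-thread sharing entirely, at the cost of one handle per worker thread, so the thread count must stay well under the fd limit.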
Yes, it depends on the WSGI server one chooses; gunicorn, for example, supports process workers. However, I think the real problem is still in seqrepo. It is reasonable to want to use threading with seqrepo, so in principle another app that has nothing to do with the rest service could trigger this bug.
Seems to me that the issue is entirely with seqrepo. See biocommons/biocommons.seqrepo#112 (comment). Can we close this issue and #15?
See conversation in #15. Closing.
@reece this issue still arises, but I think we've narrowed down the cause. I will open another ticket here because I think this still needs some resolution in the rest service itself, probably along the lines of PR #15, which moves the SeqRepo object to process scope.
Description
The same file exceptions related to seqrepo thread safety tracked in clingen-data-model/architecture#548 occur when running `seqrepo-rest-service`. When there are concurrent requests, the global seqrepo object in the flask worker raises exceptions, and once that happens, every future request that attempts to read from that file fails as well. Stopping and restarting the `seqrepo-rest-service` process resets it and temporarily resolves the problem, so the issue is in process state; there is no issue with the actual files on the filesystem.

The temporary solution to the above github issue that avoids modifying the seqrepo codebase was to add mutual exclusion to the object containing the SeqRepo object. This lets the application run concurrent threads except when they are executing code under that object, which is suboptimal because that is a fairly large critical section to make mutually exclusive.
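The mutual-exclusion workaround described above can be sketched as a coarse lock around a shared backend. The `fetch` signature and `FakeBackend` below are illustrative stand-ins; the real SeqRepo read API differs:

```python
import threading

class LockedSeqRepo:
    """Coarse-grained wrapper: every read serializes on one process-wide lock.

    This is the "large critical section" described above: concurrent request
    threads run freely except while any one of them is touching the backend.
    """

    def __init__(self, backend):
        self._backend = backend  # stands in for the real SeqRepo instance
        self._lock = threading.Lock()

    def fetch(self, alias, start=None, end=None):
        with self._lock:
            return self._backend.fetch(alias, start, end)

class FakeBackend:
    """Illustrative stand-in; not the real SeqRepo class."""
    def fetch(self, alias, start, end):
        return f"seq:{alias}:{start}-{end}"

repo = LockedSeqRepo(FakeBackend())
print(repo.fetch("NC_000001.11", 0, 10))  # seq:NC_000001.11:0-10
```

The trade-off is throughput: every read is serialized, so under load the lock becomes the bottleneck even though it prevents the failure.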
Steps to reproduce
Run the server from one shell, then send concurrent requests from a second shell.