Check locked tables before certain actions #5047
base: production
Conversation
See comments on the commit (didn't know you intended to make a PR, my bad)
from django.db import connection
from time import sleep


def check_locked_tables(table_names=None):
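For context, here is a minimal sketch of what such a check could look like on MySQL/MariaDB, where `SHOW OPEN TABLES WHERE In_use > 0` lists tables with held or pending locks. The `cursor` parameter and the `wait_for_unlock` helper are my own illustration (the PR presumably uses `django.db.connection.cursor()` directly), so treat this as a sketch, not the PR's actual code:

```python
from time import sleep


def check_locked_tables(cursor, table_names=None):
    """Return the set of tables currently in use, optionally filtered to table_names."""
    # Each result row is (database, table, in_use, name_locked); In_use > 0 means
    # at least one lock, or one pending lock request, exists on the table.
    cursor.execute("SHOW OPEN TABLES WHERE In_use > 0")
    locked = {row[1] for row in cursor.fetchall()}
    if table_names is not None:
        locked &= set(table_names)
    return locked


def wait_for_unlock(cursor, table_names, retries=5, delay=1.0):
    """Poll until the given tables are free, or give up after `retries` attempts."""
    for _ in range(retries):
        if not check_locked_tables(cursor, table_names):
            return True
        sleep(delay)
    return False
```

The polling loop is one plausible use of the `sleep` import in the diff: a problematic action (e.g. a workbench upload) could call `wait_for_unlock` first and bail out if the tables never free up.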
Okay, say you start the workbench when no table is being used: wouldn't it still block the other transactions that want to update some records? A classic example would be the workbench matching/uploading collection objects while some other person is making collection objects. The other user is guaranteed to get a failure. Their save tries a full table lock, and that will never proceed without the workbench committing.
This doesn't solve the problem at all. At best, it avoids a deadlock. Figure out why it does. I don't think this solution is good.
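The distinction being drawn here (blocking until the workbench commits, rather than deadlocking) can be modeled with a toy sketch. This is entirely my own illustration: `threading.Lock` stands in for a full-table write lock, and the `acquire` timeout stands in for the database's lock wait timeout.

```python
import threading
import time

table_lock = threading.Lock()  # stands in for a full-table write lock


def workbench_transaction(hold_seconds=0.5):
    # The workbench holds the lock for its entire transaction, not per row.
    with table_lock:
        time.sleep(hold_seconds)


def other_user_save(timeout=0.1):
    # Another user's save: it waits, then gives up after the lock wait timeout.
    acquired = table_lock.acquire(timeout=timeout)
    if acquired:
        table_lock.release()
    return acquired


wb = threading.Thread(target=workbench_transaction)
wb.start()
time.sleep(0.1)            # let the workbench grab the lock first
print(other_user_save())   # False: blocked until the workbench "commits"
wb.join()
print(other_user_save())   # True: lock is free after the transaction ends
```

There is no deadlock here (only one lock exists), but the second user still fails every time the workbench transaction is in flight, which is the reviewer's point.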
Ya, what I really want to figure out is how this issue might be causing Django to crash.
I might be open to adjusting database parameters in a case like this. This seems to be a rare case, so clients running their own databases might still run into this issue, although I don't think there have been any reports of it. Right now, it's more important to make sure the Specify Cloud instances don't run into this problem. Recreating the problem on the test server might be a good idea, to see if we can crash a Django instance.
I remember you saying there were 20 connections. I suspect they were all updates being queued at the database. We also have just a dual-core CPU on the Canada server. I was going to test this out last night, but I ran out of time. The plan was to upload a workbench dataset on sp7demofish-canada while making tree merge requests. If it is just a Django issue (20 connections overloading), then the size of the database wouldn't matter. If the Django instance crashes, we'll know that it is because of connection overload.
To attempt to replicate this issue, I spun up a test server with the following specifications:

Setup Details
Hardware: CPU: 1 AMD vCPU (I believe a third-generation AMD EPYC processor, based on DO documentation)
User setup: Alice is both a Sp7 and Sp6 admin.

Bug in Production
Description: Uploading a workbench of 5000 Collection Objects causes the system to crash. "Crash" is defined as the root path "/" not being accessible to any user, and unresponsiveness through the user interface for the user uploading through the workbench. The crash happens at an indeterminate time during the upload; however, it has consistently happened three times, with no crashes when a workbench upload is not occurring. The system recovers within 20-40 minutes.

Hypothesis
Some sort of database timeout or lock is causing the crash. The exact link between the database lock and Django crashing is uncertain.

Testing on Testing Server
The workbench upload by Alice contains the following fields with arbitrary values (n=3000):

Baseline: Alice uploads 3000 CO records.
CPU goes to 70%-80%, CPU load goes to 1. Memory nominal.

Test 1: Alice uploads 3000 CO records. Bob attempts to add 1 CO record after the workbench has locked the record.
Alice uploads CO 1-3000 through the workbench. After the workbench progress bar shows that the workbench has moved past record 100, Bob attempts to save CO 1. Bob experiences a loading screen for 50 seconds, followed by an error message indicating that a database lock prevented the record from saving. Access to the root path by either user is unaffected. Alice experiences no delays or issues with the workbench upload. The root page does not hang. CPU load spikes above 1 and CPU % goes to 90%.

Test 2: Alice uploads 3000 CO records. Bob attempts to add 3 CO records sequentially after the workbench has locked the record.
Same results as above. Bob experiences a loading screen for 50 seconds for the first record, 51 seconds for the second, and 52 seconds for the third. In other words, the loading is not additive; the one-second difference is the time it takes me to click "save" between the different tabs.

Test 3: Alice uploads 3000 CO records. Bob attempts to add 9 CO records sequentially after the workbench has locked the record.
Bob experiences a loading screen for about 1 min 20 s, but still 50 seconds for the first record. Requests for the root page hang for both users. Alice's upload popup does not accurately display the progress (it gets stuck, and the time remaining ticks up instead of down). However, the upload ultimately succeeds, unknown to the user, which is revealed after pressing stop. Bob experiences a 504 Gateway Timeout for the last of the 9 records. Load spikes to over 1 and CPU utilization hits 100%.

Test 4: Alice uploads 3000 CO records. Bob attempts to add 1 CO outside of the range already uploaded by Alice's workbench.
Alice uploads CO 1-3000 through the workbench. When the progress bar reaches record 100, Bob attempts to upload CO 2000. Bob successfully saves the record. Alice's workbench stops with an error when it reaches row 2000. Therefore, Alice's workbench is not locking the entire CO table at the onset.

Questions Remaining
Very thorough testing, and different cases, with decent granularity for someone unaware of the code. I like you! I'm not surprised by most of the results, other than Test 4. Notice how it takes 50 seconds for the timeout on the CO form. That's just the time after which the database gives up on acquiring a lock (a configurable timeout).

Response for first
Do you mean that there were locking issues in the workbench itself, when nobody was active, in the current results? Or did you mean that there were locking issues in the original crashes? I noticed that there was a tree merge operation ongoing when the workbench was uploading/validating (can't remember which; I think validating). A tree merge causes an implicit full-table lock, so workbench tree inserts will cause a lock issue. This is based on me looking around in the logs weeks ago (when this issue was originally brought up). Was there another instance of a locking issue other than that?

Response for second
Possibly. It can reasonably be understood if the CPU is overloaded. Are you running specify7 and specify7-worker in Docker? Is the database also running on the same server? Since you have just 1 vCPU, can you set the specify7-worker (and, if the database is present, the COMBINED) CPU limit to <= 0.8? If the database is on a different server, just set the worker's CPU limit to 0.8. Then your CPU load shouldn't be more than 0.8, and you should hopefully be able to get to the "/" page every time while the wb is progressing. If you are able to do so, then we've effectively proven that this is a CPU issue. Mind that this will make the workbench a bit slower (less CPU).

Response for third
For the variability of time at three vs. nine records, I suspect that's just because of the HTTP/1.1 limitation, which prevents more than 6 concurrent web requests. If you try 5, theoretically you wouldn't have this lag. Or use a different browser (I'd wager that you were using the same browser). If you are curious, open the network tab in the debug console, and you'll see requests being stalled (not waiting for a response) by the browser.

Regarding Test 4, what's the error that you get? Yes, the workbench doesn't "lock" records anywhere, at least not at a code level (i.e., we don't do a "select for update" anywhere in the workbench). BUT just because we don't, doesn't mean the database doesn't. Anyway, I don't think you'd have gotten a lock error; it was probably a uniqueness rule error (depending on whether you are hardcoding catalog numbers or using the wildcard). I'll need to debug this particular case myself. There really should have been a conflict: every CO creation acquires a full-table lock, in all cases, so everything should be conflicting, which is why this is surprising. Can you send all the crash reports? Thanks for doing all this thorough testing!
Checking for tree merges during prod incidents
I ran the following query in spauditlog to check for tree operations during the incident windows. The duration represents the time that the root path was inaccessible.

Action = "Tree Move" OR "Tree Merge" OR "Tree Synonymize"

Records are returned, and I sorted them by date so I could scroll through them. I don't see any tree operations occurring on any of those three dates. I'm not disputing the fact; it's just that if there were tree operations, I don't have visibility into them through this tool. During these incidents, other users were unable to access the instance through any path. I am the person uploading through the workbench, and I experience the same symptoms displayed in the test cases: the time remaining ticks up and the UI displays no progress. My hunch is that one other user experiences a lock (the cause) and then all other users experience an unresponsive server (the effect). The workbench becomes usable again at the same time as access to the root path is restored for all users (the workbench user is also unable to reach the root path in the meantime).

Confirming database lockout time
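For reference, the tree-operation check described above could be expressed as a single parameterized query against the audit log. The table and column names here (`spauditlog`, `action`, `timestampcreated`) are my guesses at the schema, and the window bounds are placeholders, so treat this as a sketch rather than the exact query that was run:

```python
TREE_ACTIONS = ("Tree Move", "Tree Merge", "Tree Synonymize")


def tree_ops_query(window_start, window_end):
    """Build a parameterized query for tree operations inside an incident window."""
    placeholders = ", ".join(["%s"] * len(TREE_ACTIONS))
    sql = (
        "SELECT * FROM spauditlog "
        f"WHERE action IN ({placeholders}) "
        "AND timestampcreated BETWEEN %s AND %s "
        "ORDER BY timestampcreated"
    )
    return sql, (*TREE_ACTIONS, window_start, window_end)


sql, params = tree_ops_query("<incident start>", "<incident end>")
```

Using `BETWEEN` on the incident window avoids having to sort and scroll through all returned records by hand.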
Constraining worker and db containers to under 0.8 CPU collectively
I set the worker limit to 0.4 and the db container limit to 0.4. I confirmed that the limit functionality within docker compose was working by setting a memory limit and then running

With the limits in place, I redid the test with Bob attempting to save 9 CO records sequentially. This time, Bob experiences a 504 Gateway Timeout starting on the 4th save attempt (4/9), with all of the remaining save attempts also reporting a 504 Gateway Timeout. The first 3 save attempts still display the

I replicated this test 3 times to reduce the likelihood of flukes and to get consistent results. The 4th save attempt is always the one to report a 504 on Bob's side. Alice's workbench UI was unable to recover in tests 1 and 3; however, in test 2 it unexpectedly resumed as normal, with the expected time decreasing and the progress bar operational. Based on Digital Ocean's metrics, server load no longer exceeded 1 in any of the 3 replicates.

Clarifying workbench error in test 4
Apologies: yes, it is a uniqueness rule error, not a table lock error. This (to me) is the expected behaviour; Alice's workbench is unable to upload a record that Bob has already saved.

Crash and Error Logs
You're on point with that! However, it's quite a special circumstance. I was able to recreate the 504 issues when uploading CO records and saving a new CO with autonumbering. The issue here is that the CO save causes a write lock request on the "Collection" table, which blocks any subsequent read requests (every request to "/" needs a collection query). However, the write lock isn't granted yet (because of the workbench; it will be granted only when the workbench finishes its transaction). That's why every web request gets blocked.

Setting lock_wait_timeout to a low value resolves the issue (temporarily)
This causes the CO save to time out early (allowing other web requests to proceed). This is definitely a temporary solution, though. Getting rid of full-table locking (or at least taking read locks instead of write locks) would be more ideal.
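For what it's worth, MySQL/MariaDB expose two relevant server variables: `innodb_lock_wait_timeout` governs InnoDB row and table lock waits (its default of 50 seconds matches the ~50-second delays observed in the tests above), while `lock_wait_timeout` governs metadata lock waits. Which one the comment refers to is ambiguous, so this sketch lowers both for the current session; the helper is my own illustration, not code from this PR:

```python
def lower_lock_timeout_sql(seconds=5):
    """Statements to shorten lock waits for the current session (sketch)."""
    # innodb_lock_wait_timeout: InnoDB row/table lock waits (default 50 s,
    # which matches the ~50 s save delays observed in the tests).
    # lock_wait_timeout: metadata lock waits (default is effectively unlimited).
    return [
        f"SET SESSION innodb_lock_wait_timeout = {int(seconds)}",
        f"SET SESSION lock_wait_timeout = {int(seconds)}",
    ]


stmts = lower_lock_timeout_sql(5)
```

In a Django view, these could be executed through `connection.cursor()` before the risky save; `SET SESSION` scopes the change to one connection, so the workbench's own long transaction is unaffected.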
I am okay with the
Hmm, maybe, and likely. We should be able to utilize
Fixes #
This might be a possible solution to the Beaty instance problem, where locked tables are causing database connection timeouts, which eventually causes Django to crash. This solution adds a function that checks for database table locks. The check can be performed before problematic actions run, such as workbench uploads.
Please comment on which actions might be good to include in this PR and which tables should be checked.
Checklist
and self-explanatory (or properly documented)
Testing instructions