Check locked tables before certain actions #5047

Draft · wants to merge 3 commits into base: production

Conversation

acwhite211 (Member)

Fixes #

This is a possible solution to the Beaty instance problem where locked tables cause database connection timeouts, which eventually cause Django to crash. The PR adds a function that checks for database table locks; this check can be run before problematic actions, such as workbench uploads.

Please comment on which actions would be good to include in this PR and which tables should be checked.
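
For illustration, a minimal sketch of what such a check could look like against MySQL. This is an assumption about the approach, not the code in the linked commit; it uses SHOW OPEN TABLES, which reports tables whose handles are currently in use:

from django.db import connection

def check_locked_tables(table_names=None):
    # SHOW OPEN TABLES returns (database, table, in_use, name_locked);
    # in_use > 0 means a table lock is held or pending on that table.
    with connection.cursor() as cursor:
        cursor.execute("SHOW OPEN TABLES WHERE In_use > 0")
        locked = {row[1].lower() for row in cursor.fetchall()}
    if table_names is None:
        return locked
    return locked & {name.lower() for name in table_names}

A caller could then skip or delay an action when, say, "collectionobject" shows up in the returned set.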

Checklist

  • Self-review the PR after opening it to make sure the changes look good
    and self-explanatory (or properly documented)
  • Add automated tests
  • Add relevant issue to release milestone

Testing instructions

@realVinayak (Collaborator) left a comment

See the comments on the commit (didn't know you intended to make a PR, my bad)

4d92d16#r143561584

from django.db import connection
from time import sleep

def check_locked_tables(table_names=None):
Collaborator

Okay, say you start the workbench when no table is in use: wouldn't it still block other transactions that want to update some records? A classic example would be the workbench matching/uploading collection objects while some other person is creating collection objects. The other user is guaranteed to get a failure: their save attempts a full table lock, which will never proceed until the workbench commits.

This doesn't solve the problem at all. At best, it avoids a deadlock. Figure out why that is. I don't think this solution is good.

Member Author

Ya, what I really want to figure out is how this issue might be causing Django to crash.

Member Author

I might be open to adjusting database parameters in a case like this. It seems to be a rare case, so clients running their own databases might still run into this issue, although I don't think there have been any reports of it. Right now, it's more important to make sure the Specify Cloud instances don't run into this problem. Recreating the problem on the test server might be a good idea, to see if we can crash a Django instance.

Collaborator

I remember you saying there were 20 connections. I suspect they were all updates being queued at the database. We also have just a dual core on the Canada server. I was going to test this out last night, but I ran out of time. I was going to upload a workbench dataset on sp7demofish-canada while making tree merge requests. If it is just a Django issue (20 connections overloading), then the size of the database wouldn't matter. If the Django instance crashes, we'll know that it is because of connection overload.
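
As an aside, a quick way to see how many connections are open and what they are doing, sketched from a Django shell (this assumes the configured MySQL user can read information_schema.processlist):

from django.db import connection

with connection.cursor() as cursor:
    # Each row is one client connection; COMMAND/STATE/INFO show whether it is
    # an update waiting on a lock, and TIME shows for how long (seconds).
    cursor.execute(
        "SELECT id, command, time, state, info "
        "FROM information_schema.processlist ORDER BY time DESC"
    )
    for row in cursor.fetchall():
        print(row)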

@mpitblado commented Jul 30, 2024

To attempt to replicate this issue, I spun up a test server with the following specifications:

Setup Details

Hardware:

CPU: 1 AMD vCPU (I believe third generation AMD EPYC processor based on DO documentation)
Memory: 2GB
OS: Debian 12 x64
Specify version: v7.9.6.2

User setup:

Alice: Alice is both Sp7 and Sp6 admin.
Bob: Bob is a collection admin

Bug in Production

Description

Uploading a workbench of 5000 Collection Objects causes the system to crash. "Crash" here means the root path "/" is not accessible to any user and the user interface is unresponsive for the user performing the workbench upload. The crash happens at an indeterminate time during the upload, but it has consistently happened three times, with no crashes when a workbench upload is not occurring. The system recovers within 20-40 minutes.

Details

  • Checking the Sp7 audit logs shows that no other agents within the same discipline were active at the time of the most recent crash. One agent within the same division was active. In total, two other users were actively making changes on the day of the crash, though the exact timing is not available without more specific logs.
  • The workbench contains a field mapped to the Agent table, which is shared across disciplines/divisions. Values appear throughout the 5000 records, so the crash cannot be pinned to the workbench reaching a particular row in which this field has a value.
  • The workbench upload itself ultimately succeeds: after the crash, going back into the collection shows the upload as successful. During the crash, however, the UI presents the upload as unsuccessful, or is generally indecipherable to the user performing it.

Hypothesis

Some sort of database timeout or lock is causing the crash. The exact link between the database lock and Django crashing is uncertain.

Testing on the Test Server

The workbench upload by Alice contains the following fields with arbitrary values (n=3000):

  • catalogNumber
  • Text1-Text8 on CO table
  • Locality Name

Baseline: Alice uploads 3000 CO records.

CPU goes to 70%-80%, CPU load goes to 1. Memory nominal.

Test 1: Alice uploads 3000 CO records. Bob attempts to add 1 CO record after the workbench has locked the record.

Alice uploads CO 1-3000 through the workbench. After the progress bar shows that the workbench has moved past record 100, Bob attempts to save CO 1. Bob experiences a loading screen for 50 seconds, followed by an error message indicating that a database lock prevented the record from saving. Access to the root path by either user is unaffected, and Alice experiences no delays or issues with the workbench upload. CPU load spikes above 1 and CPU % goes to 90%.

Test 2: Alice uploads 3000 CO records. Bob attempts to add 3 CO records sequentially after the workbench has locked the record.

Same results as above. Bob experiences a loading screen for 50 seconds for the first record, 51 seconds for the second, and 52 seconds for the third. In other words, the loading is not additive; the one-second differences are just the time it takes to click "save" between the different tabs.

Test 3: Alice uploads 3000 CO records. Bob attempts to add 9 CO records sequentially after the workbench has locked the record.

Bob experiences a loading screen for about 1 min 20 s, still 50 seconds for the first record. Requests for the root page hang for both users. Alice's upload popup does not accurately display the progress (it gets stuck, and the time remaining ticks up instead of down). However, the upload ultimately succeeds without the user knowing; this is only revealed after pressing Stop. Bob experiences a 504 Gateway Timeout for the last of the 9 records. Load spikes to over 1 and CPU utilization hits 100%.

Test 4: Alice uploads 3000 CO records. Bob attempts to add 1 CO outside of the range already uploaded by Alice's workbench.

Alice uploads CO 1-3000 through the workbench. When the progress bar reaches record 100, Bob attempts to upload CO 2000. Bob successfully saves the record, and Alice's workbench stops with an error when it reaches row 2000. Therefore, Alice's workbench is not locking the entire CO table at the outset.

Questions Remaining

  • Above, I intentionally caused a lock by uploading a conflicting CO. However, something else must be causing the lock in production, since no user was trying to upload COs to the collection in question. What is that cause?
  • Could the variability in the time the instance takes to recover (20-40 mins) be explained by the time remaining in the workbench upload? Ultimately the workbenches do get there; the UI just doesn't show it and becomes non-interactive. Without errors, these 5000-record sets usually take about 40 mins to complete.
  • What is it about 9 conflicts compared to 3 that makes the system behave differently? Both the root path error and the workbench becoming non-responsive appear in the test with 9 records but not with 3.

@realVinayak (Collaborator) commented Jul 30, 2024

Very thorough testing, with different cases and decent granularity for someone unfamiliar with the code. I like you!

I'm not surprised by most of the results, other than Test 4.

Notice how it takes 50 seconds for the CO form to time out. That's just how long the database waits before giving up on acquiring a lock (innodb_lock_wait_timeout). You can check it by running select @@innodb_lock_wait_timeout on your db server. It will, very very likely, be 50 seconds.
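
As a quick illustration, both timeout settings can also be read from a Django shell (a sketch, assuming the default connection points at the MySQL server in question; the same queries work directly in the MySQL client):

from django.db import connection

with connection.cursor() as cursor:
    # lock_wait_timeout governs metadata/table-lock waits;
    # innodb_lock_wait_timeout governs InnoDB row-lock waits.
    cursor.execute("SELECT @@lock_wait_timeout, @@innodb_lock_wait_timeout")
    table_lock_timeout, row_lock_timeout = cursor.fetchone()
    print(table_lock_timeout, row_lock_timeout)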

Response for first

Do you mean that there were locking issues in the workbench itself, when nobody else was active, in the current results? Or did you mean that there were locking issues in the original crashes? I noticed that there was a tree merge operation ongoing while the workbench was uploading/validating (can't remember which, I think validating). A tree merge causes an implicit full-table lock, so workbench tree inserts will hit a lock issue. This is based on me looking around in the logs weeks ago (when this issue was originally brought up).

Was there another instance of a locking issue other than that?

Response for second

Possibly. It can reasonably be understood if the CPU is overloaded. Are you running specify7 and specify7-worker in docker? Is the database also running on the same server? Since you have just 1 vCPU, can you set the specify7-worker, and, if the database is present, the COMBINED CPU limit to be <= 0.8? If the database is on a different server, just set the worker's CPU limit to 0.8. Then your CPU load shouldn't be more than 0.8, and you should hopefully be able to get to the "/" page every time while the workbench is progressing. If you are able to do so, then we've effectively proven that this is a CPU issue. Mind that this will make the workbench a bit slower (less CPU).

Response for third

For the variability of time at 3 vs 9 records, I suspect that's just because of the HTTP/1.1 limitation that prevents more than 6 concurrent web requests. If you try 5, theoretically you wouldn't see this lagging. Or use a different browser (I'd wager that you were using the same browser). If you are curious, open the network tab in the debug console and you'd see requests being stalled (not waiting for a response) by the browser.

Regarding Test 4, what's the error that you get? Yes, the workbench doesn't "lock" records anywhere, at least not at a code level (i.e., we don't do a "select for update" anywhere in the workbench). BUT just because we don't doesn't mean the database doesn't. Anyway, I don't think you'd have gotten a lock error; it was probably a uniqueness rule error (depending on whether you are hardcoding catalog numbers or using the wildcard). I'll need to debug this particular case myself. There really isn't (or shouldn't be, yet) such a thing as a specifically conflicting CO: every CO creation acquires a full-table lock, in all cases, so everything should be conflicting, which is why this is surprising.

Can you send all the crash reports? Thanks for doing this thorough testing!

@mpitblado

Checking for tree merges during prod incidents

I ran the following query in spauditlog to check for tree operations during the incident windows. The duration represents the time that the root path is inaccessible.

Prod incident 1: June 18 approx 11:13 PDT, duration approx 19 minutes
Prod incident 2: June 24 approx 13:42 PDT, duration approx 23 minutes
Prod incident 3: July 19 approx 12:49 PDT, duration approx 43 minutes

Action = "Tree Move" OR "Tree Merge" OR "Tree Synonymize"
Timestamp created (year) = 2024

Records are returned, and I sorted them by date so I could scroll through them. I don't see any tree operations occurring on any of those three dates. I'm not disputing the fact; just that if there were tree operations, I don't have visibility into them through the tool.

During these incidents, other users were unable to access the instance through any path. I am the person uploading through the workbench, and it shows the same symptoms as in the test cases: the time remaining ticks up and the UI displays no progress. My hunch is that one other user experiences a lock (the cause) and then all other users experience an unresponsive server (the effect). The workbench becomes usable again at the same time as access to the root path is restored for all users (the workbench user is also unable to go to / during this time).

Confirming database lockout time

lock_wait_timeout is set to 86400 in the database; however, innodb_lock_wait_timeout is indeed set to 50.

Constraining worker and db containers to under 0.8 CPU collectively

I set the worker limit to 0.4 and the db container limit to 0.4. I confirmed that limits set through docker compose were taking effect by setting a memory limit and then running docker stats --no-stream. Seeing that the memory limit had been successfully applied, I removed it and kept the CPU limits in place.

With the limits in place, I redid the test with Bob attempting to save 9 CO records sequentially. This time, Bob experiences a 504 Gateway Timeout starting on the 4th save attempt (4/9), with all of the remaining save attempts also reporting a 504 Gateway Timeout. The first 3 save attempts display the OperationalError "Lock wait timeout exceeded" message previously observed. Critically, both Bob and Alice are still unable to access the / page during the lockout; a 504 Gateway error is eventually returned for these hung root pages for both users. Bob and Alice operate in different browsers, with Bob using Safari and Alice using Firefox.

I replicated this test 3 times to reduce the likelihood of flukes and to get consistent results. The 4th save attempt is always the one to report a 504 on Bob's side. Alice's workbench UI was unable to recover in tests 1 and 3; however, in test 2 it unexpectedly resumed as normal, with the expected time decreasing and the progress bar operational.

Based on Digital Ocean's metrics, server load no longer exceeded 1 in any of the 3 replicates.

Clarifying workbench error in test 4

Apologies, yes it is a uniqueness rule error, not a table lock error. This (to me) is the expected behaviour. Alice's workbench is unable to upload a record that Bob has already saved.

Crash and Error Logs

  • Prod crash report June 24 2024, Paleo collection. Sent to the support email (there's a lot of content in the full crash report and I am unsure if any of it is sensitive).
  • timeout-error-bob.txt

@realVinayak (Collaborator)

My hunch is that one other user experiences a lock (the cause) and then all other users experience an unresponsive server (the effect)

You're on point with that! However, these are quite special circumstances. I was able to recreate the 504 issues when uploading CO records and saving a new CO with autonumbering. The issue here is that the CO save makes a write lock request on the "Collection" table, which blocks any subsequent read requests (every request to "/" needs a collection query). However, the write lock isn't granted yet (because of the workbench, it will only be granted when the workbench finishes its transaction). That's why every web request gets blocked.

Setting lock_wait_timeout to a low value resolves the issue (temporarily)

set global lock_wait_timeout = 10

This causes the CO save to time out early (allowing other web requests to proceed). This is definitely a temporary solution, though. Getting rid of full-table locking (or at least taking read locks instead of write locks) would be more ideal.

@mpitblado

I am okay with the lock_wait_timeout change being applied to production. I think that running that command requires elevated privileges, so it would have to be done on your end.

@realVinayak (Collaborator)

Hmm, maybe, and likely. We should be able to utilize init_command to make it a property of just your database (although the Canada server is small, so playing around wouldn't hurt).
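
For reference, a minimal sketch of how init_command could be applied per-database from Django's settings, assuming the mysqlclient backend (the database name and timeout value below are illustrative only):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "specify",  # hypothetical database name
        # ... HOST, USER, PASSWORD as usual ...
        "OPTIONS": {
            # Runs on every new connection: caps how long metadata/table-lock
            # requests wait for this application only, instead of changing the
            # global server default with SET GLOBAL.
            "init_command": "SET SESSION lock_wait_timeout = 10",
        },
    }
}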
