Change Select JudgeServerLogic #391

Open. rrutwik wants to merge 1 commit into base: master


@rrutwik rrutwik commented Nov 17, 2021

Change the judge server selection logic to prevent a deadlock.

servers = JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number")
servers = [s for s in servers if s.status == "normal"]
for server in servers:  # this throws a deadlock error if the row order changes because another thread updates task_number

Change Select JudgeServerLogic to prevent deadlock
@Beichi-CHs
Contributor

How is the deadlock caused?

@rrutwik
Author

rrutwik commented Nov 20, 2021

This query, when iterated with "for server in servers:":

JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number")

It results in a deadlock if the row order changes because another thread updates task_number.
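
To make the claim above concrete, here is a minimal sketch in plain Python (no database involved) of how sorting by a mutable column can give two workers opposite visiting orders, while sorting by a stable key cannot. The row dictionaries, the `lock_order` helper, and the use of `id` as the stable key are assumptions made for illustration; they are not taken from the PR.

```python
def lock_order(rows, key):
    """Return row ids in the order a query sorted by `key` would visit them."""
    return [row["id"] for row in sorted(rows, key=lambda row: row[key])]


# Hypothetical state seen by worker A.
seen_by_a = [{"id": 1, "task_number": 2}, {"id": 2, "task_number": 3}]
# Hypothetical state seen by worker B, after another thread raised
# task_number on server 1 from 2 to 4.
seen_by_b = [{"id": 1, "task_number": 4}, {"id": 2, "task_number": 3}]

# Sorting by the mutable column: the two workers visit rows in opposite order.
order_a = lock_order(seen_by_a, "task_number")
order_b = lock_order(seen_by_b, "task_number")

# Sorting by a stable key: both workers visit rows in the same order.
stable_a = lock_order(seen_by_a, "id")
stable_b = lock_order(seen_by_b, "id")

print(order_a, order_b)    # opposite orders
print(stable_a, stable_b)  # identical orders
```

If each worker takes row locks in its visiting order, opposite orders are the classic precondition for a deadlock, which is why acquiring locks in a consistent order is a common remedy. Whether that is exactly what happens inside the real query depends on the database's behaviour, so treat this as an illustration of the argument rather than a reproduction.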

@Beichi-CHs
Contributor

filter(is_disabled=False, last_heartbeat__gt=health_time).annotate(percent=ExpressionWrapper((1.0000 * F('task_number')) / F('cpu_core'), output_field=FloatField())).order_by("percent")

Will this select the judge server that is the fastest and has the fewest tasks?

@rrutwik
Author

rrutwik commented Nov 20, 2021

Yes, the one with the lowest ratio of current total tasks to cores.
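
As a rough illustration of that ordering, the sketch below ranks servers by task_number / cpu_core in plain Python. The `Server` dataclass, the `rank_by_load` helper, and the sample numbers are hypothetical; only the field names `task_number` and `cpu_core` come from the query quoted above.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    task_number: int
    cpu_core: int


def rank_by_load(servers):
    """Sort servers by tasks per core, lowest first (mirrors order_by("percent"))."""
    return sorted(servers, key=lambda s: s.task_number / s.cpu_core)


# Hypothetical sample data.
servers = [
    Server("a", task_number=4, cpu_core=8),  # ratio 0.5
    Server("b", task_number=3, cpu_core=4),  # ratio 0.75
    Server("c", task_number=1, cpu_core=8),  # ratio 0.125
]

ranked = rank_by_load(servers)
chosen = ranked[0]
print([s.name for s in ranked])
```

Note that the ratio measures load per core, not raw speed or raw task count: in this sample, server "a" has more tasks than "b" but still ranks ahead of it because it has twice the cores.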

@virusdefender
Contributor

Thank you for your contribution, but I still do not understand the reason for the deadlock. Could you give me more information about it, for example the database deadlock log, the Django error log, and how to reproduce the bug?

@rrutwik
Author

rrutwik commented Nov 21, 2021

The full error log was in the dramatiq logs, but I think only the 10 most recent files are kept.
Here is an excerpt from the gunicorn logs that was caused by this:
DETAIL: Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2834.
Process 2834 waits for ShareLock on transaction 26060303; blocked by process 1859.
Process 1859 waits for ShareLock on transaction 26060317; blocked by process 2656.
Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2904.
HINT: See server log for query details.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
psycopg2.extensions.TransactionRollbackError: deadlock detected
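
One way to read the DETAIL text in the log above is as a wait-for graph. The sketch below parses the quoted text with a regular expression and walks the "blocked by" edges to show that the four processes form a cycle. The helper names `wait_for_edges` and `find_cycle` are made up for this illustration, and the regex is only assumed to fit this particular message.

```python
import re

# DETAIL text copied from the log above.
DETAIL = (
    "Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 "
    "of database 16384; blocked by process 2834. "
    "Process 2834 waits for ShareLock on transaction 26060303; "
    "blocked by process 1859. "
    "Process 1859 waits for ShareLock on transaction 26060317; "
    "blocked by process 2656. "
    "Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation "
    "16635 of database 16384; blocked by process 2904."
)


def wait_for_edges(detail):
    """Map each waiting process id to the process id that blocks it."""
    pattern = re.compile(r"Process (\d+) waits for .*?; blocked by process (\d+)\.")
    return {int(waiter): int(blocker) for waiter, blocker in pattern.findall(detail)}


def find_cycle(edges, start):
    """Follow blocked-by edges from `start`; return the cycle as a list, or None."""
    path = [start]
    current = edges.get(start)
    while current is not None and current not in path:
        path.append(current)
        current = edges.get(current)
    if current is None:
        return None
    return path[path.index(current):] + [current]


edges = wait_for_edges(DETAIL)
cycle = find_cycle(edges, 2904)
print(" -> ".join(str(pid) for pid in cycle))
```

Each process in the chain is waiting on the next, and the last one waits on the first, which is the condition PostgreSQL reports as "deadlock detected".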

To reproduce this bug, you need 3 to 4 (8-core) judge servers, and then flood the system with 30 to 60 submissions per second. You should be able to find this error in the dramatiq logs. Two judge servers would probably be enough, but 3 to 4 is better, because ordering by task_number will then produce more errors.

This may help: https://stackoverflow.com/a/42731706

@rrutwik rrutwik closed this Apr 11, 2022
@virusdefender virusdefender reopened this May 9, 2022
@rrutwik
Author

rrutwik commented Jul 23, 2022

@virusdefender Hi, are there any updates on this PR?
