RQ: implement reliable timeout #4305
Conversation
RQ's timeout functionality is called in the work horse process: https://github.com/rq/rq/blob/e43bce4467c3e1800d75d9cedf75ab6e7e01fe8c/rq/worker.py#L815
As suggested in rq/rq#323, the most reliable approach here would be to handle this in both places: a limit inside the work horse (as currently implemented) plus a fallback limit in the worker. This effectively implements a mechanism that resembles Celery's hard & soft limits. One problem I see: if we implement the hard limit using an alarm, we'll have to give up on work-horse monitoring by the worker, since that uses an alarm as well, and a process can have only one pending alarm at a time. We could get around this with a threading timer. WDYT?
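A minimal sketch of the threading-timer idea, assuming the worker knows the work horse's pid (the helper name is hypothetical; killing the process group relies on RQ's work horse calling `setsid()`):

```python
import os
import signal
import threading


def enforce_hard_limit(horse_pid, hard_limit_seconds):
    """Kill the work horse if it outlives the hard limit.

    Runs in the worker (parent) process, so it doesn't compete with
    the work horse's own SIGALRM-based soft limit.
    """
    def kill_horse():
        try:
            os.killpg(os.getpgid(horse_pid), signal.SIGKILL)
        except OSError:
            pass  # the horse already exited on its own

    timer = threading.Timer(hard_limit_seconds, kill_horse)
    timer.daemon = True  # don't keep the worker alive just for the timer
    timer.start()
    return timer  # cancel() this when the horse exits normally
```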
What do you refer to here?
This line will set up an alarm that will trigger a […].
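For context, the signal-based timeout inside the work horse follows the standard SIGALRM pattern; roughly (a simplified sketch, not RQ's exact code):

```python
import signal


class JobTimeoutException(Exception):
    """Raised when the alarm fires mid-job."""


def raise_timeout(signum, frame):
    raise JobTimeoutException('job timed out')


# inside the work horse, wrapped around job execution:
signal.signal(signal.SIGALRM, raise_timeout)
signal.alarm(180)    # deliver SIGALRM in `timeout` seconds
try:
    pass             # ... perform the job here ...
finally:
    signal.alarm(0)  # cancel the alarm if the job finished in time
```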
@rauchy we can have a single alarm signal used for both things, no?
You can only schedule a single signal alarm. Are you suggesting to catch […]?
Exactly. The downside is indeed a real one. We can experiment with lowering the interval for the monitor thing, but we need to make sure it has no unwanted side effects.
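To make the one-alarm constraint concrete: scheduling a new `signal.alarm` silently replaces any pending one (Unix-only behavior):

```python
import signal

signal.alarm(60)             # schedule an alarm 60 seconds out
remaining = signal.alarm(5)  # reschedule -- this cancels the first alarm
print(remaining)             # ~60: seconds that were left on the alarm we lost
signal.alarm(0)              # clear the pending alarm
```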
redash/worker.py (outdated)

```diff
@@ -93,3 +97,68 @@ def add_periodic_tasks(sender, **kwargs):
     for params in extensions.periodic_tasks.values():
         # Add it to Celery's periodic task registry, too.
         sender.add_periodic_task(**params)
+
+
+class HardTimeLimitingWorker(Worker):
```
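For readers following along, a rough sketch of the approach such a worker can take (not necessarily the PR's final code; it assumes RQ ~1.x `Worker` internals such as `kill_horse()`, `_horse_pid`, and `job_monitoring_interval`, and uses `datetime.utcnow` in place of redash's `utcnow` helper to stay self-contained):

```python
import os
from datetime import datetime

from rq import Worker
from rq.timeouts import UnixSignalDeathPenalty


class HorseMonitorTimeoutException(Exception):
    """Wakes the monitor loop every `job_monitoring_interval` seconds."""


class HardTimeLimitingWorker(Worker):
    """Worker that enforces job timeouts from the parent process.

    RQ's built-in timeout raises inside the work horse via SIGALRM,
    which the executing code can block. This worker also watches the
    horse from the outside and kills it once the job has run
    `grace_period` seconds past its soft limit.
    """

    grace_period = 15  # seconds past the soft limit before killing the horse

    def soft_limit_exceeded(self, job):
        seconds_under_monitor = (datetime.utcnow() - self.monitor_started).total_seconds()
        return seconds_under_monitor > job.timeout + self.grace_period

    def enforce_hard_limit(self, job):
        self.log.warning('Job %s exceeded its timeout; killing the work horse.', job.id)
        self.kill_horse()  # SIGKILL cannot be blocked by the job

    def monitor_work_horse(self, job):
        self.monitor_started = datetime.utcnow()
        while True:
            try:
                # Instead of blocking on waitpid() forever, wake up
                # periodically to check on the job.
                with UnixSignalDeathPenalty(self.job_monitoring_interval,
                                            HorseMonitorTimeoutException):
                    _, ret_val = os.waitpid(self._horse_pid, 0)
                break  # the horse exited (normally, or killed above)
            except HorseMonitorTimeoutException:
                if self.soft_limit_exceeded(job):
                    self.enforce_hard_limit(job)
        # A full implementation would now inspect ret_val and mark the
        # job as failed if the horse was killed.
```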
How about we move this into its own module in the `redash.tasks` package? The `redash.schedule` module belongs there as well. I would move all the rq stuff from `redash.worker` into `redash.tasks` and eventually remove this module when we say goodbye to Celery.
Also worth adding some documentation on why we added this class.
Just one small thing.
redash/tasks/__init__.py (outdated)

```diff
@@ -3,3 +3,5 @@
                      refresh_schemas, cleanup_query_results, empty_schedules)
 from .alerts import check_alerts_for_query
 from .failure_report import send_aggregated_errors
+from .hard_limiting_worker import *
+from .schedule import *
```
Nit: please don't do `*` imports.
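For instance, spelled out (the imported names are assumptions based on this diff, not verified against the final module contents):

```python
from .hard_limiting_worker import HardTimeLimitingWorker
from .schedule import rq_scheduler, schedule_periodic_jobs
```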
One more thing for this PR: we need to make sure that all the tasks have sensible timeout settings.
All jobs have the default timeout of 3 minutes at the moment, except for […]. Obviously, when we get to query executions, these will be dynamic.
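For reference, RQ supports a per-job timeout at enqueue time, so dynamic query-execution timeouts could look roughly like this (`execute_query` and the queue name are placeholders):

```python
from redis import Redis
from rq import Queue


def execute_query(query_id):
    """Placeholder task standing in for the real one."""


queue = Queue('queries', connection=Redis())

# the 3-minute default applies unless overridden per job:
job = queue.enqueue(execute_query, 42, job_timeout=600)
```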
Do we need to merge this, or does #4413 cover it too?
```python
    grace_period = 15

    def soft_limit_exceeded(self, job):
        seconds_under_monitor = (utcnow() - self.monitor_started).seconds
```
Small consideration. The timedelta docs say that `.seconds` is `0 <= seconds < 3600*24` (the number of seconds in one day). If jobs that run longer than a day are supported, this should probably use `.total_seconds()` instead.
Suggested change:

```diff
-        seconds_under_monitor = (utcnow() - self.monitor_started).seconds
+        seconds_under_monitor = (utcnow() - self.monitor_started).total_seconds()
```
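A quick illustration of the difference:

```python
from datetime import timedelta

delta = timedelta(days=1, seconds=5)
print(delta.seconds)          # 5 -- only the leftover seconds component
print(delta.total_seconds())  # 86405.0 -- the full duration
```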
#4413 is merged, so I guess we can close this one?
Ping, @rauchy.
Yes, this is all included in […].
RQ's timeout implementation is based on a signal invoked in the "work horse" process, which might be blocked by the executing code. We should implement a more reliable timeout -- probably by handling it in the parent process.
Relevant RQ issues: rq/rq#323, rq/rq#1142