
RQ: implement reliable timeout #4305

Closed
wants to merge 9 commits

Conversation

@rauchy (Contributor) commented Nov 13, 2019

RQ's timeout implementation is based on a signal handled in the "work horse" process, which might be blocked by the executing code. We should implement a more reliable timeout, probably by handling it in the parent process.

Relevant RQ issues: rq/rq#323, rq/rq#1142
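
For context on why the in-horse approach can fail: Python only runs signal handlers between bytecode instructions, so a long blocking call inside a C extension (a stuck database driver call, for example) can delay the timeout, potentially indefinitely. A minimal, Unix-only sketch of the general mechanism (illustrative, not RQ's actual implementation in rq.timeouts):

```python
import signal
import time

def timeout_handler(signum, frame):
    raise TimeoutError("job exceeded its timeout")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(2)  # arm a 2-second timeout, roughly what the work horse does

try:
    # A pure-Python sleep *is* interrupted by the alarm; a C-level call that
    # never re-enters the interpreter would keep running past the timeout.
    time.sleep(10)
except TimeoutError as exc:
    print(exc)
finally:
    signal.alarm(0)  # always disarm the alarm
```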

@arikfr (Member, Author) commented Oct 28, 2019

RQ's timeout functionality is called in the work horse process:

https://github.com/rq/rq/blob/e43bce4467c3e1800d75d9cedf75ab6e7e01fe8c/rq/worker.py#L815

@rauchy (Contributor) commented Nov 13, 2019

As suggested in rq/rq#323, the most reliable approach here would be to handle this with both a limit inside the work horse (as currently implemented) and a fallback limit in the worker. This is effectively a mechanism that resembles Celery's hard & soft limits.

One problem I see: if we implement the hard limit using an alarm, we'll have to give up on work horse monitoring by the worker, since it uses an alarm as well and a process can only have one pending alarm at a time. We can get around this using a threading timer. WDYT?
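
A minimal sketch of the threading-timer idea, for illustration only: arm a timer in the parent worker when the work horse is forked, and kill the horse once the job's timeout plus a grace period elapses. kill_horse mirrors the method RQ's Worker exposes; the fork/cleanup wiring is omitted.

```python
import threading

GRACE_PERIOD = 15  # illustrative grace period beyond the job's own timeout

def arm_hard_limit(worker, job):
    """Arm a hard limit in the parent; kill the work horse if it overruns."""
    timer = threading.Timer(job.timeout + GRACE_PERIOD, worker.kill_horse)
    timer.daemon = True
    timer.start()
    return timer  # call timer.cancel() when the horse exits normally
```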

@arikfr (Member, Author) commented Nov 13, 2019

we'll have to give up on work horse monitoring by worker

What are you referring to here?

@rauchy (Contributor) commented Nov 13, 2019

This line sets up an alarm that triggers a HorseMonitorTimeoutException after (probably) 30s, which helps maintain an active heartbeat. If we also use a Unix alarm for the job timeout, we'll have to give up on that functionality. The only way I can see to keep both intact is a threading timer for the job timeout.
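
A paraphrased sketch of the monitoring loop being described (not RQ's actual code): the parent arms SIGALRM for job_monitoring_interval seconds around waitpid(); when the alarm fires, HorseMonitorTimeoutException is raised, the worker does its heartbeat work, and the loop re-arms the alarm and keeps waiting. The on_interval callback stands in for whatever heartbeat refresh the worker performs.

```python
import os
import signal

class HorseMonitorTimeoutException(Exception):
    pass

def monitor_work_horse(horse_pid, job_monitoring_interval=30, on_interval=None):
    def raise_monitor_timeout(signum, frame):
        raise HorseMonitorTimeoutException()

    signal.signal(signal.SIGALRM, raise_monitor_timeout)
    while True:
        try:
            signal.alarm(job_monitoring_interval)
            os.waitpid(horse_pid, 0)   # returns once the work horse exits
            break
        except HorseMonitorTimeoutException:
            if on_interval is not None:
                on_interval()          # e.g. refresh the job's heartbeat
        finally:
            signal.alarm(0)            # disarm before the next iteration
```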

@arikfr (Member, Author) commented Nov 13, 2019

@rauchy we can have a single alarm signal used for both things, no?

@rauchy (Contributor) commented Nov 13, 2019

You can only schedule a single signal alarm at a time. Are you suggesting to catch HorseMonitorTimeoutException and check if job.timeout has already passed? The downside to that is that we are bound to intervals of job_monitoring_interval, so it could be raised anywhere between 0-30s (typically) after the timeout has passed, which might be too soon. Then again, we can simply wait for a delta of 15 seconds beyond the timeout so it'll be killed on the next iteration.

@arikfr (Member, Author) commented Nov 13, 2019

Are you suggesting to catch HorseMonitorTimeoutException and check if job.timeout has already passed?

Exactly.

The downside is indeed a real one. We can experiment with lowering the monitoring interval, but we need to make sure it has no unwanted side effects.
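
A rough sketch of the approach being converged on here (hedged; the PR's actual class is HardTimeLimitingWorker and its wiring differs): every time the monitor alarm fires, check how long the horse has been running, and enforce the hard limit once the job's own timeout plus a grace period has passed. monitor_started is assumed to be recorded when monitoring begins, and utcnow is the helper from rq.utils.

```python
from rq import Worker
from rq.utils import utcnow

class HardLimitSketch(Worker):
    grace_period = 15  # seconds past job.timeout before the worker steps in

    def soft_limit_exceeded(self, job):
        running_for = (utcnow() - self.monitor_started).total_seconds()
        return running_for > job.timeout + self.grace_period

    def handle_monitor_timeout(self, job):
        # meant to be called from the HorseMonitorTimeoutException branch
        # of the monitoring loop shown earlier
        if self.soft_limit_exceeded(job):
            self.kill_horse()  # the hard limit: SIGKILL the work horse
```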

@rauchy rauchy requested a review from arikfr November 13, 2019 21:32
redash/worker.py Outdated
@@ -93,3 +97,68 @@ def add_periodic_tasks(sender, **kwargs):
    for params in extensions.periodic_tasks.values():
        # Add it to Celery's periodic task registry, too.
        sender.add_periodic_task(**params)


class HardTimeLimitingWorker(Worker):
Member review comment:

How about we move this into its own module in the redash.tasks package? The redash.schedule module belongs there as well.

I would move all the rq stuff from redash.worker into redash.tasks and eventually remove this module when we say goodbye to Celery.

Member review comment:

Also worth adding some documentation on why we added this class.

@arikfr arikfr (Member) left a comment:

Just one small thing.

@@ -3,3 +3,5 @@
refresh_schemas, cleanup_query_results, empty_schedules)
from .alerts import check_alerts_for_query
from .failure_report import send_aggregated_errors
from .hard_limiting_worker import *
from .schedule import *
Member review comment:

Nit: please don't do * imports.
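
For illustration, the explicit form might look like this; HardTimeLimitingWorker is the only symbol visible in this diff, and the names imported from .schedule are placeholders:

```python
from .hard_limiting_worker import HardTimeLimitingWorker
from .schedule import rq_scheduler, schedule_periodic_jobs  # placeholder names
```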

@arikfr (Member, Author) commented Nov 17, 2019

One more thing for this PR: we need to make sure that all the tasks have sensible timeout settings.

@rauchy (Contributor) commented Nov 26, 2019

One more thing for this PR: we need to make sure that all the tasks have sensible timeout settings.

All jobs have the default timeout of 3 minutes at the moment, except for sync_user_details, which maxes out at 60 seconds. These feel like sensible values to me. WDYT?

Obviously, when we get to query executions, these will be dynamic.
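
For reference, a hedged example of how per-job timeouts are attached with RQ 1.x (Redash's own task declarations may set these differently, e.g. via decorators; the module path for sync_user_details is assumed):

```python
from redis import Redis
from rq import Queue

from redash.tasks import refresh_schemas            # listed in this PR's diff
from redash.tasks.general import sync_user_details  # module path assumed

queue = Queue(connection=Redis())

# the 3-minute default described above
queue.enqueue(refresh_schemas, job_timeout=180)

# the tighter 60-second cap on sync_user_details
queue.enqueue(sync_user_details, job_timeout=60)
```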

@arikfr (Member, Author) commented Dec 5, 2019

Do we need to merge this, or does #4413 cover this too?

@rauchy (Contributor) commented Dec 10, 2019

I based #4413 on this because of the similarities in the custom worker. Given that #4413 will take some more time for testing, if you think this is ready to ship - let's merge it on its own.

grace_period = 15

def soft_limit_exceeded(self, job):
    seconds_under_monitor = (utcnow() - self.monitor_started).seconds


Small consideration. The timedelta docs say that .seconds is bounded by 0 <= seconds < 3600*24 (the number of seconds in one day). If jobs that run longer than a day are supported, we should probably use .total_seconds() instead.

Suggested change:
- seconds_under_monitor = (utcnow() - self.monitor_started).seconds
+ seconds_under_monitor = (utcnow() - self.monitor_started).total_seconds()
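
To make the point concrete: .seconds is only the seconds component of a timedelta and wraps every 24 hours, while .total_seconds() returns the full duration.

```python
from datetime import timedelta

delta = timedelta(days=1, seconds=5)
print(delta.seconds)          # 5 -> looks like the job barely started
print(delta.total_seconds())  # 86405.0 -> the actual elapsed time
```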

@arikfr (Member, Author) commented Jan 19, 2020

#4413 is merged, so I guess we can close this one?

@arikfr (Member, Author) commented Jan 21, 2020

Ping, @rauchy.

@rauchy (Contributor) commented Jan 21, 2020

Yes, this is all included in master by now.

@rauchy rauchy closed this Jan 21, 2020
@jezdez jezdez deleted the hard-time-limit branch January 22, 2020 10:38