Celery for background tasks? #66
Tryggvi suggested, based on one of his projects, that there is benefit in separating the code for front-end and back-end tasks (and that Pika is better for that than Celery), but I think that in CKAN having a separate install process for back-end tasks is going to be more hassle than it's worth.
Anyone have any feelings about Skytools/PgQ?
I don't like the idea of using a database as a queue. I like the idea of a redis-backed queue (because we can also use redis for sessions). While we're throwing out suggestions: it's not technically a queue, but what about http://gearman.org/? It'd be nice to open up background processing to other languages, and you get a choice about where you persist data (memcached, pg etc.).
Does this really count as "using a database as a queue"? It's custom queueing code used by Skype that just happens to be available via SQL commands on a db we already have.
Perhaps. I'm just nervous about things I've never heard of before, as it often means they aren't very widely used. Maybe I'm just being pessimistic :) It does seem to be reasonably active though - https://github.com/markokr/skytools What's the setup/install like?
Cross-posted from ckan/ckan#1796 I'm in favour of having a mechanism for processing delayed jobs in CKAN core. Celery is the go-to for such a system in a Python application, so unless there are clear and well-argued reasons for doing anything else, let's use that. As for the backend, Redis is certainly simpler to deploy and manage than Rabbit, and can be configured to have appropriate persistence properties for a queue (you should use AOF mode when using Redis as a queue). (In a perfect world, I'd also kill ckan-service-provider and datapusher in favour of such a system, but I think that's a different discussion).
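For reference, a minimal sketch of the redis.conf persistence settings implied by that advice - the directive names are stock Redis, the values here are only illustrative:

    appendonly yes        # enable the append-only file (AOF) so queued jobs survive a restart
    appendfsync everysec  # fsync about once per second; "always" is more durable but slower

The everysec setting is the usual compromise: at most one second of queued jobs can be lost on a crash, without fsyncing on every write.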
@wardi Redis works mostly in memory, which is more appropriate for adding and removing things from a queue frequently, compared to a more disk-intensive relational database. But I imagine you had your reasons for suggesting Skytools, so let's hear them. @nickstenning I'm very happy to encourage plenty of partially-formed reasons and gut reactions in all debates - let's keep this open. And I think we're settled on Redis - there is no proposal to change back to rabbitmq. Good tip on the AOF - we can add that when developing the docs for background tasks.
btw what's ckan-service-provider? And what does datapusher use for a queue?
@davidread celery and redis are new things for me, and I'm an extremely lazy person. skytools is also new for me, but seems less scary because it's based on something I do know. I understand how to scale out wsgi processes, and I can set up replication and fail-over with postgres. solr doesn't seem to have any distributed options, so I just rebuild it if it goes away (but there's no data lost, so no big deal). What's the best way to run redis so that we don't lose jobs?
Datapusher uses its own queue which it stores (by default) in a SQLite database, built on top of
Absolutely, but there's a huge amount of code you'd need to write if you want to use this. As I understand it skytools is a thin Python wrapper over a bunch of PL/PGSQL and C, and exposes a generic consumer/producer queue API. That's a long way from being a complete job runner, which I would expect to provide such features as:
Celery provides all of these and more, whereas skytools provides approximately none (which is fine, as it's not trying to fill the same space -- it's a much lower-level tool).
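For illustration, a minimal hedged sketch of what those job-runner features look like in Celery - the task, the helper, and the broker URL are hypothetical, not anything CKAN ships:

    from celery import Celery

    app = Celery('ckan_tasks', broker='redis://localhost:6379/0')

    def update_search_index(dataset_id):
        pass  # placeholder for the real Solr indexing call

    # bind=True gives the task access to self so it can retry itself
    @app.task(bind=True, max_retries=3, default_retry_delay=60)
    def reindex_dataset(self, dataset_id):
        try:
            update_search_index(dataset_id)
        except IOError as exc:
            # re-enqueue this task; Celery gives up after max_retries attempts
            raise self.retry(exc=exc)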
Distributed operation, timeouts and retries sound good for the sort of thing datapusher does. Also for what the qa extension does. I was thinking of background tasks like "update the organization information for 10K datasets in a local SOLR core". To me that calls for something simpler.
It rather depends on what scenario you're imagining. Probably the most common failure mode will be a celeryd crash. To protect against this you need a protocol which supports message acknowledgements, such as AMQP: hence Rabbit. With Redis in AOF mode with Other possible failure modes:
Unfortunately as far as I'm aware there just isn't a decent background job library for Python that works with Postgres yet. (Although with the addition of |
Well, maybe, but one task scheduler is probably simpler than two, and you're certainly not obliged to use all of Celery's features!
full disclosure of my biases:
So, I probably shouldn't participate in this discussion :-)
I haven't used celery enough to comment. I only have one point to make: whatever we pick, let's please use it consistently for background tasks across CKAN, which makes it less of a pain.
@wardi Celery is just python code, so would it need approval from your organization? Redis is pretty mainstream, so getting approval shouldn't be any tougher than for other things, I imagine. And I guess you could use postgres as a back end for Celery. But it's surely a good reason to avoid chopping and changing in the future. Since we're going with queues in core ckan, I think we should embrace them for indexing of all packages. This would be better than running a paster command that takes an hour or so before returning when you restore a database. And we could even put a progress bar in the package search UI, for a sysadmin to keep tabs on the indexing and to explain a low package count. It's not strictly necessary, but it would ensure the queue software gets installed correctly and give devs a clear view of how it works.
@brew Here's the ticket mentioned at the meeting this morning. As discussed above, let's settle on Celery + Redis (non-distributed) as the standard approach for queues in ckan. I'm planning to build in that direction with my docker stuff.
I know this seems like it has already been decided, but having looked deeper at it, http://python-rq.org looks very interesting. It's easy to install and configure, seems widely used (it's even suggested by Heroku) and is actively developed (https://github.com/nvie/rq).
RQ has a small code-base, which is good, and we don't make use of the Celery features it leaves out: AMQP routing/delivery rules, tasks written in non-python languages. However install, setup of tasks and running tasks all seem very similar to Celery (particularly in versions newer than the one we're on at DGU), so I can't see much advantage in switching on the face of it. But if you do get a chance to convert archiver across to it and see whether it is any simpler in reality, then great!
I use both rq and celery on a variety of projects. Both have their places, and celery is significantly more featured than rq. For CKAN's use case, rq is completely sufficient and easy to integrate. Its code complexity is far below that of celery, and debugging it is downright enjoyable compared to celery. Its performance is also excellent (mostly because it binds itself tightly to redis instead of trying to support a wide variety of brokers and result stores). In cases where you need tens of thousands of workers across thousands of cores, extremely complex routing and highly scaled queues, I would definitely recommend celery. For CKAN, where general usage will likely be periods of heavy bulk loading followed by periodic bulk updates and individual record updates, I would just go with rq and keep it as simple as possible - probably just two queues: queue-default (for all tasks) and queue-ui (for user-triggered events such as reindexing a single dataset).
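For illustration, a minimal sketch of that two-queue layout using the rq API - the queue names come from the comment above, and the task function is a placeholder:

    from redis import Redis
    from rq import Queue

    def reindex_dataset(dataset_id):
        # placeholder task; workers import this function by its dotted path
        print('reindexing %s' % dataset_id)

    redis_conn = Redis()
    default_queue = Queue('queue-default', connection=redis_conn)  # bulk work
    ui_queue = Queue('queue-ui', connection=redis_conn)            # user-triggered events

    # enqueue returns a Job whose status and result can be polled later
    job = ui_queue.enqueue(reindex_dataset, 'some-dataset-id')

A worker started as "rqworker queue-ui queue-default" drains queues in the order listed, so user-triggered jobs would take priority over bulk work.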
@TkTech thanks v. much for weighing in on this. It sounds very much like we should give it a shot with rq.
@TkTech If I want to schedule jobs like I would with cron, how would I do that with rq? I haven't found a nice way to run the cron daemon in the foreground (for use in docker) and I was hoping there would be a solution for periodic jobs in our queue of choice.
@wardi You would typically do that with cron (in the case of rq) or with beat (in the case of celery). In both cases, a separate process needs to be run to start the jobs (you can technically run beat inside a worker, but you would never do this except for local development). There is also the rq-scheduler 3rd-party project, which is stable, popular, and extremely easy to use. For integration, it's easy to embed both rq-scheduler and rq workers into a paster command (or some other convenience). For example, here is how I run workers using the same command line as I use for most general tasks, while using the configuration from a flask app:

    from redis import Redis
    from rq import Connection, Queue, Worker

    if args['worker']:
        with app.app_context():
            with Connection(Redis.from_url(app.config['BROKER_URL'])):
                # listen on the named queues, or the default queue if none given
                qs = [Queue(n) for n in args['--names']] or [Queue()]
                w = Worker(qs)
                w.work()
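And for completeness, a hedged sketch of periodic jobs with rq-scheduler - the function name is a placeholder, and a separate "rqscheduler" process must be running to move due jobs onto the queue:

    from datetime import datetime, timedelta
    from redis import Redis
    from rq_scheduler import Scheduler

    def nightly_reindex():
        pass  # placeholder periodic task

    scheduler = Scheduler(connection=Redis())

    # one-off delayed job
    scheduler.enqueue_in(timedelta(minutes=10), nightly_reindex)

    # recurring job: enqueued now and re-enqueued every hour thereafter
    scheduler.schedule(
        scheduled_time=datetime.utcnow(),
        func=nightly_reindex,
        interval=3600,  # seconds between runs
        repeat=None,    # None means repeat indefinitely
    )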
Good to see rq has a scheduler. It's extremely important to have one for CKAN I think. Cron jobs are too simple, and their configuration lives outside of CKAN source, db, or config, introducing state where state shouldn't be. They have to be set up manually, and they have to be migrated manually as well. Cron really becomes inadequate for anything but the simplest CKAN deployments. It sounds like we could give rq a shot - it seems like a better alternative, but before it's decided we should address:
Celery is hardly used at the moment and I could find no extension that depends on it. The change should have no API impact.
+1 for rq
It's used in a few extensions that I know of:
We can migrate those easily enough, but celery is also used in at least a few large CKAN projects by orgs and governments that don't always release their code publicly. Don't assume that this change won't impact anybody; that's a terrible way to build a good open source project.
I agree, perhaps abstracting it out might be best. But if govs are not releasing their code related to ckan, they are breaching the licence :(
Just for the record (and I know no one said otherwise), Cdn open data does release all of our code to GitHub - unless there is something Ian isn't telling me ;-)
@rossjones @thriuin that's a good point. We could send out a notice to ckan-dev asking if anybody knows where celery is being used, and to point us to the code. If we don't find (m)any cases, we can move to rq?
We can always implement rq alongside, move the core extensions, and let people know they should move before release 2.x.y if they're depending on celery?
Came here to talk about schedulers; I'm glad rq has that covered :) In terms of deprecating celery, I don't think that is a major issue. Celery is not even a requirement for CKAN, so if somebody or some extension is using it they will already be taking care of installing it. We can announce deprecation (once rq support is implemented and tested!) and keep the celery code in core for a release (what @rossjones said, essentially). Maybe write a short guide about how to migrate jobs from celery to rq if necessary.
Just dropping in to ask if there's a definitive doc for using celery with CKAN? All I've been able to find is cobbled together from the readmes of a few extensions. Cheers!
@CarlQLange: There is a section about background tasks in the CKAN documentation.
@torfsen Aha! Thank you so much!
There is now a new PR for this, see ckan/ckan#3165.
Background jobs are now merged to master, thanks to the brilliant work by @torfsen. Check the docs for more details: http://docs.ckan.org/en/latest/maintaining/background-tasks.html
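For illustration, a minimal sketch of the new API per those docs - this assumes CKAN 2.7+, and the job function here is a placeholder:

    from ckan.plugins import toolkit

    def log_dataset(dataset_id):
        # placeholder job; runs in the worker process, not the web server
        print('processing %s' % dataset_id)

    # queue the job; a worker started with "paster jobs worker" picks it up
    toolkit.enqueue_job(log_dataset, ['some-dataset-id'])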
Wow, that looks fantastic. Great job @torfsen.
Hi @amercader. Also, the background-tasks doc uses webhooks as the first example of what background tasks are useful for. Is that "for example" aspirational, or a concrete example? :)
@jqnatividad, the old Celery system is deprecated but still available, hence anything that is working now should continue to work. AFAIK there is currently no time plan for removing the Celery system.
@jqnatividad, @torfsen wrote a great section on migrating to the new queue framework and on how to support both systems, so it should be really easy to update ckanext-webhooks to support it.
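A hedged sketch of the dual-support pattern that section describes, as one might apply it in ckanext-webhooks - the function and task names are hypothetical, and the exact fallback import may differ, so check the migration guide:

    def send_webhook(payload):
        pass  # placeholder for the actual HTTP POST

    def enqueue_notify(payload):
        try:
            # CKAN >= 2.7: the new RQ-based job system
            from ckan.plugins.toolkit import enqueue_job
            enqueue_job(send_webhook, [payload])
        except ImportError:
            # older CKAN: fall back to the deprecated Celery system
            from ckan.lib.celery_app import celery
            celery.send_task('webhooks.notify', args=[payload])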
@jqnatividad, the documentation @amercader is talking about is here: Migrating from CKAN’s previous background job system
CKAN has long had some integration with Celery for performing background tasks. However there have been issues and some people say we should do something else. It would be good to resolve this as we need it for ckan/ckan#1796 among other things.
Celery for:
Celery against:
Alternatives:
Tryggvi suggested these Celery best practices: https://denibertovic.com/posts/celery-best-practices/ (e.g. using Flower to monitor Celery nicely).