Celery for background tasks? #66
Tryggvi suggested, based on one of his projects, that there is benefit in separating the code for front-end and back-end tasks (and that Pika is better for that than Celery), but I think that in CKAN having a separate install process for back-end tasks is going to be more hassle than it's worth.
Anyone have any feelings about Skytools/PgQ?
I don't like the idea of using a database as a queue. I like the idea of a redis-backed queue (because we can also use redis for sessions). While we're throwing out suggestions: it's not technically a queue, but what about http://gearman.org/? It'd be nice to open up background processing to other languages, and you get a choice about where you persist data (memcached, pg etc.).
Does this really count as "using a database as a queue"? It's custom queueing code used by Skype that just happens to be available via SQL commands on a db we already have.
Perhaps. I'm just nervous about things I've never heard of before, as it often means they aren't very widely used. Maybe I'm just being pessimistic :) It does seem to be reasonably active though - https://github.com/markokr/skytools What's the setup/install like?
Cross-posted from ckan/ckan#1796 I'm in favour of having a mechanism for processing delayed jobs in CKAN core. Celery is the go-to for such a system in a Python application, so unless there are clear and well-argued reasons for doing anything else, let's use that. As for the backend, Redis is certainly simpler to deploy and manage than Rabbit, and can be configured to have appropriate persistence properties for a queue (you should use AOF mode when using Redis as a queue). (In a perfect world, I'd also kill ckan-service-provider and datapusher in favour of such a system, but I think that's a different discussion).
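For reference, a minimal sketch of the redis.conf persistence settings implied by that advice - the directive names are stock Redis, the values here are only illustrative:

    appendonly yes        # enable the append-only file (AOF) so queued jobs survive a restart
    appendfsync everysec  # fsync about once per second; "always" is more durable but slower

The everysec setting is the usual compromise: at most one second of queued jobs can be lost on a crash, without fsyncing on every write.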
@wardi Redis works mostly in memory, which is more appropriate for adding and removing things from a queue frequently, compared to a more disk-intensive relational database. But I imagine you had your reasons for suggesting Skytools, so let's hear them. @nickstenning I'm very happy to encourage plenty of partially-formed reasons and gut reactions in all debates - let's keep this open. And I think we're settled on Redis - there is no proposal to change back to rabbitmq. Good tip on the AOF - we can add that when developing the docs for background tasks.
btw what's ckan-service-provider? And what does datapusher use for a queue?
@davidread celery and redis are new things for me, and I'm an extremely lazy person. skytools is also new for me, but seems less scary because it's based on something I do know. I understand how to scale out wsgi processes, and I can set up replication and fail-over with postgres. solr doesn't seem to have any distributed options, so I just rebuild it if it goes away (but there's no data lost, so no big deal). What's the best way to run redis so that we don't lose jobs?
Datapusher uses its own queue which it stores (by default) in a SQLite database, built on top of
Absolutely, but there's a huge amount of code you'd need to write if you want to use this. As I understand it skytools is a thin Python wrapper over a bunch of PL/PGSQL and C, and exposes a generic consumer/producer queue API. That's a long way from being a complete job runner, which I would expect to provide such features as:
Celery provides all of these and more, whereas skytools provides approximately none (which is fine, as it's not trying to fill the same space -- it's a much lower-level tool).
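For illustration, a minimal hedged sketch of what those job-runner features look like in Celery - the task, the helper, and the broker URL are hypothetical, not anything CKAN ships:

    from celery import Celery

    app = Celery('ckan_tasks', broker='redis://localhost:6379/0')

    def update_search_index(dataset_id):
        pass  # placeholder for the real Solr indexing call

    # bind=True gives the task access to self so it can retry itself
    @app.task(bind=True, max_retries=3, default_retry_delay=60)
    def reindex_dataset(self, dataset_id):
        try:
            update_search_index(dataset_id)
        except IOError as exc:
            # re-enqueue this task; Celery gives up after max_retries attempts
            raise self.retry(exc=exc)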
Distributed operation, timeouts and retries sound good for the sort of thing datapusher does. Also for what the qa extension does. I was thinking of background tasks like "update the organization information for 10K datasets in a local SOLR core". To me that calls for something simpler.
It rather depends on what scenario you're imagining. Probably the most common failure mode will be a celeryd crash. To protect against this you need a protocol which supports message acknowledgements, such as AMQP: hence Rabbit. With Redis in AOF mode with Other possible failure modes:
Unfortunately as far as I'm aware there just isn't a decent background job library for Python that works with Postgres yet. (Although with the addition of |
Well, maybe, but one task scheduler is probably simpler than two, and you're certainly not obliged to use all of Celery's features!
full disclosure of my biases:
So, I probably shouldn't participate in this discussion :-)
I haven't used celery enough to comment. I only have one point to make: whatever we pick, let's please use it consistently for background tasks across CKAN, which makes it less of a pain.
@wardi Celery is just python code, so would it need approval from your organization? Redis is pretty mainstream, so getting approval shouldn't be any tougher than for other things, I imagine. And I guess you could use postgres as a back end for Celery. But it's surely a good reason to avoid chopping and changing in the future. Since we're going with queues in core ckan, I think we should embrace them for indexing of all packages. This would be better than running a paster command that takes an hour or so before returning when you restore a database. And we could even put a progress bar in the package search UI, for a sysadmin to keep tabs on the indexing and to explain a low package count. It's not strictly necessary, but it would ensure the queue software gets installed correctly and give devs a clear view of how it works.
@brew Here's the ticket mentioned at the meeting this morning. As discussed above, let's settle on Celery + Redis (non-distributed) as the standard approach for queues in ckan. I'm planning to build in that direction with my docker stuff.
I know this seems like it has already been decided, but having looked deeper at it, http://python-rq.org looks very interesting. It's easy to install and configure, seems widely used (it's even suggested by Heroku) and is actively developed (https://github.com/nvie/rq).
RQ has a small code-base, which is good, and we don't make use of the Celery features it leaves out: AMQP routing/delivery rules, tasks written in non-python languages. However install, setup of tasks and running tasks all seem very similar to Celery (particularly in versions newer than the one we're on at DGU), so I can't see much advantage in switching on the face of it. But if you do get a chance to convert archiver across to it and see whether it is any simpler in reality, then great!
I use both rq and celery on a variety of projects. Both have their places, and celery is significantly more featured than rq. For CKAN's use case, rq is completely sufficient and easy to integrate. Its code complexity is far below that of celery, and debugging it is downright enjoyable compared to celery. Its performance is also excellent (mostly because it binds itself tightly to redis instead of trying to support a wide variety of brokers and result stores). In cases where you need tens of thousands of workers across thousands of cores, extremely complex routing and highly scaled queues, I would definitely recommend celery. For CKAN, where general usage will likely be periods of heavy bulk loading followed by periodic bulk updates and individual record updates, I would just go with rq and keep it as simple as possible - probably just two queues: queue-default (for all tasks) and queue-ui (for user-triggered events such as reindexing a single dataset).
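For illustration, a minimal sketch of that two-queue layout using the rq API - the queue names come from the comment above, and the task function is a placeholder:

    from redis import Redis
    from rq import Queue

    def reindex_dataset(dataset_id):
        # placeholder task; workers import this function by its dotted path
        print('reindexing %s' % dataset_id)

    redis_conn = Redis()
    default_queue = Queue('queue-default', connection=redis_conn)  # bulk work
    ui_queue = Queue('queue-ui', connection=redis_conn)            # user-triggered events

    # enqueue returns a Job whose status and result can be polled later
    job = ui_queue.enqueue(reindex_dataset, 'some-dataset-id')

A worker started as "rqworker queue-ui queue-default" drains queues in the order listed, so user-triggered jobs would take priority over bulk work.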
@TkTech thanks v. much for weighing in on this. It sounds very much like we should give it a shot with rq.
@TkTech If I want to schedule jobs like I would with cron, how would I do that with rq? I haven't found a nice way to run the cron daemon in the foreground (for use in docker) and I was hoping there would be a solution for periodic jobs in our queue of choice.
@wardi You would typically do that with cron (in the case of rq) or with beat (in the case of celery). In both cases, a separate process needs to be run to start the jobs (you can technically run beat inside a worker, but you would never do this except for local development). There is also the rq-scheduler 3rd-party project, which is stable, popular, and extremely easy to use. For integration, it's easy to embed both rq-scheduler and rq workers into a paster command (or some other convenience). For example, here is how I run workers using the same command line as I use for most general tasks, while using the configuration from a flask app:

    from redis import Redis
    from rq import Connection, Queue, Worker

    if args['worker']:
        with app.app_context():
            with Connection(Redis.from_url(app.config['BROKER_URL'])):
                # listen on the named queues, or the default queue if none given
                qs = [Queue(n) for n in args['--names']] or [Queue()]
                w = Worker(qs)
                w.work()
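And for completeness, a hedged sketch of periodic jobs with rq-scheduler - the function name is a placeholder, and a separate "rqscheduler" process must be running to move due jobs onto the queue:

    from datetime import datetime, timedelta
    from redis import Redis
    from rq_scheduler import Scheduler

    def nightly_reindex():
        pass  # placeholder periodic task

    scheduler = Scheduler(connection=Redis())

    # one-off delayed job
    scheduler.enqueue_in(timedelta(minutes=10), nightly_reindex)

    # recurring job: enqueued now and re-enqueued every hour thereafter
    scheduler.schedule(
        scheduled_time=datetime.utcnow(),
        func=nightly_reindex,
        interval=3600,  # seconds between runs
        repeat=None,    # None means repeat indefinitely
    )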
Good to see rq has a scheduler. It's extremely important to have one for CKAN I think. Cron jobs are too simple, and their configuration lives outside of CKAN source, db, or config, introducing state where state shouldn't be. They have to be set up manually, and they have to be migrated manually as well. Cron really becomes inadequate for anything but the simplest CKAN deployments. It sounds like we could give rq a shot - it seems like a better alternative, but before it's decided we should address:
Celery is hardly used at the moment and I could find no extension that depends on it. The change should have no API impact.
+1 for rq
It's used in a few extensions that I know of:
We can migrate those easily enough, but celery is also used in at least a few large CKAN projects by orgs and governments that don't always release their code publicly. Don't assume that this change won't impact anybody; that's a terrible way to build a good open source project.
I agree, perhaps abstracting it out might be best. But if govs are not releasing their code related to ckan, they are breaching the licence :(
Just for the record (and I know no one said otherwise), Cdn open data does release all of our code to GitHub - unless there is something Ian isn't telling me ;-)
@rossjones @thriuin that's a good point. We could send out a notice to ckan-dev asking if anybody knows where celery is being used, and to point us to the code. If we don't find (m)any cases, we can move to rq?
We can always implement rq alongside, move the core extensions, and let people know they should move before release 2.x.y if they're depending on celery?
Came here to talk about schedulers; I'm glad rq has that covered :) In terms of deprecating celery, I don't think that is a major issue. Celery is not even a requirement for CKAN, so if somebody or some extension is using it they will already be taking care of installing it. We can announce deprecation (once rq support is implemented and tested!) and keep the celery code in core for a release (what @rossjones said, essentially). Maybe write a short guide about how to migrate jobs from celery to rq if necessary.
Just dropping in to ask if there's a definitive doc for using celery with CKAN? All I've been able to find is cobbled together from the readmes of a few extensions. Cheers!
@CarlQLange: There is a section about background tasks in the CKAN documentation.
@torfsen Aha! Thank you so much!
There is now a new PR for this, see ckan/ckan#3165.
Background jobs are now merged to master, thanks to the brilliant work by @torfsen. Check the docs for more details: http://docs.ckan.org/en/latest/maintaining/background-tasks.html
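For illustration, a minimal sketch of the new API per those docs - this assumes CKAN 2.7+, and the job function here is a placeholder:

    from ckan.plugins import toolkit

    def log_dataset(dataset_id):
        # placeholder job; runs in the worker process, not the web server
        print('processing %s' % dataset_id)

    # queue the job; a worker started with "paster jobs worker" picks it up
    toolkit.enqueue_job(log_dataset, ['some-dataset-id'])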
Wow, that looks fantastic. Great job @torfsen.
Hi @amercader. Also, the background-tasks doc uses webhooks as the first example of what background tasks are useful for. Is that "for example" aspirational, or a concrete example? :)
@jqnatividad, the old Celery system is deprecated but still available, hence anything that is working now should continue to work. AFAIK there is currently no time plan for removing the Celery system.
@jqnatividad, @torfsen wrote a great section on migrating to the new queue framework and on how to support both systems, so it should be really easy to update ckanext-webhooks to support it.
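A hedged sketch of the dual-support pattern that section describes, as one might apply it in ckanext-webhooks - the function and task names are hypothetical, and the exact fallback import may differ, so check the migration guide:

    def send_webhook(payload):
        pass  # placeholder for the actual HTTP POST

    def enqueue_notify(payload):
        try:
            # CKAN >= 2.7: the new RQ-based job system
            from ckan.plugins.toolkit import enqueue_job
            enqueue_job(send_webhook, [payload])
        except ImportError:
            # older CKAN: fall back to the deprecated Celery system
            from ckan.lib.celery_app import celery
            celery.send_task('webhooks.notify', args=[payload])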
@jqnatividad, the documentation @amercader is talking about is here: Migrating from CKAN’s previous background job system
CKAN has long had some integration with Celery for performing background tasks. However there have been issues and some people say we should do something else. It would be good to resolve this as we need it for ckan/ckan#1796 among other things.
Celery for:
Celery against:
Alternatives:
Tryggvi suggested these Celery best practices: https://denibertovic.com/posts/celery-best-practices/ (e.g. using Flower to monitor Celery nicely).