Supervisor removal #790
Conversation
For the test that fails: please run aiida/docs/update_req_for_rtd.py locally and commit the changes it makes. It's possible that the requirements for RTD no longer need to be maintained, but for now we update them "by hand" with that script. Thanks!
Force-pushed from 55dbcc8 to 295c15c
Hi @dev-zero,
a few small requests:
- could you rebase/merge and resolve the conflicts?
- at the end of the tests (in .travis-data/test_daemon.py or similar) the daemon log file used to be printed. It is not printed anymore because the filename changed. Can you adapt the test script? (A possible sketch follows this list.)
- in the travis output I see three processes. Does this mean there are three independent daemon workers acting in parallel? If so, for the moment we must limit it to 1 worker: some parts of the code (I think mainly the old workflows at this stage) would be run multiple times if multiple workers start working on them.
- Apart from this, do I understand correctly that this is ready to be merged?
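A minimal sketch of what the log-printing step at the end of the test script could look like, assuming the log now lives at ~/.aiida/daemon/log/celery.log (the path shown in the daemon output further down); the actual script and path may differ:

```python
# Sketch only: dump the daemon log at the end of the Travis run, now that
# the log file name has changed. The path below is an assumption.
import os

log_file = os.path.expanduser('~/.aiida/daemon/log/celery.log')
if os.path.isfile(log_file):
    print('### Content of the daemon log file ({}) ###'.format(log_file))
    with open(log_file) as handle:
        print(handle.read())
else:
    print("No daemon log file found at '{}'".format(log_file))
```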
Force-pushed from 295c15c to 3c6690d
command=celery worker -A tasks --loglevel=INFO --beat --schedule={daemon_dir}/celerybeat-schedule
directory={aiida_code_home}/daemon/
user={local_user}
numprocs=1
I think this is the line that was limiting the # of procs to 1 (and so probably indirectly the # of celery workers).
No, it does not. It only tells supervisor to start a single celery worker process instead of multiple ones.
Here is the analysis with the current develop branch, using supervisor:
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon start
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon ...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon status
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:02s ago
## Found 1 process running:
* aiida-daemon[aiida-daemon] RUNNING pid 17019, uptime 0:03:05
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ pstree -p 17019
celery(17019)─┬─celery(17033)
├─celery(17034)
├─celery(17035)
├─celery(17036)
├─celery(17037)
├─celery(17038)
├─celery(17039)
├─celery(17040)
├─celery(17044)
├─{celery}(17041)
├─{celery}(17042)
└─{celery}(17043)
and here is the output without supervisor:
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon start
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon (log file: /users/tiziano/.aiida/daemon/log/celery.log)...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon status
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:01s ago
Daemon is running as pid 17868 since 2017-11-04 14:51:59.590000, child processes:
* celery[17899] sleeping, started at 2017-11-04 14:52:01
* celery[17900] sleeping, started at 2017-11-04 14:52:01
* celery[17901] sleeping, started at 2017-11-04 14:52:01
* celery[17902] sleeping, started at 2017-11-04 14:52:01
* celery[17903] sleeping, started at 2017-11-04 14:52:01
* celery[17904] sleeping, started at 2017-11-04 14:52:01
* celery[17905] sleeping, started at 2017-11-04 14:52:01
* celery[17906] sleeping, started at 2017-11-04 14:52:01
* celery[17910] sleeping, started at 2017-11-04 14:52:01
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ pstree -p 17868
celery(17868)─┬─celery(17899)
├─celery(17900)
├─celery(17901)
├─celery(17902)
├─celery(17903)
├─celery(17904)
├─celery(17905)
├─celery(17906)
├─celery(17910)
├─{celery}(17907)
├─{celery}(17908)
└─{celery}(17909)
I used verdi setup ... with the SQLA backend to set up my profile, so I don't think there is something wrong with it.
Can you please check on one of your setups (preferably one with more than 1 CPU)?
I just added a comment on the line that I think was limiting it to 1 worker only.
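As an aside, one way to double-check the pool size on a given setup is Celery's inspect API; a sketch, assuming the same app/broker configuration as the daemon's tasks.py is reachable from where you run it:

```python
# Sketch: ask the running worker(s) how many pool processes they use.
from celery import Celery

app = Celery('tasks')  # assumes the broker used by the daemon is configured/reachable
stats = app.control.inspect().stats() or {}
for worker_name, info in stats.items():
    # 'pool' -> 'max-concurrency' reports the number of pool processes per worker
    print(worker_name, info.get('pool', {}).get('max-concurrency'))
```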
@giovannipizzi I looked into the worker limitation: I can't confirm that with my setup. The number of celery workers does not seem to be limited at any point.
Ok, interesting, I was convinced that there was only 1 worker... Maybe I was wrong, or maybe this changed over time. Let's merge this then. Do you know in the new setup how to limit the number of processes?
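For reference, one way to limit the pool size in the new setup would be the worker's -c/--concurrency option (see the celery.bin.worker docs linked below) or the equivalent app-level setting; a hedged sketch, using the Celery >= 4 setting name (Celery 3.x uses CELERYD_CONCURRENCY instead):

```python
# Sketch: limit the celery worker to a single pool process.
# On the command line this is the -c/--concurrency option, e.g.
#     celery worker -A tasks --concurrency=1 ...
# or equivalently as an app-level setting:
from celery import Celery

app = Celery('tasks')            # illustrative app, matching `-A tasks` above
app.conf.worker_concurrency = 1  # same effect as --concurrency=1
```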
This is a very interesting topic. It's a pity that I didn't notice it before. So the command should change from the first form to the second. The result of pstree for the first command is:
And for the second
I don't understand why there are 2 subprocesses with the
http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-c
I believe that all of this is closely related to the workflow testing and debugging that we currently perform, and that it is a good opportunity to have a small chat about:
I don't know if @lekah is also interested in this discussion.
I guess the reason for having supervisor was that more background processes needing to be managed were expected in the future, plus that supervisor would always restart the daemon in case it crashes. The second feature we don't have anymore now that supervisor is gone, but Celery does a good job of restarting its workers (which are spawned using fork and therefore run in a completely separate environment), so a full crash should never happen, and if it does (for example if the filesystem is full), we want the user to investigate.
About the 2 processes with
Now, for the concurrency issues: all the tasks in
which should write-lock only the selected row until the end of the transaction. If you can't restrict the lock to a small and otherwise mostly independent resource/row, this kind of locking may introduce other problems, like stalling workers, and you should therefore rather use
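To illustrate the row-level locking mentioned above, here is a hedged sketch using plain SQLAlchemy (not AiiDA's actual code; the table, column and connection string are made up): Query.with_for_update() emits SELECT ... FOR UPDATE and holds the write lock on the selected row only until the surrounding transaction ends.

```python
# Sketch only (hypothetical model, not AiiDA's): lock a single row with
# SELECT ... FOR UPDATE so that concurrent workers cannot process it twice.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class DbTask(Base):                 # hypothetical table standing in for a daemon task row
    __tablename__ = 'db_task'
    id = Column(Integer, primary_key=True)
    state = Column(String)

engine = create_engine('postgresql:///aiidadb')   # placeholder connection string
Session = sessionmaker(bind=engine)
session = Session()

try:
    # Only this row is write-locked; another worker selecting the same row with
    # FOR UPDATE blocks here until our transaction commits or rolls back.
    task = (session.query(DbTask)
                   .filter(DbTask.id == 42)
                   .with_for_update()
                   .one())
    if task.state == 'NEW':
        task.state = 'PROCESSED'
    session.commit()
except Exception:
    session.rollback()
    raise
```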