
verdi daemon status command fails #2485

Closed
ltalirz opened this issue Feb 15, 2019 · 9 comments · Fixed by #3683

Comments

@ltalirz
Member

ltalirz commented Feb 15, 2019

This is on provenance_redesign:

From time to time, I get an error when running verdi daemon status:

$ verdi daemon status
Profile: test_qb
Traceback (most recent call last):
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/bin/verdi", line 10, in <module>
    sys.exit(verdi())
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/leopold/Applications/miniconda3/envs/aiida_rmq/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_rmq/aiida/cmdline/commands/cmd_daemon.py", line 82, in status
    result = get_daemon_status(client)
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_rmq/aiida/cmdline/utils/daemon.py", line 72, in get_daemon_status
    worker_row = [worker_pid, worker_info['mem'], worker_info['cpu'], format_local_time(worker_info['create_time'])]
TypeError: string indices must be integers, not str
@ltalirz
Member Author

ltalirz commented May 22, 2019

Haven't seen this in a while. Closing now - happy to reopen if this resurfaces.

@ltalirz ltalirz closed this as completed May 22, 2019
@ConradJohnston
Contributor

Hi @ltalirz ,
I'm seeing this with a fresh install of the develop branch.

@ltalirz
Member Author

ltalirz commented Dec 17, 2019

Hi @ConradJohnston - thanks for the report.
Would you mind printing the content of worker_info (and perhaps even worker_response)?

@ltalirz ltalirz reopened this Dec 17, 2019
@ltalirz ltalirz closed this as completed Dec 17, 2019
@ltalirz ltalirz reopened this Dec 17, 2019
@ConradJohnston
Contributor

ConradJohnston commented Dec 17, 2019

Some more info:

I can replicate the fault quite reliably if I run the command in quick succession like this:
verdi daemon status ; verdi daemon status

It seems to be due to this line:

worker_response = client.get_worker_info()

EDIT:
In response to Leo's comment:

This is the content of worker_response when it works:
{'status': 'ok', 'time': 1576585659.221961, 'name': 'aiida-production', 'info': {'4990': {'mem_info1': '37M', 'mem_info2': '4G', 'cpu': 0.0, 'mem': 0.231, 'ctime': '0:00.48', 'pid': 4990, 'username': 'cjohnson', 'nice': 0, 'create_time': 1576585658.730482, 'age': 0.48801684379577637, 'cmdline': 'python3.6', 'children': [], 'started': 1576585658.7299762, 'wid': 1}}, 'id': '4e1d768a522a44b59f85039806f9af14'}

and when it fails:
{'status': 'ok', 'time': 1576585660.2456262, 'name': 'aiida-production', 'info': {'4990': 'No such process (stopped?)'}, 'id': '148af3087f9347fb98ef3e58985e6e84'}
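
To make the link to the traceback explicit, here is a minimal sketch that reproduces the TypeError from the two payloads above. The dictionaries are trimmed copies of the 'info' entries, and `read_mem` is a hypothetical stand-in for the indexing done in get_daemon_status, not the actual aiida code.

```python
# Payloads trimmed from the two responses quoted above.
ok_info = {'4990': {'mem': 0.231, 'cpu': 0.0, 'create_time': 1576585658.730482}}
bad_info = {'4990': 'No such process (stopped?)'}


def read_mem(info):
    """Index each worker entry by key, as the failing line in the traceback does.

    Hypothetical helper for illustration only.
    """
    return [worker_info['mem'] for worker_info in info.values()]


# The entry is a dict, so the key lookup works.
print(read_mem(ok_info))

try:
    read_mem(bad_info)
except TypeError as exc:
    # The entry is a plain string, so indexing it with 'mem' raises TypeError.
    print('TypeError:', exc)
```

So the status is still 'ok' in the failing case; only the per-worker entry changes from a dict to an error string.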

@ltalirz
Member Author

ltalirz commented Dec 17, 2019

@sphuber as discovered by Conrad, worker_response['info'] contains an error message when it doesn't find the worker.
Could you perhaps provide some guidance on where this should be fixed?

@sphuber
Contributor

sphuber commented Dec 17, 2019

I cannot reproduce the behavior myself, even when calling the command twice in a row and even with a very busy daemon. However, this is on a powerful server. @ConradJohnston, I take it the problem is transient and the command works when issued again some time after it failed? It seems that, when called in quick succession, the circus daemon process sometimes fails to poll one or more of the daemon workers. There will probably always be a possibility of this, so we should simply add error handling to the get_daemon_status function. I will make a PR.
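
The kind of error handling meant here could look roughly like the following. This is only a sketch under assumptions, not the code of the eventual PR: `build_worker_rows` is a hypothetical helper, the '-' placeholders are an arbitrary choice, and the time formatting done by format_local_time in the real function is left out.

```python
def build_worker_rows(worker_response):
    """Return one table row per worker, tolerating workers that could not be polled.

    Hypothetical helper for illustration; not part of aiida-core.
    """
    rows = []
    for worker_pid, worker_info in worker_response['info'].items():
        if isinstance(worker_info, dict):
            # Normal case: the entry holds the statistics of the worker.
            rows.append([worker_pid, worker_info['mem'], worker_info['cpu'], worker_info['create_time']])
        else:
            # The entry is an error string such as 'No such process (stopped?)',
            # so show placeholders instead of indexing into it.
            rows.append([worker_pid, '-', '-', '-'])
    return rows


# Trimmed versions of the two responses reported earlier in this thread.
working = {'status': 'ok', 'info': {'4990': {'mem': 0.231, 'cpu': 0.0, 'create_time': 1576585658.730482}}}
failing = {'status': 'ok', 'info': {'4990': 'No such process (stopped?)'}}

print(build_worker_rows(working))
print(build_worker_rows(failing))
```

Checking the type of each entry keeps the command usable when a single worker fails to respond, at the cost of showing incomplete data for that worker.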

@ConradJohnston
Contributor

@sphuber - It's indeed transient. I cannot always reproduce it, even when using a loop to hammer the DB, while at other times it simply happens. There does seem to be some sort of performance issue occurring with my fresh Postgres installation though, which I suppose this is a symptom of.

@sphuber
Contributor

sphuber commented Dec 17, 2019

This should have nothing to do with the database, it does not touch it at all.

@ConradJohnston
Contributor

ConradJohnston commented Dec 17, 2019

@sphuber Hmm, I'm experiencing this quite frequently even without issuing the commands in succession. Your PR gives some relief - but what is the underlying cause of this problem? I haven't seen this behaviour for other installations on other machines.


3 participants