Improving status overview #3029

Open
tino opened this issue Aug 15, 2017 · 2 comments

@tino

tino commented Aug 15, 2017

Hi all,

Thanks for the hard work and the quick development on Nomad! @dadgar mentioned improvements to the status command in #2969, which sound really good!

I'm coming from deploying by hand (with Fabric), where I always got immediate feedback on the status of jobs and deployments. Nomad is really fast, but it is also a bit opaque about what is going on. I use Hashi-UI as well as the CLI to get insight into what is happening, but that hasn't been easy.

So I have a suggestion for the output of status: make it easier to infer the health of jobs. Currently:

⌘ nomad status
ID                                    Type                 Priority  Status   Submit Date
app-production                        service              50        running  08/14/17 22:05:40 CEST
app-release                           service              50        running  08/15/17 09:34:04 CEST
django-script                         batch/parameterized  50        running  08/01/17 23:22:48 CEST
migrate                               batch/parameterized  50        running  08/13/17 20:55:52 CEST
migrate/dispatch-1502781729-48935f2d  batch                50        dead     08/15/17 09:22:09 CEST
nginx                                 system               50        running  08/09/17 23:21:20 CEST

shows me that all are running, but nothing more.

Then I run:

⌘ nomad status app-production
ID            = app-production
Name          = app-production
Submit Date   = 08/14/17 22:05:40 CEST
Type          = service
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost
api                0       0         2        0       12        0
celerybeat         0       0         1        0       6         0
default_worker     0       0         2        0       12        0
monitoring_worker  0       0         2        0       12        0
priority_worker    0       0         2        0       12        0
web                0       0         2        0       9         0

Latest Deployment
ID          = 7e37943a
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         2        2       2        0
web         true         2        1       0        1

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created At
35fd2089  bfbeae63  api                35       run      running  08/14/17 22:06:07 CEST
1e807ba4  fd664622  monitoring_worker  35       run      running  08/14/17 22:05:41 CEST
ad19d577  fd664622  default_worker     35       run      running  08/14/17 22:05:41 CEST
24dc3c4b  fd664622  api                35       run      running  08/14/17 22:05:41 CEST
27f1601c  fd664622  priority_worker    35       run      running  08/14/17 22:05:41 CEST
2e4e79d5  fd664622  web                35       run      running  08/14/17 22:05:41 CEST
1296b499  bfbeae63  priority_worker    35       run      running  08/14/17 22:05:41 CEST
5c30510f  fd664622  default_worker     35       run      running  08/14/17 22:05:41 CEST
ae86f9e8  bfbeae63  celerybeat         35       run      running  08/14/17 22:05:41 CEST
c1b3487c  bfbeae63  monitoring_worker  35       run      running  08/14/17 22:05:41 CEST
221fd4b3  bfbeae63  web                34       run      running  08/13/17 18:20:20 CEST

which produces a lot of output, shows "running" as the Status, and reports everything as running fine in the "Summary" table. Only by looking at "Latest Deployment", or by spotting different version numbers in the Allocations table, do I learn that something isn't right.
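The manual check described above (spotting different version numbers in the Allocations table) can be sketched in a few lines of Python. This is only an illustration: it parses a few rows copied from the output above, `mixed_version_groups` is a hypothetical helper name, and it assumes task group names contain no spaces.

```python
from collections import defaultdict

# A few rows copied from the Allocations table above (header included).
ALLOCATIONS = """\
ID        Node ID   Task Group         Version  Desired  Status   Created At
35fd2089  bfbeae63  api                35       run      running  08/14/17 22:06:07 CEST
24dc3c4b  fd664622  api                35       run      running  08/14/17 22:05:41 CEST
2e4e79d5  fd664622  web                35       run      running  08/14/17 22:05:41 CEST
221fd4b3  bfbeae63  web                34       run      running  08/13/17 18:20:20 CEST
"""


def mixed_version_groups(table_text):
    """Return {task_group: sorted versions} for groups running more than one version."""
    versions = defaultdict(set)
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a data row
        # Data columns: ID, Node ID, Task Group, Version, Desired, Status, ...
        task_group, version = fields[2], int(fields[3])
        versions[task_group].add(version)
    return {group: sorted(v) for group, v in versions.items() if len(v) > 1}


print(mixed_version_groups(ALLOCATIONS))
```

For the rows above this flags only `web`, which runs versions 34 and 35 side by side.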

I would like a bigger or clearer sign that something isn't right. Maybe Status = running (unhealthy) would be enough.
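A minimal sketch of that idea, assuming the label is derived from the job status plus the status of the latest deployment. The function name `display_status` and the rule "failed deployment means unhealthy" are assumptions for illustration, not how Nomad computes anything today.

```python
def display_status(job_status, deployment_status=None):
    """Combine the job status with the latest deployment status into one label.

    Assumption: a running job whose latest deployment failed is shown as
    "running (unhealthy)"; every other combination keeps the plain job status.
    """
    if job_status == "running" and deployment_status == "failed":
        return "running (unhealthy)"
    return job_status


# Values taken from the app-production output above.
print(display_status("running", "failed"))
```

With the values from the app-production output this yields `running (unhealthy)`, while a job without a failed deployment keeps its plain status.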

Related: I'm confused by the "Deployed" table. web has 2 desired, 1 placed, 0 healthy and 1 unhealthy. As you can see from the allocations, there are 2 web tasks running, at versions 34 and 35. Both are running (and thus "placed" somewhere); one is healthy (34) and one is unhealthy (35). So imho that line should be 2 2 1 1.
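The arithmetic behind the proposed 2 2 1 1 row can be written out as follows. The health flags are my reading of the allocations (they are not printed in the Allocations table), and counting every running allocation of the group regardless of version is the interpretation proposed here, not necessarily how Nomad fills in the "Deployed" table. `deployed_row` is a hypothetical helper.

```python
# (task_group, version, healthy) for the two web allocations listed above.
# The healthy flags are an assumption based on the failed deployment of version 35.
WEB_ALLOCS = [
    ("web", 35, False),  # 2e4e79d5, part of the failed deployment
    ("web", 34, True),   # 221fd4b3, the older version that is still running
]


def deployed_row(allocs, desired):
    """Return (desired, placed, healthy, unhealthy), counting all running allocations."""
    placed = len(allocs)
    healthy = sum(1 for _group, _version, ok in allocs if ok)
    return (desired, placed, healthy, placed - healthy)


print(deployed_row(WEB_ALLOCS, desired=2))
```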

@shantanugadgil
Contributor

+1 to the thoughts.
Immediately knowing what is wrong (if anything) would indeed be useful.
Usually status alone doesn't help and an alloc-status is required.

@tino
Author

tino commented Feb 15, 2018

Yup.

One more thing on this. I actually started writing a wrapper (in Python, a bit as an experiment) that takes the output of nomad run, then uses blocking queries to update the output until none of the allocations are pending.

It would be awesome if the run command could actually do this for me, with a -follow flag or something like it.

I understand that you wouldn't use this in environments where jobs have 10+ allocations, but for everything below that, this would make Nomad a lot more convenient.
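Roughly, the wrapper idea looks like the sketch below. It is not the actual wrapper, just a minimal version under a few assumptions: a Nomad agent reachable at `http://127.0.0.1:4646`, the job allocations endpoint of the HTTP API queried as a blocking query (`index` and `wait` parameters, `X-Nomad-Index` response header), and "pending" judged by each allocation's `ClientStatus`. The names `fetch_allocations` and `follow` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: local agent on the default port


def fetch_allocations(job_id, index=0, wait="30s"):
    """Blocking query for a job's allocations; returns (allocations, new_index)."""
    job = urllib.parse.quote(job_id, safe="")  # dispatched job IDs contain "/"
    url = f"{NOMAD_ADDR}/v1/job/{job}/allocations?index={index}&wait={wait}"
    with urllib.request.urlopen(url) as resp:
        new_index = int(resp.headers.get("X-Nomad-Index", index))
        return json.load(resp), new_index


def follow(job_id, fetch=fetch_allocations, report=print):
    """Report allocation status counts until no allocation is pending."""
    index = 0
    while True:
        # Returns at once on the first call, then blocks until the
        # allocations change or the wait time expires.
        allocs, index = fetch(job_id, index)
        counts = {}
        for alloc in allocs:
            status = alloc["ClientStatus"]
            counts[status] = counts.get(status, 0) + 1
        report(counts)
        # Keep waiting while nothing is placed yet or something is still pending.
        if allocs and counts.get("pending", 0) == 0:
            return counts
```

Calling `follow("app-production")` right after nomad run would then print one line of counts per change and return once nothing is pending. The `fetch` and `report` parameters are there so the loop can be tried against canned data without a running cluster.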
