Improving status overview #3029

tino · 2017-08-15T11:40:30Z

Hi all,

Thanks for the hard work, and quick developments on Nomad! @dadgar mentioned the improvements for the status command in #2969, which sounds really good!

I'm coming from deploying by hand (with Fabric), where I always got immediate feedback and status about jobs and deployments. Nomad is really fast, but also a bit opaque about what is going on. I am using Hashi-UI as well as the CLI to get insight into what is happening, but that hasn't been too easy.

Therefore I have a suggestion for the output of status: make it easier to infer the health of jobs. Currently:

⌘ nomad status
ID                                    Type                 Priority  Status   Submit Date
app-production                        service              50        running  08/14/17 22:05:40 CEST
app-release                           service              50        running  08/15/17 09:34:04 CEST
django-script                         batch/parameterized  50        running  08/01/17 23:22:48 CEST
migrate                               batch/parameterized  50        running  08/13/17 20:55:52 CEST
migrate/dispatch-1502781729-48935f2d  batch                50        dead     08/15/17 09:22:09 CEST
nginx                                 system               50        running  08/09/17 23:21:20 CEST

shows me that all are running, but nothing more.

Then I run:

⌘ nomad status app-production
ID            = app-production
Name          = app-production
Submit Date   = 08/14/17 22:05:40 CEST
Type          = service
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost
api                0       0         2        0       12        0
celerybeat         0       0         1        0       6         0
default_worker     0       0         2        0       12        0
monitoring_worker  0       0         2        0       12        0
priority_worker    0       0         2        0       12        0
web                0       0         2        0       9         0

Latest Deployment
ID          = 7e37943a
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         2        2       2        0
web         true         2        1       0        1

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created At
35fd2089  bfbeae63  api                35       run      running  08/14/17 22:06:07 CEST
1e807ba4  fd664622  monitoring_worker  35       run      running  08/14/17 22:05:41 CEST
ad19d577  fd664622  default_worker     35       run      running  08/14/17 22:05:41 CEST
24dc3c4b  fd664622  api                35       run      running  08/14/17 22:05:41 CEST
27f1601c  fd664622  priority_worker    35       run      running  08/14/17 22:05:41 CEST
2e4e79d5  fd664622  web                35       run      running  08/14/17 22:05:41 CEST
1296b499  bfbeae63  priority_worker    35       run      running  08/14/17 22:05:41 CEST
5c30510f  fd664622  default_worker     35       run      running  08/14/17 22:05:41 CEST
ae86f9e8  bfbeae63  celerybeat         35       run      running  08/14/17 22:05:41 CEST
c1b3487c  bfbeae63  monitoring_worker  35       run      running  08/14/17 22:05:41 CEST
221fd4b3  bfbeae63  web                34       run      running  08/13/17 18:20:20 CEST

which produces a lot of output, reports "running" as the status, and suggests in the "Summary" table that everything is fine. Only by looking at the "Latest Deployment" section, or by spotting different version numbers in the allocations table, do I learn that something isn't right.

I would like a bigger or clearer sign that something isn't right. Maybe Status = running (unhealthy) would be enough.
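The suggested label could be derived roughly as follows. This is only a minimal sketch of the idea, not how Nomad computes its status: the function name `status_label` is hypothetical, and it assumes the job status and the latest deployment status are already available as the plain strings shown in the output above.

```python
def status_label(job_status, deployment_status=None):
    """Return the job status, flagged when the latest deployment failed.

    Hypothetical helper: both arguments are the strings printed by
    `nomad status <job>` ("running", "failed", ...).
    """
    if job_status == "running" and deployment_status == "failed":
        return "running (unhealthy)"
    return job_status


# Values taken from the app-production output above.
print("Status        =", status_label("running", "failed"))
# prints: Status        = running (unhealthy)
```

With a rule like this, a job whose latest deployment did not fail keeps its plain status, so the extra marker only shows up when there is something to investigate.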

Related, I'm confused by the "Deployed" table: web has 2 desired, 1 placed, 0 healthy and 1 unhealthy. As you can see from the allocations, there are 2 web allocations running, on versions 34 and 35. Both are running (and thus "placed" somewhere); one is healthy (34) and one is unhealthy (35). So that line should, IMHO, read 2 2 1 1.
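Until the output makes this more visible, the mixed versions can be spotted mechanically. The sketch below is an illustration only: it assumes the whitespace-separated column layout of the "Allocations" table shown above (ID, Node ID, Task Group, Version, ...), uses a few rows copied from that table, and the helper name `mixed_version_groups` is made up for this example.

```python
from collections import defaultdict

# Rows copied from the "Allocations" table above (api and web only).
ALLOCATIONS = """\
35fd2089  bfbeae63  api                35       run      running  08/14/17 22:06:07 CEST
24dc3c4b  fd664622  api                35       run      running  08/14/17 22:05:41 CEST
2e4e79d5  fd664622  web                35       run      running  08/14/17 22:05:41 CEST
221fd4b3  bfbeae63  web                34       run      running  08/13/17 18:20:20 CEST
"""


def mixed_version_groups(table):
    """Map each task group that runs more than one job version to its versions."""
    versions = defaultdict(set)
    for line in table.strip().splitlines():
        # Columns: alloc ID, node ID, task group, version, then the rest.
        _alloc_id, _node_id, group, version, *_rest = line.split()
        versions[group].add(int(version))
    return {g: sorted(v) for g, v in versions.items() if len(v) > 1}


print(mixed_version_groups(ALLOCATIONS))
# prints: {'web': [34, 35]}
```

Parsing CLI text is fragile, so this is a stopgap for reading the current output rather than a substitute for a clearer status line.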

shantanugadgil · 2017-08-15T17:40:46Z

+1 to the thoughts.
Immediately knowing what is wrong (if anything) would indeed be useful.
Usually status alone doesn't help and an alloc-status is required.

tino · 2018-02-15T20:41:15Z

Yup.

One more thing on this. I actually started to write a wrapper (in Python, a bit as an experiment) that takes the output of nomad run, then uses blocking queries to keep updating that output until none of the allocations are pending.

It would be awesome if the run command could actually do this for me, with a -follow flag or something like it.

I understand that you wouldn't use this in environments where jobs have 10+ allocations, but for everything below that it would make Nomad a lot more convenient.
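The wrapper described above could look something like the following. This is a rough sketch under stated assumptions, not the actual wrapper and not a built-in Nomad feature: it assumes an agent reachable at `http://127.0.0.1:4646` without ACL tokens or TLS, and it relies on the HTTP API's job allocations endpoint (`/v1/job/<job_id>/allocations`) with the `index` and `wait` blocking-query parameters and the `X-Nomad-Index` response header. The names `fetch_allocations` and `follow` are made up for illustration.

```python
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: local agent, no ACLs or TLS


def fetch_allocations(job_id, index=0, wait="30s"):
    """Run one blocking query; return (allocations, new_index)."""
    url = f"{NOMAD_ADDR}/v1/job/{job_id}/allocations?index={index}&wait={wait}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp), int(resp.headers["X-Nomad-Index"])


def follow(job_id, fetch=fetch_allocations, report=print):
    """Report allocation states until allocations exist and none are pending."""
    index = 0
    while True:
        # Passing the last seen index makes the server hold the request
        # until something changes or the wait time elapses.
        allocs, index = fetch(job_id, index)
        for alloc in allocs:
            report(f'{alloc["ID"][:8]}  {alloc["TaskGroup"]:<18} {alloc["ClientStatus"]}')
        # An empty list right after submitting a job is treated as "not yet".
        if allocs and not any(a["ClientStatus"] == "pending" for a in allocs):
            return allocs
```

Calling `follow("app-production")` right after `nomad run` would print one line per allocation on every change and return once nothing is pending. The `fetch` and `report` parameters exist only so the loop can be exercised without a running cluster.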

schmichael added theme/cli type/enhancement labels Aug 15, 2017

cgbaker mentioned this issue Feb 26, 2019

[Cli Improvement] 'nomad status' not displaying partial running task group #2134

Closed

cgbaker mentioned this issue Oct 7, 2019

[question] How to Long Poll job status #6436

Closed

mikenomitch mentioned this issue May 18, 2022

Add real time alloc summary for latest job version #13053

Open
