-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Publishing metrics for job summary #3467
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick PR! Looks good, left a couple comments.
nomad/leader.go
Outdated
} | ||
summary := raw.(*structs.JobSummary) | ||
for name, tgSummary := range summary.Summary { | ||
metrics.SetGauge([]string{"nomad", "job_summary", summary.JobID, name, "queued"}, float32(tgSummary.Queued)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of Nomad 0.7, metrics are published in a tagged format. See:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I saw that, but I wasn't sure what the tags are going to be in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the job should be a tag.
nomad/leader.go
Outdated
ws := memdb.NewWatchSet() | ||
iter, err := state.JobSummaries(ws) | ||
if err != nil { | ||
timer.Reset(s.config.StatsCollectionInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to reset the timer in each case statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chelseakomlo Since we are resetting the timer at the end of the loop, we need to reset it before doing a continue
in the error branches. If we don't reset the timer, we won't publish anymore metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the timing doesn't need to be exact, could you reset the timer at the beginning of the loop for all cases? Otherwise then maybe add comments to explain why timers are set only for some branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chelseakomlo I was just thinking since the time taken by the work is not O(1) but O(n) we might cascade a few ticks if the iteration takes longer for some reason like GC pauses or underlying OS issues. I can add comments to explain this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth the simpler code to just reset once at the top of the case
@chelseakomlo I added support for tagged metrics, currently making job id and taskgroup name as tags. Please let me know if that addresses your comment regarding that? |
From the metrics endpoint, with this patch -
|
bfa9994
to
5351b10
Compare
metrics.SetGaugeWithLabels([]string{"nomad", "job_summary", "lost"}, | ||
float32(tgSummary.Lost), labels) | ||
} | ||
if s.config.BackwardsCompatibleMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like the irony of a new telemetry feature have a BackwardsCompatibleMetrics
check :P
@diptanu can you update the Changelog for this |
Hi, I can't find this metrics in here https://www.nomadproject.io/docs/agent/telemetry.html. Can we update the documentation accordingly? This metrics are definitely useful. |
@jesusvazquez Good call. See #3467 |
Thanks @schmichael I was about to implement my own solution for this and to my surprise it was already there!. Keep up the good work. 💪 |
Document job telemetry from #3467
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Fixes #3465
Publishing the job metrics in the leader node since they need to be published only from one of the servers.
Going to add docs, and related things once the implementation looks good to folks.
@dadgar @chelseakomlo