Publishing metrics for job summary #3467

Merged on Nov 1, 2017 (4 commits)

Conversation

Contributor

@diptanu diptanu commented Oct 30, 2017

Fixes #3465

This publishes the job summary metrics from the leader node, since they need to be published from only one of the servers.

I'll add docs and related changes once the implementation looks good to folks.

@dadgar @chelseakomlo

Contributor

@chelseakomlo chelseakomlo left a comment


Thanks for the quick PR! Looks good, left a couple comments.

nomad/leader.go Outdated
}
summary := raw.(*structs.JobSummary)
for name, tgSummary := range summary.Summary {
metrics.SetGauge([]string{"nomad", "job_summary", summary.JobID, name, "queued"}, float32(tgSummary.Queued))
Contributor

Contributor Author


Yeah I saw that, but I wasn't sure what the tags are going to be in this case.

Contributor Author


Any suggestions?

Contributor


I think the job should be a tag.

nomad/leader.go Outdated
ws := memdb.NewWatchSet()
iter, err := state.JobSummaries(ws)
if err != nil {
timer.Reset(s.config.StatsCollectionInterval)
Contributor


Is it necessary to reset the timer in each case statement?

Contributor Author

@diptanu diptanu Oct 30, 2017


@chelseakomlo Since we reset the timer at the end of the loop, we need to reset it before each continue in the error branches. If we don't reset the timer, we won't publish any more metrics.

Contributor


If the timing doesn't need to be exact, could you reset the timer at the beginning of the loop for all cases? Otherwise, maybe add comments to explain why the timer is reset only in some branches.

Contributor Author


@chelseakomlo I was thinking that, since the time taken by the work is O(n) rather than O(1), we might cascade a few ticks if the iteration takes longer for some reason, such as GC pauses or underlying OS issues. I can add comments to explain this.

Contributor


I think the simpler code is worth it: just reset once at the top of the case.

Contributor Author

diptanu commented Oct 30, 2017

@chelseakomlo I added support for tagged metrics, currently using the job ID and task group name as tags. Please let me know if that addresses your comment.

Contributor Author

diptanu commented Oct 30, 2017

Output from the metrics endpoint with this patch:

{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.complete",
    Value: 0
},
{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.failed",
    Value: 0
},
{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.lost",
    Value: 0
},
{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.queued",
    Value: 0
},
{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.running",
    Value: 1
},
{
    Labels: {
        job: "example",
        task_group: "cache"
    },
    Name: "nomad.nomad.job_summary.starting",
    Value: 0
},

@diptanu diptanu force-pushed the f-publish-job-summary-metrics branch from bfa9994 to 5351b10 Compare November 1, 2017 20:15
@diptanu diptanu merged commit 9acb30e into master Nov 1, 2017
@diptanu diptanu deleted the f-publish-job-summary-metrics branch November 1, 2017 20:16
metrics.SetGaugeWithLabels([]string{"nomad", "job_summary", "lost"},
float32(tgSummary.Lost), labels)
}
if s.config.BackwardsCompatibleMetrics {
Contributor


I like the irony of a new telemetry feature having a BackwardsCompatibleMetrics check :P

@preetapan
Contributor

@diptanu can you update the Changelog for this?

@jesusvazquez
Contributor

Hi, I can't find these metrics in https://www.nomadproject.io/docs/agent/telemetry.html. Can we update the documentation accordingly? These metrics are definitely useful.

@schmichael
Member

@jesusvazquez Good call. See #3467

@jesusvazquez
Contributor

Thanks @schmichael, I was about to implement my own solution for this and, to my surprise, it was already there!

Keep up the good work. 💪

schmichael added a commit that referenced this pull request Apr 18, 2018

github-actions bot commented Mar 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 6, 2023
7 participants