Improve Telemetry for Dispatch Jobs #4422

Open
Miserlou opened this issue Jun 15, 2018 · 2 comments

Miserlou commented Jun 15, 2018

Issue

We use Nomad to dispatch tens of millions of jobs.

Currently, the telemetry about these dispatched jobs is extremely poor. There are no summary statistics. The only slightly useful metric is stats.gauges.nomad.nomad.blocked_evals.total_blocked.mbp, but it is of limited value in a complex system because it isn't broken down by type.

There is a stats.gauges.nomad.nomad.job_summary.complete category, but unfortunately it doesn't actually provide any summary statistics; it's just a list of hundreds of thousands of dispatch job names, each with a value of 0. This is almost worse than useless.

[Screenshot, 2018-06-15, 1:26 PM]
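
To make the complaint above concrete, here is a minimal sketch of the kind of roll-up that the per-job list lacks: collapsing individual dispatch job IDs into one count per parent job. It assumes dispatched job IDs have the shape `<parent>/dispatch-<timestamp>-<suffix>`; the helper names and the sample IDs are hypothetical, and this is not how Nomad itself computes anything.

```python
import re
from collections import Counter

# Assumed ID shape for a dispatched child job: "<parent>/dispatch-<unix-ts>-<hex suffix>".
DISPATCH_ID = re.compile(r"^(?P<parent>.+)/dispatch-\d+-[0-9a-f]+$")


def parent_of(job_id: str) -> str:
    """Return the parent job name, or the ID unchanged if it is not a dispatch child."""
    match = DISPATCH_ID.match(job_id)
    return match.group("parent") if match else job_id


def count_by_parent(job_ids):
    """Collapse many per-dispatch job IDs into one count per parent job."""
    return Counter(parent_of(job_id) for job_id in job_ids)


# Hypothetical job IDs, for illustration only.
sample_ids = [
    "downloader/dispatch-1529083583-0a1b2c3d",
    "downloader/dispatch-1529083584-4e5f6a7b",
    "processor/dispatch-1529083590-8c9d0e1f",
    "scheduler",
]
counts = count_by_parent(sample_ids)
```

With a roll-up like this, hundreds of thousands of gauge entries would reduce to one series per dispatch job type.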

It would be excellent if there were a stats.gauges.nomad.nomad.dispatch_summary where dispatches could be broken down by type, so that we could see avg/max usage for cpu/iops/disk/memory for each of our dispatch job types. Without this, there is no useful telemetry for Nomad dispatch-based systems.
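
As a rough illustration of the summary being requested, the sketch below groups resource samples by dispatch job type and reports avg/max for each. The function name, the sample tuples, and the resource labels (`memory_mb`, `cpu_mhz`) are all hypothetical; nothing here reflects an existing Nomad metric or API.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Group (job_type, resource, value) samples and return avg/max per group."""
    grouped = defaultdict(list)
    for job_type, resource, value in samples:
        grouped[(job_type, resource)].append(value)
    return {
        key: {"avg": mean(values), "max": max(values)}
        for key, values in grouped.items()
    }


# Hypothetical usage samples, for illustration only.
samples = [
    ("downloader", "memory_mb", 100),
    ("downloader", "memory_mb", 300),
    ("processor", "cpu_mhz", 500),
]
summary = summarize(samples)
```

Each key in `summary` is one (job type, resource) pair, which is the granularity a dispatch_summary gauge would need to be useful on a dashboard.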


Miserlou commented Jun 15, 2018

Related-ish: #4422

@Miserlou

Not only is the current telemetry worse than useless, we've found that it actually adds about 15-25 seconds of overhead per connection, making it worse-than-worse-than-useless.
