Add stats to rkt driver #2400

Closed
blalor opened this issue Mar 5, 2017 · 9 comments

Comments

@blalor
Contributor

blalor commented Mar 5, 2017

Nomad v0.5.4

The rkt driver doesn't support stat collection/reporting. I initially thought it'd be possible to just report on the cpu, memory, etc. used by the executor, but since rkt handles container isolation and sets up cgroups for the pod, I don't think that will work.

rkt itself doesn't seem to expose any stats, but the rkt API does expose the cgroup. So:

  1. add the rkt API client to Nomad
  2. capture uuid of spawned pod with --uuid-file-save
  3. call the api method InspectPod, which will return the cgroup, like "/machine.slice/machine-rkt\\x2d5922fb6f\\x2db4a9\\x2d4408\\x2daf94\\x2d419a4a6efbfe.scope"
  4. use something like github.com/crosbymichael/cgroups#Stats to get the stats for the cgroup

I know very little about cgroups. On one of my instances:

$ ls -d /sys/fs/cgroup/*"/machine.slice/machine-rkt\\x2d5922fb6f\\x2db4a9\\x2d4408\\x2daf94\\x2d419a4a6efbfe.scope"
/sys/fs/cgroup/blkio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpuacct/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpu/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpuset/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/devices/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/freezer/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/hugetlb/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/memory/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_cls/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_prio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/perf_event/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/pids/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/systemd/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope

Does this seem like a reasonable approach?

@dadgar
Contributor

dadgar commented Mar 6, 2017

Seems reasonable. Main concern is that they do not guarantee any stability of the API.

@blalor
Contributor Author

blalor commented Mar 6, 2017

Agreed, that's a risk. The fall-back behavior would have to be to behave as it does today (returning nothing).

I notice that the executor already provides some stats via cgroup inspection; would it be possible to just defer to that, as the exec driver does?

@dadgar
Contributor

dadgar commented Mar 6, 2017

@blalor It would be. The main thing to get would be the cgroup parent (path at which the cgroup files are written) so that stats could be collected.

@dadgar
Contributor

dadgar commented Mar 6, 2017

@blalor Are you using rkt in production?

@blalor
Contributor Author

blalor commented Mar 6, 2017

Getting there. It and Nomad (along with Consul, Vault, and Terraform) are core components of new infrastructure we're rolling out.

@blalor
Contributor Author

blalor commented Mar 6, 2017

> @blalor It would be. The main thing to get would be the cgroup parent (path at which the cgroup files are written) so that stats could be collected.

I'm showing my ignorance of cgroups, but aren't the rkt-created cgroups children of the Nomad executor's?

@dadgar
Contributor

dadgar commented Mar 6, 2017

@blalor Yeah, you are right. It's been a while since I looked at the rkt code. It may be simpler to get this behavior: it may just be a matter of not using rkt's isolation in favor of Nomad's, and then stats would come for free.

@ashald

ashald commented Feb 27, 2018

We rely on a couple of metrics, including resource usage, to auto-scale deployments on Nomad. Without native support for this feature in Nomad, and without "pods" in Nomad (containers sharing namespaces but having different resource limits), we have to resort to dirty hacks such as injecting a primitive process manager into a container and running a custom script that analyzes the container's cgroup stats on tmpfs and exports that data using our monitoring network.

This not only increases complexity tremendously by adding a bunch of moving parts, but is also less reliable, since resource exhaustion within a container might lead to missing data if the script cannot execute.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022