Add stats to rkt driver #2400

blalor · 2017-03-05T13:05:54Z

Nomad v0.5.4

The rkt driver doesn't support stat collection/reporting. I initially thought it'd be possible to just report on the cpu, memory, etc. used by the executor, but since rkt handles container isolation and sets up cgroups for the pod, I don't think that will work.

rkt itself doesn't seem to expose any stats, but the rkt API does expose the cgroup. So:

1. add rkt api client to nomad
2. capture uuid of spawned pod with --uuid-file-save
3. call the api method InspectPod, which will return the cgroup, like "/machine.slice/machine-rkt\\x2d5922fb6f\\x2db4a9\\x2d4408\\x2daf94\\x2d419a4a6efbfe.scope"
4. use something like github.com/crosbymichael/cgroups#Stats to get the stats for the cgroup

I know very little about cgroups. On one of my instances:

$ ls -d /sys/fs/cgroup/*"/machine.slice/machine-rkt\\x2d5922fb6f\\x2db4a9\\x2d4408\\x2daf94\\x2d419a4a6efbfe.scope"
/sys/fs/cgroup/blkio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpuacct/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpu/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/cpuset/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/devices/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/freezer/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/hugetlb/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/memory/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_cls/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/net_prio/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/perf_event/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/pids/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope
/sys/fs/cgroup/systemd/machine.slice/machine-rkt\x2d5922fb6f\x2db4a9\x2d4408\x2daf94\x2d419a4a6efbfe.scope

Does this seem like a reasonable approach?


dadgar · 2017-03-06T20:58:19Z

Seems reasonable. Main concern is that they do not guarantee any stability of the API.

blalor · 2017-03-06T21:07:31Z

Agreed, that's a risk. The fallback would have to be to behave as it does today (returning nothing).

I notice that the executor already provides some stats via cgroup inspection; would it be possible to just defer to that, as the exec driver does?

dadgar · 2017-03-06T21:54:52Z

@blalor It would be. The main thing to get would be the cgroup parent (path at which the cgroup files are written) so that stats could be collected.

dadgar · 2017-03-06T21:55:15Z

@blalor Are you using rkt in production?

blalor · 2017-03-06T22:09:30Z

Getting there. It and Nomad (along with Consul, Vault, and Terraform) are core components of new infrastructure we're rolling out.

blalor · 2017-03-06T22:11:01Z

> @blalor It would be. The main thing to get would be the cgroup parent (path at which the cgroup files are written) so that stats could be collected.

I'm showing my ignorance of cgroups, but aren't the rkt-created cgroups children of the Nomad executor's?

dadgar · 2017-03-06T22:42:52Z

@blalor Yeah, you are right. It's been a while since I looked at the rkt code. It may be simpler to get this behavior: it may just be a matter of not using rkt's isolation in favor of Nomad's, and then stats would come for free.

ashald · 2018-02-27T15:35:42Z

We rely on a couple of metrics to auto-scale deployments on Nomad, including resource usage. Without this feature supported natively by Nomad, and without "pods" in Nomad (containers sharing namespaces but having different resource limits), we have to resort to dirty hacks such as injecting a primitive process manager into a container and running a custom script that analyzes the container's cgroup stats on tmpfs and exports that data using our monitoring network.

This not only increases complexity tremendously by adding a bunch of moving parts, but is also less reliable, since resource exhaustion within a container could lead to missing data if the script cannot execute.

github-actions · 2022-11-30T02:18:30Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added theme/driver/rkt type/enhancement labels Mar 6, 2017

schmichael mentioned this issue Apr 19, 2018

rkt: create parent cgroup to enable stats #4188

Merged

schmichael closed this as completed in #4188 Apr 24, 2018

github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022
