
Memory Usage of any Allocation is always 0 bytes when using cgroups v2 #12088

Closed
AlekseyMelikov opened this issue Feb 18, 2022 · 16 comments
Labels: stage/accepted, theme/cgroups, type/enhancement

@AlekseyMelikov

I am having this issue now. I don't remember when it first appeared, but I do remember the problem was already present in these versions:

Nomad v1.2.1 (719c53ac0ebee95d902faafe59a30422a091bc31)
Consul v1.10.4 Revision 7bbad6fe Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Docker version 20.10.11, build dea9396

I have now updated to

Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)
Consul v1.11.3 Revision e319d7ed Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Docker version 20.10.12, build e91ed57

Linux 5c24b868 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

but the problem persists.

Description of the problem: Memory Usage of any Allocation is always 0 bytes.

Host Resource Utilization is shown correctly.

docker stats

CONTAINER ID   NAME            CPU %     MEM USAGE / LIMIT   MEM %     NET I/O   BLOCK I/O         PIDS
406f2[EDITED]   [EDITED]         0.03%     30.83MiB / 100MiB   30.83%    0B / 0B   4.1kB / 13.5MB    8
c667b[EDITED]   [EDITED]         0.00%     212KiB / 3.745GiB   0.01%     0B / 0B   0B / 0B           1
e46fe[EDITED]   [EDITED]         0.02%     40.43MiB / 200MiB   20.21%    0B / 0B   365kB / 369kB     8
47ae4[EDITED]   [EDITED]         0.00%     216KiB / 3.745GiB   0.01%     0B / 0B   0B / 0B           1
c8258[EDITED]   [EDITED]         0.03%     48.54MiB / 200MiB   24.27%    0B / 0B   0B / 8.19kB       11
22961[EDITED]   [EDITED]         0.01%     7.527MiB / 50MiB    15.05%    0B / 0B   750kB / 0B        2
21e1b[EDITED]   [EDITED]         0.28%     95.11MiB / 400MiB   23.78%    0B / 0B   58.5MB / 47MB     19
0f64b[EDITED]   [EDITED]         0.05%     58.89MiB / 100MiB   58.89%    0B / 0B   51.7MB / 2.09MB   18
caa34[EDITED]   [EDITED]         0.14%     42.91MiB / 100MiB   42.91%    0B / 0B   34.8MB / 0B       10
d13ea[EDITED]   [EDITED]         0.01%     10.52MiB / 50MiB    21.03%    0B / 0B   30.4MB / 0B       2
d3689[EDITED]   [EDITED]         1.87%     246.3MiB / 400MiB   61.58%    0B / 0B   33.7MB / 1.5MB    8
db532[EDITED]   [EDITED]         0.20%     129.3MiB / 600MiB   21.54%    0B / 0B   59MB / 57.1MB     31
60f28[EDITED]   [EDITED]         2.15%     12.92MiB / 100MiB   12.92%    0B / 0B   12.6MB / 6MB      5
e4914[EDITED]   [EDITED]         0.01%     16.39MiB / 50MiB    32.78%    0B / 0B   2.2MB / 69.6kB    7
1as2c[EDITED]   [EDITED]         0.38%     80.51MiB / 400MiB   20.13%    0B / 0B   54.7MB / 373kB    7
dd8bb[EDITED]   [EDITED]         0.12%     39.33MiB / 100MiB   39.33%    0B / 0B   29MB / 0B         12

cat /proc/cgroups

#subsys_name	hierarchy	num_cgroups	enabled
cpuset	0	102	1
cpu	0	102	1
cpuacct	0	102	1
blkio	0	102	1
memory	0	102	1
devices	0	102	1
freezer	0	102	1
net_cls	0	102	1
perf_event	0	102	1
net_prio	0	102	1
hugetlb	0	102	1
pids	0	102	1
rdma	0	102	1
Nomad logs
Feb 13 08:26:58 host-name systemd[1]: Started Nomad.
Feb 13 08:26:59 host-name nomad[697]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Feb 13 08:26:59 host-name nomad[697]: ==> Loaded configuration from /etc/nomad.d/client.hcl, /etc/nomad.d/nomad.hcl, /etc/nomad.d/server.hcl
Feb 13 08:26:59 host-name nomad[697]: ==> Starting Nomad agent...
Feb 13 08:27:00 host-name nomad[697]: ==> Nomad agent configuration:
Feb 13 08:27:00 host-name nomad[697]:        Advertise Addrs: HTTP: 172.16.0.2:4646; RPC: 172.16.0.2:4647; Serf: 172.16.0.2:4648
Feb 13 08:27:00 host-name nomad[697]:             Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Feb 13 08:27:00 host-name nomad[697]:                 Client: true
Feb 13 08:27:00 host-name nomad[697]:              Log Level: INFO
Feb 13 08:27:00 host-name nomad[697]:                 Region: global (DC: dc-name)
Feb 13 08:27:00 host-name nomad[697]:                 Server: true
Feb 13 08:27:00 host-name nomad[697]:                Version: 1.2.6
Feb 13 08:27:00 host-name nomad[697]: ==> Nomad agent started! Log data will stream in below:
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.273Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.273Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.273Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.273Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.273Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.331Z [INFO]  nomad.raft: restored from snapshot: id=61-278615-1644597830311
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.443Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:172.16.0.2:4647 Address:172.16.0.2:4647}]"
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.443Z [INFO]  nomad.raft: entering follower state: follower="Node at 172.16.0.2:4647 [Follower]" leader=
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.445Z [INFO]  nomad: serf: EventMemberJoin: host-name.global 172.16.0.2
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.445Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.446Z [INFO]  nomad: serf: Attempting re-join to previously known node: dc-name-host-name.global: 172.16.0.2:4648
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.446Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.447Z [INFO]  nomad: serf: Re-joined to previously known node: dc-name-host-name.global: 172.16.0.2:4648
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.447Z [INFO]  client: using state directory: state_dir=/opt/nomad/data/client
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.448Z [INFO]  nomad: adding server: server="host-name.global (Addr: 172.16.0.2:4647) (DC: dc-name)"
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.449Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.449Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.449Z [WARN]  client: could not initialize cpuset cgroup subsystem, cpuset management disabled: error="not implemented for cgroup v2 unified hierarchy"
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.642Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.646Z [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="not implemented for cgroup v2 unified hierarchy"
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.693Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.695Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.699Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.705Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens10
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.720Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.720Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:26:59.721Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.116Z [ERROR] client.driver_mgr.exec: failed to reattach to executor: driver=exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=2cdc7213-925b-7b29-8aa1-28f4ad0e03d2/[EDITED]
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.136Z [INFO]  client: started client: node_id=f03bd130-5e77-6809-81f2-5470f161b8d5
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.138Z [INFO]  client.gc: marking allocation for GC: alloc_id=e946c362-18e7-e330-f335-9cbe04ccc5ad
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=32054f2c-da70-2e80-fe9b-6d0ef865fd80
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=78ee19e2-302e-c65e-7dd7-221f911fc9fc
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=ced42f55-4d57-d0de-df46-278440049f0a
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=26d044c4-7388-f49e-4c93-0367b0783bf2
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=5ccc1846-fa40-52d6-4f6f-02d9baa0523b
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=9d3b36fa-7012-9569-d939-b2b102490570
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=a52e1bf7-2c51-a4f4-6bba-43668b8ad84a
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=bb17689f-cc33-e4c0-ef21-abe80f0dd0ac
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.141Z [INFO]  client.gc: marking allocation for GC: alloc_id=0ddd1372-12e8-4c80-260f-b64c712bdee6
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.142Z [INFO]  client.gc: marking allocation for GC: alloc_id=2cdc7213-925b-7b29-8aa1-28f4ad0e03d2
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.142Z [INFO]  client.gc: marking allocation for GC: alloc_id=482d787f-af8a-17fe-74d7-2953d798768e
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.142Z [INFO]  client.gc: marking allocation for GC: alloc_id=27e811a4-9990-8c47-76df-f3806f11bbaa
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.142Z [INFO]  client.gc: marking allocation for GC: alloc_id=cd0083e7-adc0-cb28-f4b4-ad11fde6a550
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.142Z [INFO]  client.gc: marking allocation for GC: alloc_id=9a57aff0-3441-8082-ece8-2ebc4b1ef382
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.987Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.987Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.16.0.2:4647 [Candidate]" term=62
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.992Z [INFO]  nomad.raft: election won: tally=1
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.992Z [INFO]  nomad.raft: entering leader state: leader="Node at 172.16.0.2:4647 [Leader]"
Feb 13 08:27:00 host-name nomad[697]:     2022-02-13T08:27:00.993Z [INFO]  nomad: cluster leadership acquired
Feb 13 08:27:01 host-name nomad[697]:     2022-02-13T08:27:01.079Z [INFO]  client: node registration complete
Feb 13 08:27:09 host-name nomad[697]:     2022-02-13T08:27:09.960Z [INFO]  client: node registration complete
Feb 13 08:27:14 host-name nomad[697]:     2022-02-13T08:27:14.655Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.224Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.225Z [INFO]  agent: (runner) creating watcher
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.228Z [INFO]  agent: (runner) starting
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.231Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/ea8246f8-ad62-8218-7424-ef2d2f765293[EDITED]
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.232Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/ea8246f8-ad62-8218-7424-ef2d2f765293[EDITED]
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.233Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/ea8246f8-ad62-8218-7424-ef2d2f765293[EDITED]
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.299Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=92de7609-8869-9feb-5613-1d8f1d818e59 task=oathkeeper @module=logmon path=/opt/nomad/data/alloc/92de7609-8869-9feb-5613-1d8f1d818e59/alloc/logs/[EDITED]
Feb 13 08:27:18 host-name nomad[697]:     2022-02-13T08:27:18.304Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=92de7609-8869-9feb-5613-1d8f1d818e59 task=oathkeeper @module=logmon path=/opt/nomad/data/alloc/92de7609-8869-9feb-5613-1d8f1d818e59/alloc/logs/[EDITED]
Feb 13 08:30:54 host-name nomad[697]:     2022-02-13T08:30:54.615Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/undefined/stats error="alloc lookup failed: index error: UUID must be 36 characters" code=500
Consul logs
Feb 13 08:26:58 host-name systemd[1]: Started "HashiCorp Consul - A service mesh solution".
Feb 13 08:26:59 host-name consul[689]: ==> Starting Consul agent...
Feb 13 08:26:59 host-name consul[689]:            Version: '1.11.3'
Feb 13 08:26:59 host-name consul[689]:            Node ID: '64ad536f-4aca-61cf-a324-f98f0ed1677e'
Feb 13 08:26:59 host-name consul[689]:          Node name: 'host-name'
Feb 13 08:26:59 host-name consul[689]:         Datacenter: 'dc-name' (Segment: '<all>')
Feb 13 08:26:59 host-name consul[689]:             Server: true (Bootstrap: true)
Feb 13 08:26:59 host-name consul[689]:        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
Feb 13 08:26:59 host-name consul[689]:       Cluster Addr: 172.16.0.2 (LAN: 8301, WAN: 8302)
Feb 13 08:26:59 host-name consul[689]:            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: false
Feb 13 08:26:59 host-name consul[689]: ==> Log data will now stream in as it occurs:
Feb 13 08:26:59 host-name consul[689]: 2022-02-13T08:26:59.366Z [WARN]  agent: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
Feb 13 08:26:59 host-name consul[689]: 2022-02-13T08:26:59.366Z [WARN]  agent: bootstrap = true: do not enable unless necessary
Feb 13 08:26:59 host-name consul[689]: 2022-02-13T08:26:59.417Z [WARN]  agent.auto_config: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
Feb 13 08:26:59 host-name consul[689]: 2022-02-13T08:26:59.417Z [WARN]  agent.auto_config: bootstrap = true: do not enable unless necessary
Feb 13 08:26:59 host-name consul[689]: 2022-02-13T08:26:59.442Z [INFO]  agent.server.raft: restored from snapshot: id=48-1409193-1644602007773
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.011Z [INFO]  agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:64ad536f-4aca-61cf-a324-f98f0ed1677e Address:172.16.0.2:8300}]"
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.011Z [INFO]  agent.server.raft: entering follower state: follower="Node at 172.16.0.2:8300 [Follower]" leader=
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.013Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: host-name.dc-name 172.16.0.2
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.014Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: dc-name-host-name.dc-name: 172.16.0.2:8302
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.015Z [INFO]  agent.server.serf.wan: serf: Re-joined to previously known node: dc-name-host-name.dc-name: 172.16.0.2:8302
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.018Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: host-name 172.16.0.2
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.018Z [INFO]  agent.router: Initializing LAN area manager
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.018Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: dc-name-host-name: 172.16.0.2:8301
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.019Z [INFO]  agent.server.serf.lan: serf: Re-joined to previously known node: dc-name-host-name: 172.16.0.2:8301
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.020Z [INFO]  agent.server: Adding LAN server: server="host-name (Addr: tcp/172.16.0.2:8300) (DC: dc-name)"
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.020Z [INFO]  agent.server: Handled event for server in area: event=member-join server=host-name.dc-name area=wan
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.021Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {dc-name-172.16.0.2:8300 0 host-name <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <nil>->172.16.0.2:8300: operation was canceled". Reconnecting...
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.035Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.036Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.036Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.043Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.044Z [INFO]  agent: Started gRPC server: address=[::]:8502 network=tcp
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.045Z [INFO]  agent: started state syncer
Feb 13 08:27:00 host-name consul[689]: 2022-02-13T08:27:00.045Z [INFO]  agent: Consul agent running!
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.232Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.232Z [INFO]  agent.server.raft: entering candidate state: node="Node at 172.16.0.2:8300 [Candidate]" term=50
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.238Z [INFO]  agent.server.raft: election won: tally=1
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.238Z [INFO]  agent.server.raft: entering leader state: leader="Node at 172.16.0.2:8300 [Leader]"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.238Z [INFO]  agent.server: cluster leadership acquired
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.239Z [INFO]  agent.server: New leader elected: payload=host-name
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.722Z [INFO]  agent: Synced node info
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.723Z [INFO]  agent.server: initializing acls
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.723Z [INFO]  agent.leader: started routine: routine="legacy ACL token upgrade"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.723Z [INFO]  agent.leader: started routine: routine="acl token reaping"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.726Z [INFO]  agent.leader: started routine: routine="federation state anti-entropy"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.726Z [INFO]  agent.leader: started routine: routine="federation state pruning"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  connect.ca: initialized primary datacenter CA from existing CARoot with provider: provider=consul
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.leader: started routine: routine="intermediate cert renew watch"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.leader: started routine: routine="CA root pruning"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.leader: started routine: routine="CA root expiration metric"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.leader: started routine: routine="CA signing expiration metric"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.leader: started routine: routine="virtual IP version check"
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.727Z [INFO]  agent.server: deregistering member: member=c807ea31 partition=default reason=reaped
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.739Z [INFO]  agent: Deregistered service: service=_nomad-task-32054f2c-da70-2e80-fe9b-6d0ef865fd80-group-prometheus-prometheus-prometheus
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.740Z [INFO]  agent: Deregistered service: service=_nomad-task-5ccc1846-fa40-52d6-4f6f-02d9baa0523b-group-envoy-envoy-
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.741Z [INFO]  agent: Synced check: check=_nomad-check-ed7b6ce5bc6c5af7ca61be80e1c836df45b44455
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.743Z [INFO]  agent: Synced check: check=_nomad-check-689ac211032c7cf7fd8003fb2c299ee11fd17a58
Feb 13 08:27:01 host-name consul[689]: 2022-02-13T08:27:01.745Z [INFO]  agent: Synced check: check=_nomad-check-b305d921ca9bcc3f684e0294bcc862256703ff60
Feb 13 08:27:05 host-name consul[689]: 2022-02-13T08:27:05.017Z [INFO]  agent: Synced check: check=_nomad-check-689ac211032c7cf7fd8003fb2c299ee11fd17a58
Feb 13 08:27:05 host-name consul[689]: 2022-02-13T08:27:05.397Z [INFO]  agent: Synced check: check=_nomad-check-b305d921ca9bcc3f684e0294bcc862256703ff60
Feb 13 08:27:05 host-name consul[689]: 2022-02-13T08:27:05.399Z [INFO]  agent: Synced check: check=_nomad-check-ed7b6ce5bc6c5af7ca61be80e1c836df45b44455
Feb 13 08:27:28 host-name consul[689]: 2022-02-13T08:27:28.952Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: c807ea31 172.16.0.3
Feb 13 08:27:28 host-name consul[689]: 2022-02-13T08:27:28.952Z [INFO]  agent.server: member joined, marking health alive: member=c807ea31 partition=default
Feb 13 08:27:34 host-name consul[689]: 2022-02-13T08:27:34.281Z [ERROR] agent.dns: recurse failed: error="read udp 116.203.25.69:36315->1.1.1.1:53: i/o timeout"
Feb 13 08:27:34 host-name consul[689]: 2022-02-13T08:27:34.291Z [ERROR] agent.dns: recurse failed: error="read udp 116.203.25.69:37048->1.1.1.1:53: i/o timeout"
@thatsk

thatsk commented Feb 18, 2022

When I restart it, the real stats show up.

@thatsk

thatsk commented Feb 18, 2022

Why does it suddenly show 0, and only start collecting data again after a restart?

@DerekStrickland DerekStrickland self-assigned this Feb 18, 2022
@DerekStrickland
Contributor

Hi @AlekseyMelikov. Thanks for using Nomad!

I'm sorry you are having issues. Is there any chance you can share your jobspec and agent config(s) with us so we can try to replicate your environment? The host OS/version would be really helpful information too.

@AlekseyMelikov
Author

Sorry, there is a lot of information here; I tried to attach everything that seemed important to me. I can't attach the job specifications yet, but if you really need them to find the problem, let me know.


Host 1

uname -a

Linux 5c24b868 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

lsb_release -a

No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 96:00:00:c2:12:71 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet EXTERNAL-IPv4/32 brd EXTERNAL-IPv4 scope global dynamic eth0
       valid_lft 79973sec preferred_lft 79973sec
    inet6 EXTERNAL-IPv6/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 EXTERNAL-IPv6/64 scope link 
       valid_lft forever preferred_lft forever
3: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 86:00:00:d5:a9:69 brd ff:ff:ff:ff:ff:ff
    altname enp0s10
    inet 172.16.0.2/32 brd 172.16.0.2 scope global dynamic ens10
       valid_lft 69226sec preferred_lft 69226sec
    inet6 fe80::8400:ff:fed5:a969/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:7b:51:73:64 brd ff:ff:ff:ff:ff:ff
    inet 10.17.0.1/16 brd 10.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: netname0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3a:14:bd:12:f7:8e brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.1/16 brd 10.88.255.255 scope global netname0
       valid_lft forever preferred_lft forever
    inet6 fe80::3814:bdff:fe12:f78e/64 scope link 
       valid_lft forever preferred_lft forever
15: veth2a7393f9@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 4e:a6:c5:12:00:ab brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::4ca6:c5ff:fe12:ab/64 scope link 
       valid_lft forever preferred_lft forever
18: vethb07f2451@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether c2:47:72:51:1b:08 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::c047:72ff:fe51:1b08/64 scope link 
       valid_lft forever preferred_lft forever
48: vethef32f761@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 1e:b5:86:6b:d1:4c brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::1cb5:86ff:fe6b:d14c/64 scope link 
       valid_lft forever preferred_lft forever
50: veth2ebfc1c1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 66:d7:73:25:4b:0b brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::d8db:36ff:feda:dcce/64 scope link 
       valid_lft forever preferred_lft forever
51: vethf551528a@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 26:e8:48:31:bd:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::e01b:84ff:fed6:d4d0/64 scope link 
       valid_lft forever preferred_lft forever
52: veth07d531dd@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 4e:71:77:e9:e6:20 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::8c64:d1ff:fe40:ccac/64 scope link 
       valid_lft forever preferred_lft forever
53: veth539fa97d@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 5a:f2:f1:28:53:22 brd ff:ff:ff:ff:ff:ff link-netnsid 7
    inet6 fe80::58f2:f1ff:fe28:5322/64 scope link 
       valid_lft forever preferred_lft forever
54: vethf0d9cd78@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 46:3e:87:7d:81:0f brd ff:ff:ff:ff:ff:ff link-netnsid 10
    inet6 fe80::d8a0:3bff:febc:95fb/64 scope link 
       valid_lft forever preferred_lft forever
55: veth13307f4a@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether fe:f8:3e:27:b9:16 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::f093:c2ff:fef0:9c95/64 scope link 
       valid_lft forever preferred_lft forever
56: veth49949bd2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether a2:cb:bd:45:8b:12 brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::4424:71ff:fe3d:81f4/64 scope link 
       valid_lft forever preferred_lft forever
57: vethf667613f@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether fa:9a:de:81:5b:e0 brd ff:ff:ff:ff:ff:ff link-netnsid 8
    inet6 fe80::6031:5ff:fe9a:8b6b/64 scope link 
       valid_lft forever preferred_lft forever
58: vethcdaaf6bb@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether a6:af:2c:23:df:7a brd ff:ff:ff:ff:ff:ff link-netnsid 11
    inet6 fe80::78c7:cdff:fe10:e58a/64 scope link 
       valid_lft forever preferred_lft forever
59: veth0d6b1fc0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 9a:47:6e:3c:31:60 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::e0be:44ff:feb7:a45c/64 scope link 
       valid_lft forever preferred_lft forever

ip r

default via 172.31.1.1 dev eth0 
10.17.0.0/16 dev docker0 proto kernel scope link src 10.17.0.1 linkdown 
10.88.0.0/16 dev netname0 proto kernel scope link src 10.88.0.1 
172.16.0.0/16 via 172.16.0.1 dev ens10 
172.16.0.1 dev ens10 scope link 
172.31.1.1 dev eth0 scope link

nft --version

nftables v0.9.8 (E.D.S.)
/etc/nftables.conf
flush ruleset

include "/etc/nftables/filter.nft"
include "/etc/nftables/nat.nft"
/etc/nftables/filter.nft
table ip filter {
        chain prerouting {
                type filter hook prerouting priority filter; policy accept
                ct state established,related counter packets 0 bytes 0 accept
                iifname "lo" counter packets 0 bytes 0 accept
                ip saddr 10.17.0.0/16 counter packets 0 bytes 0 accept
                ip saddr 10.88.0.0/16 counter packets 0 bytes 0 accept
                ip saddr 172.16.0.0/16 counter packets 0 bytes 0 accept
                tcp dport 22 counter packets 0 bytes 0 accept
                # more port accepting
                counter packets 0 bytes 0 drop
        }

        chain input {
                type filter hook input priority filter; policy accept;
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
        }

        chain output {
                type filter hook output priority filter; policy accept;
        }

        chain postrouting {
                type filter hook postrouting priority filter; policy accept;
        }
}

table ip6 filter {
        chain prerouting {
                type filter hook prerouting priority filter; policy accept;
                counter packets 0 bytes 0 drop
        }

        chain input {
                type filter hook input priority filter; policy accept;
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
        }

        chain output {
                type filter hook output priority filter; policy accept;
        }

        chain postrouting {
                type filter hook postrouting priority filter; policy accept;
        }
}

/etc/nftables/nat.nft
table ip nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
                ip saddr { 10.17.0.0/16, 10.88.0.0/16, 172.16.0.0/16 } tcp dport 53 counter packets 0 bytes 0 redirect to 8600
                ip saddr { 10.17.0.0/16, 10.88.0.0/16, 172.16.0.0/16 } udp dport 53 counter packets 0 bytes 0 redirect to 8600
        }

        chain input {
                type nat hook input priority 100; policy accept;
        }

        chain output {
                type nat hook output priority -100; policy accept;
                ip daddr != { 1.1.1.1, 8.8.8.8, 9.9.9.9 } tcp dport 53 counter packets 0 bytes 0 redirect to 8600
                ip daddr != { 1.1.1.1, 8.8.8.8, 9.9.9.9 } udp dport 53 counter packets 0 bytes 0 redirect to 8600
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
        }
}

table ip6 nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
        }

        chain input {
                type nat hook input priority 100; policy accept;
        }

        chain output {
                type nat hook output priority -100; policy accept;
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
        }
}

Docker

Docker version 20.10.12, build e91ed57
/etc/docker/daemon.json
{
        "bip": "10.17.0.1/16",
        "iptables": false
}

Consul

Consul v1.11.3 Revision e319d7ed Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
/etc/consul.d/consul.hcl
datacenter = "dc-name"

data_dir = "/opt/consul"

encrypt = "KEY"

acl = {
  enabled = true
  default_policy = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

telemetry {
  prometheus_retention_time = "10s"
}

bind_addr = "0.0.0.0"
advertise_addr = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"

client_addr = "0.0.0.0"
recursors = ["1.1.1.1","8.8.8.8","9.9.9.9"]

ports {
  grpc = 8502
}

connect {
  enabled = true
}
/etc/consul.d/server.hcl
server = true
bootstrap_expect = 1

ui_config = {
  enabled = true
}

ca_file = "/etc/consul.d/consul-agent-ca.pem"
cert_file = "/etc/consul.d/dc-name-server-consul-0.pem"
key_file = "/etc/consul.d/dc-name-server-consul-0-key.pem"

verify_incoming = true
verify_outgoing = true
verify_server_hostname = true

Nomad

Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)
/etc/nomad.d/nomad.hcl
datacenter = "dc-name"

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"

bind_addr = "0.0.0.0"
advertise {
  http = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
  rpc  = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
  serf = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics = true
  prometheus_metrics = true
}
/etc/nomad.d/server.hcl
server {
  enabled = true
  bootstrap_expect = 1
}
/etc/nomad.d/client.hcl
client {
  enabled = true

  node_class = "general"

  host_network "private" {
    cidr = "172.16.0.0/16"
  }

  host_volume "mariadb-data" {
    path = "/mnt/mariadb/data"
    read_only = false
  }

  host_volume "postgres-data" {
    path = "/mnt/postgres/data"
    read_only = false
  }

  # and other host volumes
}
/opt/cni/config/netname.conflist
{
  "cniVersion": "0.4.0",
  "name": "netname",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "netname0",
      "isGateway": true,
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ],
        "ranges": [
          [
            {
              "subnet": "10.88.0.0/16",
              "gateway": "10.88.0.1"
            }
          ]
        ]
      }
    },
    {
      "type": "cni-nftables-portmap",
      "capabilities": {
        "portMappings": true
      }
    },
    {
      "type": "cni-nftables-firewall",
      "forward_chain_name": "forward"
    }
  ]
}

Using the patched nftables-compatible CNI plugin from here.


Host 2

uname -a

Linux 5c24b868 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

lsb_release -a

No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 96:00:00:d1:cc:79 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet EXTERNAL-IPv4/32 brd EXTERNAL-IPv4 scope global dynamic eth0
       valid_lft 78684sec preferred_lft 78684sec
    inet6 EXTERNAL-IPv6/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 EXTERNAL-IPv6/64 scope link 
       valid_lft forever preferred_lft forever
3: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 86:00:00:d5:a9:6c brd ff:ff:ff:ff:ff:ff
    altname enp0s10
    inet 172.16.0.3/32 brd 172.16.0.3 scope global dynamic ens10
       valid_lft 54198sec preferred_lft 54198sec
    inet6 fe80::8400:ff:fed5:a96c/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:3b:77:db:3a brd ff:ff:ff:ff:ff:ff
    inet 10.17.0.1/16 brd 10.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: netname0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c2:b8:a6:78:a0:60 brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.1/16 brd 10.88.255.255 scope global netname0
       valid_lft forever preferred_lft forever
    inet6 fe80::c0b8:a6ff:fe78:a060/64 scope link 
       valid_lft forever preferred_lft forever
8: vetha7e1c925@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether 1e:12:cc:5b:db:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::14be:e8ff:febd:9989/64 scope link 
       valid_lft forever preferred_lft forever
9: veth62b13ddb@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master netname0 state UP group default 
    link/ether e2:bc:97:f5:09:93 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::e0bc:97ff:fef5:993/64 scope link 
       valid_lft forever preferred_lft forever

ip r

default via 172.31.1.1 dev eth0 
10.17.0.0/16 dev docker0 proto kernel scope link src 10.17.0.1 linkdown 
10.88.0.0/16 dev netname0 proto kernel scope link src 10.88.0.1 
172.16.0.0/16 via 172.16.0.1 dev ens10 
172.16.0.1 dev ens10 scope link 
172.31.1.1 dev eth0 scope link

nft --version

nftables v0.9.8 (E.D.S.)
/etc/nftables.conf
flush ruleset

include "/etc/nftables/filter.nft"
include "/etc/nftables/nat.nft"
/etc/nftables/filter.nft
table ip filter {
        chain prerouting {
                type filter hook prerouting priority filter; policy accept
                ct state established,related counter packets 0 bytes 0 accept
                iifname "lo" counter packets 0 bytes 0 accept
                ip saddr 10.17.0.0/16 counter packets 0 bytes 0 accept
                ip saddr 10.88.0.0/16 counter packets 0 bytes 0 accept
                ip saddr 172.16.0.0/16 counter packets 0 bytes 0 accept
                tcp dport 22 counter packets 0 bytes 0 accept
                # more port accepting
                counter packets 0 bytes 0 drop
        }

        chain input {
                type filter hook input priority filter; policy accept;
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
        }

        chain output {
                type filter hook output priority filter; policy accept;
        }

        chain postrouting {
                type filter hook postrouting priority filter; policy accept;
        }
}

table ip6 filter {
        chain prerouting {
                type filter hook prerouting priority filter; policy accept;
                counter packets 0 bytes 0 drop
        }

        chain input {
                type filter hook input priority filter; policy accept;
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
        }

        chain output {
                type filter hook output priority filter; policy accept;
        }

        chain postrouting {
                type filter hook postrouting priority filter; policy accept;
        }
}

/etc/nftables/nat.nft
table ip nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
                ip saddr { 10.17.0.0/16, 10.88.0.0/16, 172.16.0.0/16 } tcp dport 53 counter packets 0 bytes 0 redirect to 8600
                ip saddr { 10.17.0.0/16, 10.88.0.0/16, 172.16.0.0/16 } udp dport 53 counter packets 0 bytes 0 redirect to 8600
        }

        chain input {
                type nat hook input priority 100; policy accept;
        }

        chain output {
                type nat hook output priority -100; policy accept;
                ip daddr != { 1.1.1.1, 8.8.8.8, 9.9.9.9 } tcp dport 53 counter packets 0 bytes 0 redirect to 8600
                ip daddr != { 1.1.1.1, 8.8.8.8, 9.9.9.9 } udp dport 53 counter packets 0 bytes 0 redirect to 8600
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
        }
}

table ip6 nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
        }

        chain input {
                type nat hook input priority 100; policy accept;
        }

        chain output {
                type nat hook output priority -100; policy accept;
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
        }
}

Docker

Docker version 20.10.12, build e91ed57
/etc/docker/daemon.json
{
        "bip": "10.17.0.1/16",
        "iptables": false
}

Consul

Consul v1.11.3 Revision e319d7ed Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
/etc/consul.d/consul.hcl
datacenter = "dc-name"

data_dir = "/opt/consul"

encrypt = "KEY"

acl = {
  enabled = true
  default_policy = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

telemetry {
  prometheus_retention_time = "10s"
}

bind_addr = "0.0.0.0"
advertise_addr = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"

client_addr = "0.0.0.0"
recursors = ["1.1.1.1","8.8.8.8","9.9.9.9"]

ports {
  grpc = 8502
}

connect {
  enabled = true
}
/etc/consul.d/client.hcl
retry_join = ["172.16.0.2"]

ca_file = "/etc/consul.d/consul-agent-ca.pem"
cert_file = "/etc/consul.d/dc-name-client-consul-0.pem"
key_file = "/etc/consul.d/dc-name-client-consul-0-key.pem"

verify_incoming = true
verify_outgoing = true
verify_server_hostname = true

Nomad

Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)
/etc/nomad.d/nomad.hcl
datacenter = "dc-name"

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"

bind_addr = "0.0.0.0"
advertise {
  http = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
  rpc  = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
  serf = "{{ GetPrivateInterfaces | include \"network\" \"172.16.0.0/16\" | attr \"address\" }}"
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics = true
  prometheus_metrics = true
}
/etc/nomad.d/client.hcl
client {
  enabled = true

  node_class = "specific"

  meta {
    allow.allocations = "gitlab,container-registry"
  }

  host_network "private" {
    cidr = "172.16.0.0/16"
  }

  host_volume "gitlab-data" {
    path = "/mnt/gitlab/data"
    read_only = false
  }

  host_volume "container-registry-data" {
    path = "/mnt/container-registry/data"
    read_only = false
  }

  host_volume "container-registry-certs" {
    path = "/mnt/container-registry/certs"
    read_only = false
  }
}
/opt/cni/config/netname.conflist
{
  "cniVersion": "0.4.0",
  "name": "netname",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "netname0",
      "isGateway": true,
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ],
        "ranges": [
          [
            {
              "subnet": "10.88.0.0/16",
              "gateway": "10.88.0.1"
            }
          ]
        ]
      }
    },
    {
      "type": "cni-nftables-portmap",
      "capabilities": {
        "portMappings": true
      }
    },
    {
      "type": "cni-nftables-firewall",
      "forward_chain_name": "forward"
    }
  ]
}

Using the patched nftables-compatible CNI plugin from here.


If you need more information, let me know.

@DerekStrickland DerekStrickland removed their assignment Feb 21, 2022
@lgfa29
Contributor

lgfa29 commented Feb 22, 2022

Thanks for the extra info @AlekseyMelikov.

I believe the problem is that Debian 11 and Docker 20.10 switched to using cgroups v2 by default, which is not yet supported by Nomad.

If you need access to these stats, you can switch your OS back to cgroups v1. Here are some steps on how to do that:
https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version (note that for v1 you need to set the value to 0).

I will leave the issue open so we can double-check that this is fixed once cgroups v2 support lands in Nomad, but I will edit the title a bit to reflect that.
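
For reference, a minimal Go sketch (not Nomad's actual implementation) of the kind of check behind the "not implemented for cgroup v2 unified hierarchy" warnings in the client logs above. A host booted into the pure v2 unified hierarchy always exposes a cgroup.controllers file at the root of the cgroup mount:

package main

import (
	"fmt"
	"os"
)

func main() {
	// On a cgroup v2-only host (the Debian 11 / Docker 20.10 default),
	// /sys/fs/cgroup is a cgroup2 mount and this file exists.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroup v2 unified hierarchy")
	} else {
		fmt.Println("cgroup v1 (or hybrid) hierarchy")
	}
}

This is consistent with the /proc/cgroups output above, where every controller is listed with hierarchy 0, since v2 controllers are not attached to numbered v1 hierarchies.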

@lgfa29 lgfa29 added the stage/accepted, theme/cgroups, and type/enhancement labels and removed the type/bug label Feb 22, 2022
@lgfa29 lgfa29 changed the title Memory Usage of any Allocation is always 0 bytes Memory Usage of any Allocation is always 0 bytes when using cgroups v2 Feb 22, 2022
@AlekseyMelikov
Author

Thank you @lgfa29, this solution worked.

All I did was:

  • add "systemd.unified_cgroup_hierarchy=0" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
  • run "update-grub && reboot now"

@shoenig
Member

shoenig commented Apr 5, 2022

I believe this should be fixed in v1.3 (coming soon!), which adds support for cgroups v2. Feel free to re-open if things are still misbehaving with Nomad v1.3.

@shoenig shoenig closed this as completed Apr 5, 2022
@shoenig
Member

shoenig commented Apr 5, 2022

Edit: sorry, actually this is still broken, e.g. nomad alloc status -stats <id> on an exec task:

➜ mount -l | grep cgroup 
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Memory Stats
Cache  Swap  Usage
0 B    0 B   8.3 MiB
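
Under cgroups v2 there is no dedicated "cache" file in the memory interface: total usage lives in memory.current and page cache is reported as the "file" key inside memory.stat, so a Cache column of 0 B next to a non-zero Usage is not surprising on a v2 host. A rough Go sketch of reading those files directly, using a hypothetical cgroup directory (the real per-alloc path depends on how the client lays out its cgroups):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical cgroup directory, for illustration only.
	dir := "/sys/fs/cgroup/nomad.slice"

	// memory.current holds the cgroup's total memory usage in bytes.
	if cur, err := os.ReadFile(dir + "/memory.current"); err == nil {
		fmt.Printf("usage bytes: %s", cur)
	}

	// Page cache is reported as "file" and anonymous memory as "anon".
	f, err := os.Open(dir + "/memory.stat")
	if err != nil {
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "file ") || strings.HasPrefix(line, "anon ") {
			fmt.Println(line)
		}
	}
}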

@shoenig shoenig reopened this Apr 5, 2022
@shoenig
Member

shoenig commented Apr 6, 2022

I was looking at an alloc that wasn't doing anything; these values seem to be reported fine 🤦
Sorry for the comment spam.

@shoenig shoenig closed this as completed Apr 6, 2022
@cr0c0dylus

I have the same problem with Nomad 1.3.1 on Ubuntu 22.04 LTS.

(screenshot: Screenshot_20220527_155518)

@cr0c0dylus

sudo vi /etc/default/grub
# add "systemd.unified_cgroup_hierarchy=0" to the line GRUB_CMDLINE_LINUX
sudo update-grub
sudo reboot

This fixes the problem.

@mr-karan
Contributor

mr-karan commented Jun 3, 2022

I can confirm that the stats reported by the Nomad CLI are fine; however, the web UI still reports 0 bytes. These are alloc stats from a task running the exec driver:

Task Resources
CPU          Memory           Disk     Addresses
822/100 MHz  287 MiB/300 MiB  300 MiB  

Memory Stats
Cache  Swap     Usage
0 B    287 MiB  287 MiB

I don't think switching back to cgroups v1 (which is what adding systemd.unified_cgroup_hierarchy=0 to GRUB_CMDLINE_LINUX does) is a good fix just because the frontend isn't rendering these stats properly.

cc @shoenig

@jhillyerd

jhillyerd commented Jun 6, 2022

I am also seeing this on 1.3.1. I can see the memory usage being sent in the Chrome debugger's network tab, although many of the other memory-related fields are zero.

{
  "ResourceUsage": {
    "CpuStats": {
      "Measured": [
        "Throttled Periods",
        "Throttled Time",
        "Percent"
      ],
      "Percent": 0,
      "SystemMode": 0,
      "ThrottledPeriods": 0,
      "ThrottledTime": 0,
      "TotalTicks": 0,
      "UserMode": 0
    },
    "DeviceStats": [],
    "MemoryStats": {
      "Cache": 0,
      "KernelMaxUsage": 0,
      "KernelUsage": 0,
      "MappedFile": 0,
      "MaxUsage": 0,
      "Measured": [
        "Cache",
        "Swap",
        "Usage"
      ],
      "RSS": 0,
      "Swap": 0,
      "Usage": 218800128
    }
  },
  "Tasks": {
    "gitea": {
      "Pids": null,
      "ResourceUsage": {
        "CpuStats": {
          "Measured": [
            "Throttled Periods",
            "Throttled Time",
            "Percent"
          ],
          "Percent": 0,
          "SystemMode": 0,
          "ThrottledPeriods": 0,
          "ThrottledTime": 0,
          "TotalTicks": 0,
          "UserMode": 0
        },
        "DeviceStats": null,
        "MemoryStats": {
          "Cache": 0,
          "KernelMaxUsage": 0,
          "KernelUsage": 0,
          "MappedFile": 0,
          "MaxUsage": 0,
          "Measured": [
            "Cache",
            "Swap",
            "Usage"
          ],
          "RSS": 0,
          "Swap": 0,
          "Usage": 218800128
        }
      },
      "Timestamp": 1654558405635945500
    }
  },
  "Timestamp": 1654558405635945500
}

nomad alloc status output for the same allocation:

Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  208 MiB/768 MiB  300 MiB

@devyn

devyn commented Jul 8, 2022

I can confirm that I'm also able to see the memory usage with nomad alloc status but not with the Nomad UI, on Ubuntu 20.04 LTS.

I'm guessing this is a separate issue related to the UI itself, though; it probably needs to fall back to the key that is available when cgroups v2 is in use.
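
A minimal sketch of that fallback, assuming the field names from the stats payload posted above (MemoryStats.RSS, Usage and Measured); this is not the UI's actual code, just the shape of the check a consumer of the stats API could apply:

package main

import "fmt"

// MemoryStats mirrors the fields visible in the JSON payload above.
type MemoryStats struct {
	RSS      uint64
	Usage    uint64
	Measured []string
}

// displayedMemory prefers RSS when the driver measured it (typical on
// cgroups v1) and otherwise falls back to Usage, which is the counter
// that remains populated under cgroups v2.
func displayedMemory(m MemoryStats) uint64 {
	for _, k := range m.Measured {
		if k == "RSS" && m.RSS > 0 {
			return m.RSS
		}
	}
	return m.Usage
}

func main() {
	v2 := MemoryStats{RSS: 0, Usage: 218800128, Measured: []string{"Cache", "Swap", "Usage"}}
	fmt.Println(displayedMemory(v2)) // prints 218800128 rather than 0
}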

@lgfa29
Contributor

lgfa29 commented Jul 11, 2022

Hi @devyn 👋

Yes, there's additional work that needs to be done in the UI to fix this 100%. The UI work is being tracked in #13023.

@github-actions

github-actions bot commented Nov 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2022