
Every Nomad executor queries all Consul services and health checks every 5 seconds. #1995

Closed
parasyte opened this issue Nov 15, 2016 · 6 comments

Comments

@parasyte (Contributor)

Nomad version

Nomad v0.4.2

Operating system and Environment details

$ uname -a
Linux ~ 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

The cluster is running on AWS EC2-VPC.

Issue

Every 4-5 seconds, each Nomad executor process queries Consul's services endpoint and health checks endpoint with no filtering, fetching the full list of all services and health checks known to the local Consul agent.

This creates a quadratic scaling problem: n executors each repeatedly fetch information about all n services.
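To make the pattern concrete, here is a minimal sketch of what each executor is effectively doing against the Consul agent API, written with the official Consul Go client (github.com/hashicorp/consul/api). It is an illustration of the observed behavior, not Nomad's actual executor code; the 5-second interval and the unfiltered endpoints are what the tcpdump below shows.

package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	for range time.Tick(5 * time.Second) {
		// GET /v1/agent/services -- returns every service registered with the agent.
		services, err := agent.Services()
		if err != nil {
			log.Println("services:", err)
			continue
		}
		// GET /v1/agent/checks -- returns every check registered with the agent.
		checks, err := agent.Checks()
		if err != nil {
			log.Println("checks:", err)
			continue
		}
		log.Printf("fetched %d services and %d checks", len(services), len(checks))
	}
}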

Here's an example TCP dump session, watching requests from a unique client port:

$ sudo tcpdump -i lo -AAn port 8500 | egrep -A7 '\.22065 > .* \[P'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
22:25:41.465102 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 1209640838:1209640955, ack 3740529286, win 2431, options [nop,nop,TS val 414530103 ecr 414528937], length 117
..............E...U.@.@.....g...g.V1!4H.........        .(;.....
..:7..5.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:41.481674 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 117:232, ack 104037, win 2431, options [nop,nop,TS val 414530107 ecr 414530104], length 115
..............E...U3@.@..~..g...g.V1!4H.........        .(9.....
..:;..:8GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.286527 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 232:349, ack 105112, win 2431, options [nop,nop,TS val 414531308 ecr 414530107], length 117
..............E...U5@.@..z..g...g.V1!4H..n......        .(;.....
..>...:;GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.295171 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 349:464, ack 209148, win 2431, options [nop,nop,TS val 414531310 ecr 414531308], length 115
..............E...U:@.@..w..g...g.V1!4H.....#...        .(9.....
..>...>.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.038255 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 464:581, ack 210223, win 2431, options [nop,nop,TS val 414532496 ecr 414531310], length 117
..............E...U<@.@..s..g...g.V1!4H..V..'...        .(;.....
..C...>.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.052174 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 581:696, ack 314259, win 2431, options [nop,nop,TS val 414532500 ecr 414532496], length 115
..............E...UA@.@..p..g...g.V1!4H.........        .(9.....
..C...C.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


^C7695 packets captured
15406 packets received by filter
0 packets dropped by kernel

The client port maps back to an executor:

$ sudo lsof -i -n -P | grep :22065
nomad    12083   root   14u  IPv4 4942373      0t0  TCP 172.16.103.191:22065->172.16.103.191:8500 (ESTABLISHED)
consul   16230   root   44u  IPv6 4940746      0t0  TCP 172.16.103.191:8500->172.16.103.191:22065 (ESTABLISHED)

$ ps aux | grep 12083 | grep -v grep
root     12083  0.3  0.0 559344 17888 ?        S<sl 21:34   0:15 /usr/local/bin/nomad executor /opt/nomad/mnt/9fb55daf-6d92-3dc9-6cda-d128db32673d/event_handler/event_handler-executor.out

I'm running a job with Count: 800 across 4 Nomad client machines, and the services response is fairly large:

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/services
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/services HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:30:50 GMT
< Transfer-Encoding: chunked
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 103906

The health checks response is very small because I've disabled health checks on this big job while running the test. (Our initial assumption was that health checks were responsible for the CPU load and lo interface pressure.)

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/checks
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/checks HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:47:58 GMT
< Content-Length: 966
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 966

Observations:

  • The Accept-Encoding header is ignored (no compression on the response).

  • 800 Count / 4 clients = 200 tasks on this host:

    $ nomad node-status -self | grep running | wc -l
    202
    
  • 200 executors / 5 seconds = 40 requests per second

  • 40 requests per second * 100 KB = 4 MB/s

  • 4 MB/s * 8 bits = 32 Mbit/s over localhost (see the quick calculation at the end of these observations)

Librato Metrics agrees:

[Screenshot: Librato Metrics graph of loopback traffic, captured 2016-11-15 2:35 PM]

  • The 800 tasks themselves are idle (each is just a bash script: while true ; do sleep 60 ; done)

  • Nomad executors and Consul consume a steady 25% CPU (on each core):

    Tasks: 442; 2 running
    Load average: 1.97 1.95 2.69
    Uptime: 19 days, 04:58:27
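Plugging the measured ~104 KB services payload into the bandwidth arithmetic above gives the same figure (throwaway calculation, nothing Nomad-specific):

package main

import "fmt"

func main() {
	const (
		executors     = 200      // tasks on this client
		intervalSecs  = 5.0      // observed polling interval
		responseBytes = 103906.0 // measured /v1/agent/services payload
	)
	reqPerSec := executors / intervalSecs // 40 services requests per second
	mbitPerSec := reqPerSec * responseBytes * 8 / 1e6
	fmt.Printf("%.0f req/s ~ %.1f Mbit/s over lo\n", reqPerSec, mbitPerSec) // ~33 Mbit/s
}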
    

Reproduction steps

  • Run any job with a high count, e.g. 200 or more executors per Nomad client. (Try with and without TCP/HTTP health checks.)
  • Observe CPU utilization from Nomad executors and Consul.
  • It appears all the CPU is being consumed by retransmitting cacheable data at a pretty high frequency.
  • Adding health checks to the job spec makes the problem worse, due to the periodic fetches asking for all health checks in addition to all services. In our tests, the lo traffic increases from ~36 Mbit/s to ~54 Mbit/s when health checks are included.
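For reference, a health check of the kind referred to above is attached to the service in the job's JSON roughly like this (field names reflect our recollection of the Nomad job API of this era, with durations in nanoseconds as elsewhere in the job file; treat it as an illustrative sketch and check the docs for your Nomad version):

"Checks": [
    {
        "Id": "",
        "Name": "alive",
        "Type": "tcp",
        "Command": "",
        "Path": "",
        "Protocol": "",
        "Interval": 10000000000,
        "Timeout": 2000000000
    }
]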

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

Job file (if appropriate)

{
    "Job": {
        "Region": "global",
        "ID": "foobar",
        "Name": "foobar",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "my-dc"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "event_handler_1",
                "Count": 800,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "event_handler",
                        "Driver": "exec",
                        "User": "",
                        "Config": {
                            "args": [
                                "-c",
                                "while true ; do sleep 60 ; done"
                            ],
                            "command": "/bin/bash"
                        },
                        "Constraints": null,
                        "Env": null,
                        "Services": [
                            {
                                "Id": "",
                                "Name": "foobar",
                                "Tags": [],
                                "PortLabel": "http",
                                "Checks": null
                            }
                        ],
                        "Resources": {
                            "CPU": 30,
                            "MemoryMB": 100,
                            "DiskMB": 100,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 5,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 5,
                    "Delay": 5000000000,
                    "Mode": "delay"
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null
    }
}
@schmichael (Member)

Definitely needs improving. Might be related to #1968 - still investigating that one.

kak-tus mentioned this issue Jan 9, 2017
@parasyte (Contributor, Author) commented Mar 3, 2017:

@schmichael Have you made any progress on this? Pinging because we have Nomad clients in production pushing 167 Mbps over lo while doing only 0.98 Mbps over eth0. In other words, this bug is generating roughly 160x more traffic than the tasks we're running in Nomad. The real cost here is CPU, since lo isn't I/O-bound.

@schmichael (Member)

@parasyte The good news is: yes! Much progress has been made! The bad news is that it won't land in time for 0.5.5 (which should be out any day now).

Definitely going to land in the next release though. Unless you use a lot of script checks, it should dramatically reduce Consul chatter. Since script checks have to heartbeat to Consul there's no getting around that traffic, but at least you can tune how often those run yourself (and the traffic should always incur far less CPU than exec'ing a script anyway).

I'll try to post binaries here when the PR is up if anyone wants to test. If it gets merged soon enough after 0.5.5 you may just want to run it, since there's really no workaround for the current Consul implementation's excessively chatty nature.

(If you care about the implementation details: the existing implementation runs a periodic per-task reconcile loop against Consul. It's a far more defensive approach than is needed, so I'm moving to a more straightforward talk-to-Consul-only-on-state-change approach.)
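In rough pseudocode, the shift looks something like the sketch below (illustrative names only, not the actual Nomad identifiers):

package sketch

import "time"

// Illustrative stand-ins only; none of these names are Nomad's real identifiers.
type ServiceClient struct{}

func (sc *ServiceClient) SyncAgainstConsul()           {} // would list ALL agent services/checks, then diff
func (sc *ServiceClient) Register(services []string)   {}
func (sc *ServiceClient) Deregister(services []string) {}

// Old approach: each executor reconciles against Consul on a timer, so every
// few seconds it re-fetches the agent's entire service and check lists.
func reconcileLoop(sc *ServiceClient, stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			sc.SyncAgainstConsul()
		case <-stop:
			return
		}
	}
}

// New approach: the client agent registers a task's services once when it
// starts and deregisters them once when it stops; Consul is only contacted
// on state changes.
func onTaskStarted(sc *ServiceClient, services []string) { sc.Register(services) }
func onTaskStopped(sc *ServiceClient, services []string) { sc.Deregister(services) }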

@parasyte (Contributor, Author) commented Mar 4, 2017:

Awesome!!! This sounds great, actually. I'm really looking forward to it. The description you provided matches my observations. Thanks for the update!

schmichael added a commit that referenced this issue Mar 30, 2017
Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task-related operations. When upgrading from an
earlier version of Nomad, existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
arbitrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul-related behavior.
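As a rough sketch of the driver-side capability the message describes (the interface name and signature here are assumptions for illustration, not necessarily the code that landed):

package sketch

import "context"

// ScriptExecutor is an illustrative name for what the commit message calls an
// "Exec method": a way for the client to run an arbitrary command inside a
// task's environment so script checks can be evaluated there.
type ScriptExecutor interface {
	// Exec runs cmd with args inside the task environment and returns the
	// combined output, the exit code, and any execution error; the exit code
	// maps onto the check's passing/warning/critical status.
	Exec(ctx context.Context, cmd string, args []string) (output []byte, exitCode int, err error)
}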
schmichael added further commits with the same message that referenced this issue on Apr 14 and Apr 19, 2017.
@schmichael (Member)

Fixed in master by #2467 -- will be in the 0.6.0 release. Actually demo'd this particular fix internally at HashiCorp! For 250 tasks with checks running on a single node, loopback traffic went from 60 Mbps in 0.5.6 to 60 Kbps.

CPU usage was significantly lower, but still not as low as we'd like. Some quick profiling revealed task stats collection as the next most likely CPU hog on idle systems.

If you have the time/ability/desire to test the Consul fix I've attached a binary here:

#2474 (comment)

schmichael added this to the v0.6.0 milestone Apr 20, 2017