
Every Nomad executor queries all Consul services and health checks every 5 seconds. #1995

Closed
parasyte opened this issue Nov 15, 2016 · 6 comments

Comments

@parasyte (Contributor)

Nomad version

Nomad v0.4.2

Operating system and Environment details

$ uname -a
Linux ~ 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

The cluster is running on AWS EC2-VPC.

Issue

Every 4-5 seconds, each Nomad executor process queries Consul's services endpoint and health checks endpoint with no filtering, fetching the full list of all services and health checks known to the local Consul agent.

This creates a quadratic scaling problem: n executors each repeatedly fetch information about all n services.
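To make the pattern concrete, here is a minimal sketch of what each executor is effectively doing against the Consul agent API, written with the official Consul Go client (github.com/hashicorp/consul/api). It is an illustration of the observed behavior, not Nomad's actual executor code; the 5-second interval and the unfiltered endpoints are what the tcpdump below shows.

package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	for range time.Tick(5 * time.Second) {
		// GET /v1/agent/services -- returns every service registered with the agent.
		services, err := agent.Services()
		if err != nil {
			log.Println("services:", err)
			continue
		}
		// GET /v1/agent/checks -- returns every check registered with the agent.
		checks, err := agent.Checks()
		if err != nil {
			log.Println("checks:", err)
			continue
		}
		log.Printf("fetched %d services and %d checks", len(services), len(checks))
	}
}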

Here's an example TCP dump session, watching requests from a unique client port:

$ sudo tcpdump -i lo -AAn port 8500 | egrep -A7 '\.22065 > .* \[P'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
22:25:41.465102 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 1209640838:1209640955, ack 3740529286, win 2431, options [nop,nop,TS val 414530103 ecr 414528937], length 117
..............E...U.@.@.....g...g.V1!4H.........        .(;.....
..:7..5.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:41.481674 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 117:232, ack 104037, win 2431, options [nop,nop,TS val 414530107 ecr 414530104], length 115
..............E...U3@.@..~..g...g.V1!4H.........        .(9.....
..:;..:8GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.286527 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 232:349, ack 105112, win 2431, options [nop,nop,TS val 414531308 ecr 414530107], length 117
..............E...U5@.@..z..g...g.V1!4H..n......        .(;.....
..>...:;GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.295171 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 349:464, ack 209148, win 2431, options [nop,nop,TS val 414531310 ecr 414531308], length 115
..............E...U:@.@..w..g...g.V1!4H.....#...        .(9.....
..>...>.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.038255 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 464:581, ack 210223, win 2431, options [nop,nop,TS val 414532496 ecr 414531310], length 117
..............E...U<@.@..s..g...g.V1!4H..V..'...        .(;.....
..C...>.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.052174 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 581:696, ack 314259, win 2431, options [nop,nop,TS val 414532500 ecr 414532496], length 115
..............E...UA@.@..p..g...g.V1!4H.........        .(9.....
..C...C.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


^C7695 packets captured
15406 packets received by filter
0 packets dropped by kernel

The client port maps back to an executor:

$ sudo lsof -i -n -P | grep :22065
nomad    12083   root   14u  IPv4 4942373      0t0  TCP 172.16.103.191:22065->172.16.103.191:8500 (ESTABLISHED)
consul   16230   root   44u  IPv6 4940746      0t0  TCP 172.16.103.191:8500->172.16.103.191:22065 (ESTABLISHED)

$ ps aux | grep 12083 | grep -v grep
root     12083  0.3  0.0 559344 17888 ?        S<sl 21:34   0:15 /usr/local/bin/nomad executor /opt/nomad/mnt/9fb55daf-6d92-3dc9-6cda-d128db32673d/event_handler/event_handler-executor.out

I'm running a job with Count: 800 across 4 Nomad client machines, and the services response is fairly large:

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/services
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/services HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:30:50 GMT
< Transfer-Encoding: chunked
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 103906

The health checks response is very small because I've disabled health checks on this big job while running the test. (Our initial assumption was that health checks were responsible for the CPU load and lo interface pressure.)

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/checks
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/checks HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:47:58 GMT
< Content-Length: 966
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 966

Observations:

  • The Accept-Encoding header is ignored (no compression on the response).

  • 800 Count / 4 clients = 200 tasks on this host:

    $ nomad node-status -self | grep running | wc -l
    202
    
  • 200 executors / 5 seconds = 40 requests per second

  • 40 requests per second * 100 KB = 4 MB/s

  • 4 MB/s * 8 bits = 32 Mbit/s over localhost (see the quick calculation at the end of these observations)

Librato Metrics agrees:

[Screenshot: Librato Metrics graph of loopback traffic, captured 2016-11-15 2:35 PM]

  • The 800 tasks themselves are idle (each is just a bash script: while true ; do sleep 60 ; done)

  • Nomad executors and Consul consume a steady 25% CPU (on each core):

    Tasks: 442; 2 running
    Load average: 1.97 1.95 2.69
    Uptime: 19 days, 04:58:27
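Plugging the measured ~104 KB services payload into the bandwidth arithmetic above gives the same figure (throwaway calculation, nothing Nomad-specific):

package main

import "fmt"

func main() {
	const (
		executors     = 200      // tasks on this client
		intervalSecs  = 5.0      // observed polling interval
		responseBytes = 103906.0 // measured /v1/agent/services payload
	)
	reqPerSec := executors / intervalSecs // 40 services requests per second
	mbitPerSec := reqPerSec * responseBytes * 8 / 1e6
	fmt.Printf("%.0f req/s ~ %.1f Mbit/s over lo\n", reqPerSec, mbitPerSec) // ~33 Mbit/s
}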
    

Reproduction steps

  • Run any job with a high count, e.g. 200 or more executors per Nomad client. (Try with and without TCP/HTTP health checks.)
  • Observe CPU utilization from Nomad executors and Consul.
  • It appears all the CPU is being consumed by retransmitting cacheable data at a pretty high frequency.
  • Adding health checks to the job spec makes the problem worse, due to the periodic fetches asking for all health checks in addition to all services. In our tests, the lo traffic increases from ~36 Mbit/s to ~54 Mbit/s when health checks are included.
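For reference, a health check of the kind referred to above is attached to the service in the job's JSON roughly like this (field names reflect our recollection of the Nomad job API of this era, with durations in nanoseconds as elsewhere in the job file; treat it as an illustrative sketch and check the docs for your Nomad version):

"Checks": [
    {
        "Id": "",
        "Name": "alive",
        "Type": "tcp",
        "Command": "",
        "Path": "",
        "Protocol": "",
        "Interval": 10000000000,
        "Timeout": 2000000000
    }
]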

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

Job file (if appropriate)

{
    "Job": {
        "Region": "global",
        "ID": "foobar",
        "Name": "foobar",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "my-dc"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "event_handler_1",
                "Count": 800,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "event_handler",
                        "Driver": "exec",
                        "User": "",
                        "Config": {
                            "args": [
                                "-c",
                                "while true ; do sleep 60 ; done"
                            ],
                            "command": "/bin/bash"
                        },
                        "Constraints": null,
                        "Env": null,
                        "Services": [
                            {
                                "Id": "",
                                "Name": "foobar",
                                "Tags": [],
                                "PortLabel": "http",
                                "Checks": null
                            }
                        ],
                        "Resources": {
                            "CPU": 30,
                            "MemoryMB": 100,
                            "DiskMB": 100,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 5,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 5,
                    "Delay": 5000000000,
                    "Mode": "delay"
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null
    }
}
@schmichael (Member)

Definitely needs improving. Might be related to #1968 - still investigating that one.

kak-tus mentioned this issue Jan 9, 2017
@parasyte (Contributor, Author) commented Mar 3, 2017:

@schmichael Have you made any progress on this? Pinging because we have Nomad clients in production pushing 167 Mbps over lo while doing only 0.98 Mbps over eth0. In other words, this bug is generating roughly 160x more traffic than the tasks we're running in Nomad. The real cost here is CPU, since lo isn't I/O-bound.

@schmichael (Member)

@parasyte The good news is: yes! Much progress has been made! The bad news is that it won't land in time for 0.5.5 (which should be out any day now).

Definitely going to land in the next release though. Unless you use a lot of script checks, it should dramatically reduce Consul chatter. Since script checks have to heartbeat to Consul there's no getting around that traffic, but at least you can tune how often those run yourself (and the traffic should always incur far less CPU than exec'ing a script anyway).

I'll try to post binaries here when the PR is up if anyone wants to test. If it gets merged soon enough after 0.5.5 you may just want to run it, since there's really no workaround for the current Consul implementation's excessively chatty nature.

(If you care about the implementation details: the existing implementation runs a periodic per-task reconcile loop against Consul. It's a far more defensive approach than is needed, so I'm moving to a more straightforward talk-to-Consul-only-on-state-change approach.)
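In rough pseudocode, the shift looks something like the sketch below (illustrative names only, not the actual Nomad identifiers):

package sketch

import "time"

// Illustrative stand-ins only; none of these names are Nomad's real identifiers.
type ServiceClient struct{}

func (sc *ServiceClient) SyncAgainstConsul()           {} // would list ALL agent services/checks, then diff
func (sc *ServiceClient) Register(services []string)   {}
func (sc *ServiceClient) Deregister(services []string) {}

// Old approach: each executor reconciles against Consul on a timer, so every
// few seconds it re-fetches the agent's entire service and check lists.
func reconcileLoop(sc *ServiceClient, stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			sc.SyncAgainstConsul()
		case <-stop:
			return
		}
	}
}

// New approach: the client agent registers a task's services once when it
// starts and deregisters them once when it stops; Consul is only contacted
// on state changes.
func onTaskStarted(sc *ServiceClient, services []string) { sc.Register(services) }
func onTaskStopped(sc *ServiceClient, services []string) { sc.Deregister(services) }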

@parasyte (Contributor, Author) commented Mar 4, 2017:

Awesome!!! This sounds great, actually. I'm really looking forward to it. The description you provided matches my observations. Thanks for the update!

schmichael added a commit that referenced this issue Mar 30, 2017
Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task-related operations. When upgrading from an
earlier version of Nomad, existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
arbitrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul-related behavior.
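As a rough sketch of the driver-side capability the message describes (the interface name and signature here are assumptions for illustration, not necessarily the code that landed):

package sketch

import "context"

// ScriptExecutor is an illustrative name for what the commit message calls an
// "Exec method": a way for the client to run an arbitrary command inside a
// task's environment so script checks can be evaluated there.
type ScriptExecutor interface {
	// Exec runs cmd with args inside the task environment and returns the
	// combined output, the exit code, and any execution error; the exit code
	// maps onto the check's passing/warning/critical status.
	Exec(ctx context.Context, cmd string, args []string) (output []byte, exitCode int, err error)
}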
schmichael added further commits with the same message that referenced this issue on Apr 14 and Apr 19, 2017.
@schmichael (Member)

Fixed in master by #2467 -- will be in the 0.6.0 release. Actually demo'd this particular fix internally at HashiCorp! For 250 tasks with checks running on a single node, loopback traffic went from 60 Mbps in 0.5.6 to 60 Kbps.

CPU usage was significantly lower, but still not as low as we'd like. Some quick profiling revealed task stats collection as the next most likely CPU hog on idle systems.

If you have the time/ability/desire to test the Consul fix I've attached a binary here:

#2474 (comment)

schmichael added this to the v0.6.0 milestone Apr 20, 2017