
nomad large batch jobs slowdown on clients #2590

Closed
OferE opened this issue Apr 26, 2017 · 30 comments

Comments

@OferE

OferE commented Apr 26, 2017

Nomad version

v0.5.3
(we later reproduced the issue with version 0.5.6)

Operating system and Environment details

Ubuntu 14.04 on AWS

Issue

We launched the following configuration:
3 nomad masters on m4.4xlarge machines.
97 nomad slaves on m4.4xlarge machines.

m4.4xlarge machines have 16 cores per host, each at 2300 MHz.

A cat of /proc/cpuinfo returns:
...
processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping	: 1
microcode	: 0xb000014
cpu MHz		: 2300.064
cache size	: 46080 KB
physical id	: 0
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs		:
bogomips	: 4600.12
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Any idea how to make such a setup work?

Reproduction steps

We created a nomad batch job with 50K tasks to run.
Each task is in its own group, since we don't want to bundle many tasks onto the same machine (we don't know how long each task will take).

We use raw_exec to launch our jobs, and we gave each task a scheduler limit of 8*2300 = 18400 CPU units.
Since each m4.4xlarge offers 16*2300 = 36800 CPU units, we would expect 2 groups/tasks to be scheduled on each node simultaneously.

In total we scheduled 50K groups.

The critical problems we saw are:

1. We didn't see 97*2 tasks running simultaneously. Instead we saw a burst of tasks (say 171), then the number dropped to zero, and only then did the next batch of tasks start (say 120). This is not what we expected - we expected a flat 97*2 tasks running the whole time...

2. The job ended up dead after only ~500 groups had run, while all the remaining groups/tasks were left in the pending state.

When running the job again (nomad run on the same job file), it got into a pending state with the following placement failures:

Placement Failure
Task Group "portal_for_each_lane_1000084509":
  * Class "com.manager" filtered 3 nodes
  * Class "com.admin" filtered 1 nodes
  * Constraint "computed class ineligible" filtered 4 nodes
  * Resources exhausted on 97 nodes
  * Class "com.mesim" exhausted on 97 nodes
  * Dimension "cpu exhausted" exhausted on 97 nodes

For some reason it thinks that the CPU is exhausted on all 97 nodes, even though all 97 nodes are running nothing at all.

Nomad Server logs (if appropriate)

(after I tried to run the job again, the log looks like this)

   Loaded configuration from /etc/nomad.d/consul.hc1, /etc/nomad.d/client.hc1
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true
               Version: 0.5.3

==> Nomad agent started! Log data will stream in below:

    2017/04/26 11:28:38 [INFO] raft: Node at 10.0.32.100:4647 [Follower] entering Follower state (Leader: "")
    2017/04/26 11:28:38 [INFO] serf: EventMemberJoin: ip-10-0-32-100.global 10.0.32.100
    2017/04/26 11:28:38.164263 [INFO] nomad: starting 16 scheduling worker(s) for [system service batch _core]
    2017/04/26 11:28:38.164789 [INFO] client: using state directory /var/nomad/client
    2017/04/26 11:28:38.164899 [INFO] client: using alloc directory /var/nomad/alloc
    2017/04/26 11:28:38.165047 [INFO] nomad: adding server ip-10-0-32-100.global (Addr: 10.0.32.100:4647) (DC: dc1)
    2017/04/26 11:28:38.165897 [INFO] fingerprint.cgroups: cgroups are available
    2017/04/26 11:28:38.169085 [INFO] fingerprint.consul: consul agent is available
    2017/04/26 11:28:38.306206 [INFO] client: Node ID "97302f73-ee96-08a0-383a-ce265460cc77"
    2017/04/26 11:28:38.307660 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["prems"]
    2017/04/26 11:28:39 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
==> Newer Nomad version available: 0.5.6 (currently running: 0.5.3)
    2017/04/26 11:28:43.388032 [ERR] client: registration failure: No cluster leader
    2017/04/26 11:28:45 [INFO] serf: EventMemberJoin: ip-10-0-0-100.global 10.0.0.100
    2017/04/26 11:28:45.951868 [INFO] nomad: adding server ip-10-0-0-100.global (Addr: 10.0.0.100:4647) (DC: dc1)
    2017/04/26 11:28:45 [INFO] serf: EventMemberJoin: ip-10-0-16-100.global 10.0.16.100
    2017/04/26 11:28:45.953752 [INFO] server.consul: successfully contacted 2 Nomad Servers
    2017/04/26 11:28:45.953851 [INFO] nomad: adding server ip-10-0-16-100.global (Addr: 10.0.16.100:4647) (DC: dc1)
    2017/04/26 11:28:45.953872 [INFO] nomad: Attempting bootstrap with nodes: [10.0.32.100:4647 10.0.0.100:4647 10.0.16.100:4647]
    2017/04/26 11:28:45.958310 [INFO] server.consul: successfully contacted 2 Nomad Servers
    2017/04/26 11:28:59.005200 [INFO] client: node registration complete
    2017/04/26 12:12:41.352503 [ERR] worker: failed to submit plan for evaluation 833231e7-081f-d399-c855-e9bdd0f7993a: rpc error: evaluation token does not match
    2017/04/26 12:12:41.403073 [ERR] worker: failed to nack evaluation '833231e7-081f-d399-c855-e9bdd0f7993a': rpc error: Token does not match for Evaluation ID
    2017/04/26 12:13:26.989375 [ERR] worker: failed to submit plan for evaluation 833231e7-081f-d399-c855-e9bdd0f7993a: rpc error: evaluation token does not match
    2017/04/26 12:13:26.990381 [ERR] worker: failed to nack evaluation '833231e7-081f-d399-c855-e9bdd0f7993a': rpc error: Token does not match for Evaluation ID


Nomad Client logs (if appropriate)

ubuntu@ip-10-0-9-206:~$ sudo cat /var/log/upstart/nomad.log
    Loaded configuration from /etc/nomad.d/consul.hc1, /etc/nomad.d/client.hc1
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false
               Version: 0.5.3

==> Nomad agent started! Log data will stream in below:

    2017/04/26 11:29:11.820077 [INFO] client: using state directory /var/nomad/client
    2017/04/26 11:29:11.820118 [INFO] client: using alloc directory /var/nomad/alloc
    2017/04/26 11:29:11.825524 [INFO] fingerprint.cgroups: cgroups are available
    2017/04/26 11:29:11.879295 [INFO] fingerprint.consul: consul agent is available
    2017/04/26 11:29:12.150424 [INFO] client: Node ID "569e9da2-4ec5-c4ae-12b2-ec7ee153aee6"
    2017/04/26 11:29:12.155235 [INFO] client.consul: discovered following Servers: 10.0.0.100:4647,10.0.32.100:4647,10.0.16.100:4647
    2017/04/26 11:29:12.163464 [INFO] client: node registration complete
    2017/04/26 11:29:28.867475 [INFO] client: node registration complete
==> Newer Nomad version available: 0.5.6 (currently running: 0.5.3)
    2017/04/26 12:06:42.521331 [INFO] client: task "portal_for_each_lane_1196321675" for alloc "8ecd8143-5c62-e3ad-cbe1-a049c5c05369" completed successfully
    2017/04/26 12:06:42.521352 [INFO] client: Not restarting task: portal_for_each_lane_1196321675 for alloc: 8ecd8143-5c62-e3ad-cbe1-a049c5c05369 
    2017/04/26 12:06:44.012948 [INFO] client: marking allocation 8ecd8143-5c62-e3ad-cbe1-a049c5c05369 for GC
    2017/04/26 12:06:52.172780 [INFO] client: task "portal_for_each_lane_94757013" for alloc "85b763f2-1bd9-d5c7-0e02-d056d887cb60" completed successfully
    2017/04/26 12:06:52.172799 [INFO] client: Not restarting task: portal_for_each_lane_94757013 for alloc: 85b763f2-1bd9-d5c7-0e02-d056d887cb60 
    2017/04/26 12:06:53.454801 [INFO] client: marking allocation 85b763f2-1bd9-d5c7-0e02-d056d887cb60 for GC
    2017/04/26 12:11:57.770511 [INFO] client: task "portal_for_each_lane_1045801617" for alloc "bda0e2f3-daf3-a68e-0dc5-c4917def60f9" completed successfully
    2017/04/26 12:11:57.770531 [INFO] client: Not restarting task: portal_for_each_lane_1045801617 for alloc: bda0e2f3-daf3-a68e-0dc5-c4917def60f9 
    2017/04/26 12:11:59.406721 [INFO] client: marking allocation bda0e2f3-daf3-a68e-0dc5-c4917def60f9 for GC
    2017/04/26 12:12:20.832093 [INFO] client: task "portal_for_each_lane_1203321634" for alloc "9c9780d8-db47-dff2-9ee3-bcf628ba1c72" completed successfully
    2017/04/26 12:12:20.832112 [INFO] client: Not restarting task: portal_for_each_lane_1203321634 for alloc: 9c9780d8-db47-dff2-9ee3-bcf628ba1c72 
    2017/04/26 12:12:22.253991 [INFO] client: marking allocation 9c9780d8-db47-dff2-9ee3-bcf628ba1c72 for GC
    2017/04/26 12:16:43.262685 [INFO] client: task "portal_for_each_lane_874096992" for alloc "b901492c-5c96-ec0c-f97b-9f55be863b3c" completed successfully
    2017/04/26 12:16:43.262704 [INFO] client: Not restarting task: portal_for_each_lane_874096992 for alloc: b901492c-5c96-ec0c-f97b-9f55be863b3c 
    2017/04/26 12:16:44.605594 [INFO] client: marking allocation b901492c-5c96-ec0c-f97b-9f55be863b3c for GC
    2017/04/26 12:17:08.058726 [INFO] client: task "portal_for_each_lane_418294313" for alloc "6ea61951-231f-8c27-66fe-4cc7c0a9c396" completed successfully
    2017/04/26 12:17:08.058744 [INFO] client: Not restarting task: portal_for_each_lane_418294313 for alloc: 6ea61951-231f-8c27-66fe-4cc7c0a9c396 
    2017/04/26 12:17:09.361289 [INFO] client: marking allocation 6ea61951-231f-8c27-66fe-4cc7c0a9c396 for GC
    2017/04/26 12:27:39.301535 [INFO] client: marking allocation 6ea61951-231f-8c27-66fe-4cc7c0a9c396 for GC
    2017/04/26 12:27:39.320392 [INFO] client: marking allocation bda0e2f3-daf3-a68e-0dc5-c4917def60f9 for GC
    2017/04/26 12:27:39.328752 [INFO] client: marking allocation 9c9780d8-db47-dff2-9ee3-bcf628ba1c72 for GC
    2017/04/26 12:27:39.370909 [INFO] client: marking allocation 85b763f2-1bd9-d5c7-0e02-d056d887cb60 for GC
    2017/04/26 12:27:39.373750 [INFO] client: marking allocation 8ecd8143-5c62-e3ad-cbe1-a049c5c05369 for GC
    2017/04/26 12:27:39.891904 [INFO] client: marking allocation b901492c-5c96-ec0c-f97b-9f55be863b3c for GC

Job file (if appropriate)

It's an 85 MB file.
This is how our configuration looks:

{
    "job": {
        "MESIMDispatcher_aws50k_2017-04-26_1425": {
            "datacenters": [
                "dc1"
            ],
            "meta": {
                "AWS_DEFAULT_REGION": "${PREMS_ENV_AWS_DEFAULT_REGION}",
                "AWS_ECR_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ECR_ACCESS_KEY_ID}",
                "AWS_S3_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ACCESS_KEY_ID}",
                "AWS_S3_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_SECRET_ACCESS_KEY}",
                "AWS_ECR_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_ECR_SECRET_ACCESS_KEY}"
            },
            "type": "batch",
            "group": {
                "portal_for_each_lane_1997568752": {
                    "portal_for_each_lane_1997568752": {
                        "task": [
                            {
                                "portal_for_each_lane_1997568752": {
                                    "config": {
                                        "args": [
                                            "--network=host",
                                            "-e TERM=xterm",
                                            "-v /opt/mesim_shared/:/tmp/versions/",
                                            "secret-id.dkr.ecr.us-east-1.amazonaws.com/mesim:latest",
                                            "/opt/mesim/mesim-run.sh \"some-secret-config\""
                                        ],
                                        "command": "/opt/nomad_docker_run.sh"
                                    },
                                    "driver": "raw_exec",
                                    "resources": {
                                        "network": {
                                            "mbits": 10,
                                            "port": {
                                                "db": {}
                                            }
                                        },
                                        "cpu": 18400,
                                        "memory": 2048
                                    }
                                }
                            }
                        ]
                    }
                },
               {... and so on for the remaining 49,999 groups}

@OferE
Author

OferE commented Apr 27, 2017

A few discoveries I made:

I launched a smaller configuration with 20 m3.medium machines, running a dummy command with sleep 30.
I submitted 5000 tasks to it, each in its own group, using a similar JSON file.
The CPU constraint I used is 1000 where each machine has 2500, so I expect to see 40 tasks running at any moment (2 per machine across 20 machines). The results are very poor:

  1. It turns out that the nomad agent takes an entire core of the machine for itself, spinning at 100% CPU and opening many threads on the machine.
  2. The agent's high CPU utilization prevents other tasks from running on the machine, so instead of a flat 40 tasks across the cluster, after a while I see fewer than 10 tasks running simultaneously.

This is a serious issue - and it's a blocker for us.

@OferE
Author

OferE commented Apr 27, 2017

Some of the bugs reported above are caused by #2169

@dovka

dovka commented Apr 27, 2017

[htop screenshot: threads_100pct_cpu - nomad agent threads spinning on CPU, also consuming 1.4 GB of RAM]

strace on one of the CPU-spinning threads, showing some of the IO waits: 100_CPU_strace.txt (attached)

@OferE
Author

OferE commented Apr 27, 2017

Another experiment:
We thought that maybe, if we used stronger machines, the nomad agent's 100% CPU issue would only occupy one core and the other cores could still do real work.
The results are poor.
We launched 20 m4.xlarge machines.
The nomad agent took about 2/3 of the total cores most of the time, with spikes to all cores some of the time.
This CPU issue makes nomad unusable for running batch jobs.
Unless it is solved we will be forced to move to AWS Batch :-(

@OferE OferE mentioned this issue Apr 27, 2017
@OferE
Author

OferE commented Apr 27, 2017

Another update:
We limited the nomad agent and its many child processes to one CPU using taskset (see the sketch below); this made nomad consume only one CPU. It didn't change the scheduling, which is still poor: after a while, at peak, nomad utilizes just 1/3 of the cluster in a good scenario - most of the time it's even less. This means that nomad scheduling for batch jobs is really broken.
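
For reference, a minimal sketch of the taskset pinning described above (illustrative only - it assumes the agent process is named nomad; children spawned afterwards inherit the affinity):

# illustrative only - pin every running nomad process to CPU core 0
for pid in $(pgrep -x nomad); do sudo taskset -cp 0 "$pid"; done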

@dadgar
Contributor

dadgar commented Apr 27, 2017

@OferE There is a lot here, but there is one critical issue: your job size of 85 MB. That just will not work. The job goes through Raft and is part of the payload the schedulers send around, which will cause the eval failures you are seeing:
2017/04/26 12:12:41.352503 [ERR] worker: failed to submit plan for evaluation 833231e7-081f-d399-c855-e9bdd0f7993a: rpc error: evaluation token does not match.

Can you instead have one task group and set the count to 50000?

@OferE
Author

OferE commented Apr 28, 2017

First of all, thanks for the reply! I highly appreciate it (as I've built a lot of infrastructure around nomad).

I did the experiment with 5000 groups - about 8.5 MB - still with the same results.
I don't understand your suggestion - using 1 group will not work, since it will try to schedule all 5000 on the same node and fail. What do you mean by one task group?

@dadgar
Contributor

dadgar commented Apr 28, 2017

@OferE I was suggesting something like this:

{
    "job": {
        "MESIMDispatcher_aws50k_2017-04-26_1425": {
            "datacenters": [
                "dc1"
            ],
            "meta": {
                "AWS_DEFAULT_REGION": "${PREMS_ENV_AWS_DEFAULT_REGION}",
                "AWS_ECR_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ECR_ACCESS_KEY_ID}",
                "AWS_S3_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ACCESS_KEY_ID}",
                "AWS_S3_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_SECRET_ACCESS_KEY}",
                "AWS_ECR_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_ECR_SECRET_ACCESS_KEY}"
            },
            "type": "batch",
            "group": {
                    "portal_for_each_lane_1997568752": {
                        "count": 50000,        <-----
                        "task": [
                            {
                                "portal_for_each_lane_1997568752": {
                                    "config": {
                                        "args": [
                                            "--network=host",
                                            "-e TERM=xterm",
                                            "-v /opt/mesim_shared/:/tmp/versions/",
                                            "secret-id.dkr.ecr.us-east-1.amazonaws.com/mesim:latest",
                                            "/opt/mesim/mesim-run.sh \"some-secret-config\""
                                        ],
                                        "command": "/opt/nomad_docker_run.sh"
                                    },
                                    "driver": "raw_exec",
                                    "resources": {
                                        "network": {
                                            "mbits": 10,
                                            "port": {
                                                "db": {}
                                            }
                                        },
                                        "cpu": 18400,
                                        "memory": 2048
                                    }
                                }
                            }
                        ]
                    }
                },

I am going to try to reproduce high cpu usage and will report back any findings.

@dadgar
Contributor

dadgar commented Apr 28, 2017

@OferE Can you set enable_debug = true on the client and re-run your job? While that is running, can you run:

$ go tool pprof -proto -output=dump.proto `which nomad` http://127.0.0.1:4646/debug/pprof/profile

@dadgar
Contributor

dadgar commented Apr 28, 2017

Also, you shouldn't expect a totally flat rate of allocations on a client, especially when the tasks finish at roughly the same time: the client has to mark the tasks as finished, then the scheduler has to detect that and place more work. So the count should dip and then go up again, etc.

@OferE
Author

OferE commented Apr 28, 2017

Thanks for the advice - I will try your suggestion on Sunday and report the results.

A few comments though:

  1. There is no need to launch many machines to reproduce the issues (high CPU / poor cluster utilization) - just launch 3 machines and give them 5000 tasks using the same technique I used; you will see it immediately.
  2. As for the flat rate - I know I shouldn't expect a flat rate, but what I saw is that after 3-5 minutes the cluster utilization is no more than 1/3 (in the best case), which is poor. I reproduced this with tasks that only run a sleep command, to reduce noise from the experiment (our original tasks are CPU-intensive and have diverse durations).
  3. Your suggestion is more elegant than what I did - an 85 MB file :-). Actually, I didn't expect my company to take this to such an extreme - now I understand that we will push this to much higher numbers, so I need the scheduler to be reliable...

@dadgar
Contributor

dadgar commented Apr 29, 2017

@OferE Yeah, I have reproduced it as well. I have some fixes and will try to get you a build with them applied on top of 0.5.6! I have some more testing and work to do on it, though. Hopefully I can get it to you by EOD Monday.

@OferE
Author

OferE commented Apr 29, 2017 via email

@OferE
Author

OferE commented Apr 30, 2017

I used your latest executable (thanks again).
I also changed my JSON to have 1 task group with count 5000.

I launched 3 m4.xlarge machines, each with 4*2300 MHz and 16 GB RAM.
The managers run on m4.2xlarge machines.
I increased the sleep to 120 seconds to reduce noise.

Here is how my job looks:

{
   "job": {
      "batchsmall": {
         "datacenters": [
            "dc1"
         ],
         "meta": {
            "AWS_DEFAULT_REGION": "${PREMS_ENV_AWS_DEFAULT_REGION}",
            "AWS_ECR_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ECR_ACCESS_KEY_ID}",
            "AWS_S3_ACCESS_KEY_ID": "${PREMS_ENV_AWS_ACCESS_KEY_ID}",
            "AWS_S3_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_SECRET_ACCESS_KEY}",
            "AWS_ECR_SECRET_ACCESS_KEY": "${PREMS_ENV_AWS_ECR_SECRET_ACCESS_KEY}"
         },
         "type": "batch",
         "group": {
            "batchsmall_0": {
               "count": 5000,
               "batchsmall_0": {
                  "task": [
                     {
                        "batchsmall_0": {
                           "config": {
                              "args": [
                                 "--network=host",
                                 "-e TERM=xterm",
                                 "-e SLEEP=120",
                                 "-v /opt/mesim_shared/:/tmp/versions/",
                                 "720146705806.dkr.ecr.us-east-1.amazonaws.com/dummy:cmd",
                                 "/runme.sh"
                              ],
                              "command": "/opt/nomad_docker_run.sh"
                           },
                           "driver": "raw_exec",
                           "resources": {
                              "network": {
                                 "mbits": 10,
                                 "port": {
                                    "db": {}
                                 }
                              },
                              "cpu": 1000,
                              "memory": 500
                           }
                        }
                     }
                  ]
               }
            }
         },
         "constraint": {
            "attribute": "${node.class}",
            "value": "com.mesim"
         }
      }
   }
}

CPU usage is still very high on the agent.
I counted the number of running containers using docker ps every 5 seconds (I also use docker swarm, so I can run docker commands against the whole cluster from a single machine); a sketch of the counting loop follows the numbers below.
(As you can see, the performance gets worse as time goes by.)

7  9  13  15  17  16  20  24 3  9  11  14  14  15  17  13  12  14  12  11  9  7  5  8  11  9  11  12  11  8  7  9  8  10  10  12  10  10  9  6  6  6  8  6  9  8  10  7  8  6  4  3  2  3  6  7  10  11  10  10  6  8  10  7  9  8  7  8  5  5  5  5  4  2  2  2  0  1  2  5  7  9  8  7  7  4  4  3  3  5  7  7  4  4  6  6  5  4  11  11  9  7  7  8  1  2  2  8  8  7  7  7  7  1  3  3  3  4  4  6  6  4  5  3  5  5  3  5  4  4  4  4  5  3  5  8  6  6  8  8  10  7  7  7  4  6  2  2  2  4  4  5  5  5  5  3  3  0  0  0  2  2  4  4  5  5  4  4  4  2  1  2  8  9  9  9  10  8  7  7  5  5  4  4  2  2  4  4  4  6  4  4  4  3  3  3  1  1  0  2  2  2  4  4  2  2  2  0  2  2  2  2  4  3  8  9  13  14  14  12  6  5  1  2  2  2  2  4  6  6  4  4  4  0  0  1  1  1  1  0  1  3  3  3  3  3  5  3  3  2  2  2  1  1  1  1  1  2  2  2  2  5  7  5  9  14  14  5  0  2  2  2  2  6  6  4  4  4  4  0  0  0  0  0  0  0  0  0  0  0  2  2  2  2  2  4  2  2  2  2  5  12  10  10  10  10  5  0  0  2  2  2  2  2  7  6  5  5  5  5  2  2  2  2  2  6  4  4  4  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  3  3  3  3  3  1  3  2  2  2  2  4  4  2  2  2  2  0  0  0  0  0  0  0  0  0  0  1  5  5  5  5  7  5  2  2  2  2  6  4  4  4  4  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  3  3  3  3  3  0  2  2  2  2  2  4  2  2  2  2  2  0  0  0  0  0  0  0  0  0  0  0  0  0  5  5  5  14  14  16  16  11  11  2  2  4  4  4  4  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  2  2  2  3  2  2  2  1  1  0  0  2  2  2  4  4  4  2  2  2  0  0  0  0  0  
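
For reference, a minimal sketch of the counting loop described above (illustrative only - the exact script was not posted; it assumes the docker CLI is pointed at the swarm manager):

# illustrative only - sample the number of running containers every 5 seconds
while true; do
    docker ps -q | wc -l
    sleep 5
done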

I finally got the profile right (I hope).
Please download it from the link below (GitHub refuses to upload it, even zipped):
https://filebin.ca/3Kq5s2bhJjmI/dump.proto

@dadgar
Contributor

dadgar commented May 3, 2017

@OferE So here is the first PR that addresses some of the performance issues. It will be a little harder to rebase onto 0.5.6, but if you want I can build a binary from that branch for you to test: #2610.

There are other optimizations I have noted as TODOs for 0.6.0, but I will not be able to get to them in the immediate time frame.

@dadgar
Contributor

dadgar commented May 3, 2017

Actually, I just built that branch. Here it is if you want to test:
nomad.zip

@OferE
Author

OferE commented May 4, 2017

Thank you so much for this!
From the few tests I have done, it looks like it works much better!

I did 2 experiments:

  1. 5000 tasks in a single group with count.
    The nomad agent's CPU usage was about 50% of a single core, and its memory footprint was low.
    The cluster was utilized very well.
  2. 5000 tasks as an array of groups (an 8.5 MB file).
    The nomad agent's CPU usage was much higher - more than 100% - but still much better than with the previous version.
    There is probably a memory leak - at some point memory reached 95% and then there was no more scheduling.
    Until the memory was exhausted, the cluster was utilized in a reasonable manner (but worse than in experiment 1).

I think the memory leak is also relevant to the first experiment, but there the memory does not reach such high numbers - though it is not getting freed.

I will test it on the original 50K real jobs and let you know...

Also, regarding this non-official executable you sent me:

  1. I installed it on both managers and workers.
    Since my managers also handle other clusters with regular services - must I install it on the managers, or can I install it just on the workers and that's it?
  2. When can we expect this fix to arrive in an official release?

Thanks again.

@OferE
Author

OferE commented May 4, 2017

Hi,
I did another experiment with 50K tasks.

It started fine but began to degrade, and when it reached about 12K tasks the results were poor again.
Only 1/3 of the cluster was utilized.

The nomad status command showed 360 tasks running (exactly what I expected), even though only ~100 tasks were actually running (I am checking this with docker swarm, and also using nomad status - counting the completed tasks and comparing to the count from a minute earlier; a rough sketch of that check follows).
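
For reference, a rough sketch of that kind of check (illustrative only - the job name is taken from the job file earlier in this thread, and the exact output format varies by Nomad version):

# illustrative only - compare what nomad reports with what is actually running,
# repeating a minute later to see how many tasks really completed
nomad status MESIMDispatcher_aws50k_2017-04-26_1425 | grep -c complete
docker ps -q | wc -l    # number of containers actually running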

Also - the memory leak I described earlier started to become significant (6% of 16 GB after 13K tasks completed).

These issues need to be solved for Nomad batch to be production ready...
Having said all of this - I am planning to group many commands into a single command and not exceed 2000 long-running tasks - this should improve things a lot and probably work around the bugs mentioned...

@dadgar
Contributor

dadgar commented May 4, 2017

@OferE You can run it just on the workers. As for the memory usage, that is part 2 of the optimization I would like to do. So that should go away.

I believe you are running raw_exec to launch docker containers. I would not be surprised if the issue in your 50K task test was due to the docker engine misbehaving. Have you tried that test with count = 50000 and just running:

driver = "raw_exec"
config {
   command = "sleep"
   args = ["10"]
}

Also, why aren't you using the docker driver?

@OferE
Author

OferE commented May 5, 2017

I don't believe it's docker. I'm almost sure it's nomad, though it might be a bug in my wrapping script (very unlikely too - in the first ~4000 tasks it works perfectly every time, and the script is small...).

The reason is that whenever I start a new big test on the same cluster, it behaves correctly for the first 4000 tasks. It becomes poor at around 13000 tasks.

If it were docker, one run after the other would show poor results from the beginning of the second run.
My script is small and has been tested heavily.

I think the problem is that some tasks finish and nomad, for some reason, does not get the exit signal of the script that wraps the containers in time. So when you check the status, nomad shows you a good running count, but there are fewer and fewer real processes in the cluster.

The check you did that queries running allocations is not good enough - you need to count how many have actually finished, and then you will see it clearly.

The docker driver is not good for us since it forces cgroups, under which I would be forced to limit memory and CPU for my batches. This is not good for us since our tasks have an unpredictable footprint; I prefer that the OS handles resource utilization for them (they have peaks etc.).
It is also very inconvenient to make the cgroups sizing decisions... In my real use case I just need to launch 2 tasks on each node and I'm fine.
Cgroups prevent you from getting all the resources of the node you are paying for.

Cgroups are good for data-center utilization, not for cloud usage with static allocation (where you allocate machines to a specific purpose such as batch/service).

@dadgar
Contributor

dadgar commented May 5, 2017

@OferE That isn't quite how cgroups work when using CPU shares. Under CPU contention the limits are enforced, but otherwise the processes are allowed to use as much CPU as is available, with the shares applied as a weight.
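
A minimal sketch of that behavior (illustrative only, not from this thread - plain docker with the --cpu-shares flag, showing shares acting as a weight rather than a cap):

# illustrative only - on an otherwise idle host each container can use a full core;
# only under CPU contention do the shares matter, splitting CPU roughly 2:1
docker run -d --name high-share --cpu-shares=2048 busybox sh -c 'while :; do :; done'
docker run -d --name low-share  --cpu-shares=1024 busybox sh -c 'while :; do :; done'
docker stats --no-stream high-share low-share   # compare the CPU% columns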

When we get the memory use sorted out, let's run the benchmark again and grab some profiling data early in the run and then again later when you see the slowdown, so we can see what is happening!

@dadgar dadgar changed the title from "nomad large batch jobs failures" to "nomad large batch jobs slowdown on clients" May 5, 2017
@OferE
Author

OferE commented May 5, 2017

For CPU you are right - but for memory, cgroups enforce a hard limit, which is not good enough.
The docker driver also disables swap usage. The combination of hard limits and no swap is very problematic in some peak cases (for batch, but also for regular services).

Like I said - under high load the problem is most probably nomad's signal handling (since the scheduler always shows a good number of running tasks, while in reality the numbers are poor).

I'll be happy to help with any profiling you ask for, and again I am grateful for your engagement in this issue!

@dadgar
Contributor

dadgar commented May 5, 2017

Hmm, yeah, I don't think it is that, as the code is quite simple there. It is literally waiting on a Golang Process: https://github.com/hashicorp/nomad/blob/master/client/driver/executor/executor.go#L402.

If it is easy enough, do you want to grab the profiling data without the memory fixes, since it looks like in the 50K test you weren't affected by memory too much?

@OferE
Author

OferE commented May 5, 2017

I'll do it - though reaching 50K is very expensive for us because of the performance issue.
I'll run 50K and do the profiling around 13K jobs (at that point the performance is not good).

I'll also try to dig in more - maybe I'm wrong ;-) (I really don't know the nomad code.)

@burdandrei
Contributor

burdandrei commented Jun 29, 2017

Interesting solution to the memory-limiting problem, @OferE. We're running ~500 workers on shared machines and were forced to add memory usage monitoring before starting to move them to nomad. Unfortunately, after running the first 100, we saw high CPU and memory usage on the host. BTW, our nomad file is ~700 KB; it's one job with 100 groups, and some of them have counts inside.
@dadgar maybe this is the issue, and splitting it into separate jobs will help the agent?
And is there any option to tell nomad not to kill a container if it uses too much memory?

@dadgar
Contributor

dadgar commented Jun 30, 2017

I would like to revisit this once 0.6.0 is out. A good number of performance enhancements came out of debugging what @OferE provided.

Splitting the job would help but 0.6.0 should mitigate the need to do so!

@dadgar
Contributor

dadgar commented Aug 1, 2017

@OferE Bumping this since 0.6.0 is now out! Are you all seeing improved performance?

@OferE
Author

OferE commented Aug 2, 2017

I haven't had a chance to upgrade yet - I will keep you posted once I get back to this project (it might take a while). The ad hoc version you gave back then really helped. Thank you so much for this.

@dadgar
Contributor

dadgar commented Aug 2, 2017

@OferE Awesome! Glad it helped you! I am going to close this, because in the related high-CPU bug people are reporting that 0.6.0 has fixed the issue. When you upgrade, we can re-open it if it is still an issue! I would appreciate it if you could post your findings either way! Thanks 👍

@dadgar dadgar closed this as completed Aug 2, 2017
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2022