Nomad Client generates lots of IOPS when idle, saturating HDD #9047

Closed
ashtuchkin opened this issue Oct 8, 2020 · 5 comments · Fixed by #9093
Comments

@ashtuchkin
Contributor

Nomad version

Nomad v0.12.1 (14a6893)

Operating system and Environment details

Ubuntu 16.04, Linux kernel 4.15.0-70
3 on-premises servers, each with 64 GB RAM, 32 CPU cores, and a 1 TB HDD.

Issue

I'm testing batch job scheduling on my small cluster and noticed that after running 1-2 thousand allocations, the Nomad clients slow down to a crawl: 5-10 minutes to start small jobs, even when there are no other jobs running. Restarting the clients doesn't help. The problem disappears only after I issue a nomad system gc command.

After a short investigation, I noticed that almost all of the time is spent by the Nomad client continuously writing to its state file (<data root>/client/state.db), fully saturating disk IOPS, which slows down everything else on the host. Note that this is with an idle cluster - no jobs are running at that point.

A deeper investigation (Go profiling) led me to the Client.saveState() function. It's called every 60 seconds by Client.periodicSnapshot() and calls PersistState() for every allocation tracked by the client. Each PersistState() internally issues a BoltDB transaction that unconditionally writes the current state of the allocation to the state file, ending with a mandatory fsync(). This results in a very high number of IOPS, especially for an HDD (which unfortunately I have to use), saturating disk access for the whole host.

As an example, an average HDD tops out at maybe ~120 IOPS. If we assume 2 I/O operations per PersistState() transaction, a 60-second snapshot interval can absorb about 120 × 60 / 2 ≈ 3600 allocations before disk I/O hits 100%. In my tests it's even less than that - around 2000. Again, this is in an idle state - nothing is actually running on the cluster at that time.
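For illustration, here's a minimal sketch of the write pattern described above - one write transaction, and therefore one fsync, per allocation per snapshot. This is not the actual Nomad code; it uses go.etcd.io/bbolt directly, and the bucket name, allocation IDs, and payload are made up:

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pretend the client is tracking 2000 completed allocations.
	allocIDs := make([]string, 2000)
	for i := range allocIDs {
		allocIDs[i] = fmt.Sprintf("alloc-%04d", i)
	}

	// One Update() per allocation: each call is its own write transaction and
	// every commit fsyncs the file, so IOPS grow linearly with the number of
	// tracked allocations even when their state never changes.
	for _, id := range allocIDs {
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("allocations"))
			if err != nil {
				return err
			}
			return b.Put([]byte(id), []byte("serialized allocation state"))
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}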

In my use case each batch job can have 1000+ allocations, easily surpassing the numbers above. After this happens, the cluster becomes unresponsive and it's pretty hard to fix it. The only way I found is issuing a nomad system gc command.

So I have a couple of questions:

  1. Is it really necessary to persist the state of every allocation every minute in idle state? Can you add a condition there to persist only if something has actually changed?
  2. If the above is hard to do, can you batch the writes so that, e.g., 100 allocations are persisted per transaction? That would decrease the IOPS requirements dramatically (a sketch of this idea follows the list below).
  3. Is there any workaround you could suggest?
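To make question 2 concrete, here's a minimal sketch of the batching idea using go.etcd.io/bbolt's Batch(), which coalesces concurrent writers into a small number of transactions so that one fsync covers many allocations. This is only a sketch, not a proposed patch; the bucket and key names are made up:

package main

import (
	"fmt"
	"log"
	"sync"

	bolt "go.etcd.io/bbolt"
)

func persistAlloc(db *bolt.DB, id string, state []byte) error {
	// Batch() groups calls made from concurrent goroutines into a single
	// write transaction (bounded by db.MaxBatchSize / db.MaxBatchDelay),
	// instead of one transaction + fsync per allocation. Note that Batch may
	// retry the function, so it must be safe to call more than once.
	return db.Batch(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("allocations"))
		if err != nil {
			return err
		}
		return b.Put([]byte(id), state)
	})
}

func main() {
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var wg sync.WaitGroup
	for i := 0; i < 2000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("alloc-%04d", i)
			if err := persistAlloc(db, id, []byte("serialized allocation state")); err != nil {
				log.Println(err)
			}
		}(i)
	}
	wg.Wait()
}

With batching, the number of transactions (and fsyncs) per snapshot is bounded by the batch size and delay rather than by the number of allocations.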

Note: I think garbage collection of allocations would help the situation somewhat, but if jobs run at unpredictable times, it would be very hard to tune the GC threshold period while still keeping allocation logs around long enough to help with debugging.

Separately, a minor thing: I noticed there's an attempt at parallelizing the persistence in the Client.saveState() function by running one goroutine per allocation. AFAIK it doesn't help at all, because write transactions in BoltDB are serialized with a lock, so each goroutine has to wait its turn anyway, making the whole thing effectively sequential. Not a big issue, just muddies the code a little.
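A tiny demonstration of that serialization (again a sketch, not Nomad code): BoltDB allows only one write transaction at a time, so N goroutines each calling Update() still commit, and fsync, one after another, and the total time is roughly N sequential commits:

package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each Update() takes the DB's single writer lock, so these
			// transactions run sequentially despite the goroutines.
			err := db.Update(func(tx *bolt.Tx) error {
				b, err := tx.CreateBucketIfNotExists([]byte("allocations"))
				if err != nil {
					return err
				}
				return b.Put([]byte(fmt.Sprintf("alloc-%04d", i)), []byte("state"))
			})
			if err != nil {
				log.Println(err)
			}
		}(i)
	}
	wg.Wait()
	// Expect roughly the cost of 100 commits + fsyncs, not of a single one.
	fmt.Println("100 concurrent Update() calls took", time.Since(start))
}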

Reproduction steps

  1. Set up a cluster with HDD drives. I used 3 nodes, but it should work with one as well.
  2. Run an echo "hello world" batch job with job.group.count=3000.
  3. Observe that the HDD remains saturated long after the job has finished and the cluster is idle, and that it takes 5-10 minutes to run a trivial task.

Job file

Any simple job should do. I'm using the docker driver with something like this:

job "test" {
  datacenters = ["dc1"]
  type = "batch"
  group "tasks" {
    count = 3000
    task "main" {
      driver = "docker"
      config {
        image = "alpine"
        args = ["echo", "Hello world"]
      }
    }
  }
}

Nomad Client logs (if appropriate)

I have some logs and pprof traces, but not sure how helpful they would be. Let me know if you need them.

@ashtuchkin
Contributor Author

Update: I just stumbled on the logic in the helper/boltdd package that uses hashing to avoid writing the same data to the database (https://github.com/hashicorp/nomad/blob/master/helper/boltdd/boltdd.go#L316), so it seems my first question is moot. On the other hand, the IOPS load is still there, so my current hypothesis is that the actual writes happen in the transaction begin/commit stages.
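For reference, a simplified paraphrase of that hash-check idea (not the actual boltdd code; the bucket name, key, and helper are made up): skip the Put when the encoded value hashes the same as what was written last time.

package main

import (
	"crypto/sha256"
	"log"

	bolt "go.etcd.io/bbolt"
)

// lastHashes caches the hash of the value last written for each key.
var lastHashes = map[string][32]byte{}

func putIfChanged(db *bolt.DB, key string, val []byte) error {
	h := sha256.Sum256(val)
	if prev, ok := lastHashes[key]; ok && prev == h {
		return nil // unchanged: no transaction, no fsync
	}
	err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("allocations"))
		if err != nil {
			return err
		}
		return b.Put([]byte(key), val)
	})
	if err == nil {
		lastHashes[key] = h
	}
	return err
}

func main() {
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The second call is a no-op because the value is unchanged.
	_ = putIfChanged(db, "alloc-0001", []byte("serialized allocation state"))
	_ = putIfChanged(db, "alloc-0001", []byte("serialized allocation state"))
}

Note that this sketch skips the transaction entirely when nothing changed; if the real code performs the check inside an already-open write transaction, that would line up with the hypothesis above that the remaining IOPS come from the commits themselves.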

@tgross
Member

tgross commented Oct 12, 2020

Hi @ashtuchkin, just wanted to make sure you knew we saw this one and someone will look into it.

@ashtuchkin
Contributor Author

ashtuchkin commented Oct 12, 2020 via email

ashtuchkin added a commit to ashtuchkin/nomad that referenced this issue Oct 14, 2020
Fixes hashicorp#9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into a small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.
@ashtuchkin
Contributor Author

I've added a PR that fixes this issue in our environment. We'll use my fork with this change for now to unblock the tests, but I'd be happy to go back to using upstream when this issue is fixed (whether by merging the PR or not).

tgross pushed a commit that referenced this issue Oct 20, 2020
Fixes #9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.