
Tens of thousands of open file descriptors to a single nomad alloc logs directory #2024

Closed
sheldonkwok opened this issue Nov 23, 2016 · 57 comments · Fixed by #2079


@sheldonkwok
Contributor

sheldonkwok commented Nov 23, 2016

Nomad version

Nomad v0.4.1

Operating system and Environment details

Linux ip-10-201-5-129 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Issue

Nomad has tens of thousands of open file descriptors to an alloc log directory.

nomad 2143 root *530r DIR 202,80 4096 8454154 /var/lib/ssi/nomad/alloc/14e62a40-8598-2fed-405e-ca237bc940c6/alloc/logs

Lines like that are repeated roughly 60,000 times.
lsof -p 2143 | wc -l returns ~60000

I stopped the alloc but the descriptors are still there.

In addition, the nomad process is approaching 55 GB of memory used.
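
For reference, a quick way to see what those descriptors point at without parsing lsof is to walk /proc/<pid>/fd directly. A minimal Go sketch, assuming Linux and root access to the agent's /proc entry; the PID below is just the agent PID from the lsof output above:

// fdcount: tally a process's open file descriptors by link target.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pid := "2143" // placeholder: the nomad agent PID from the lsof output above
	fdDir := filepath.Join("/proc", pid, "fd")
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		panic(err)
	}
	counts := map[string]int{}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue // descriptor closed between ReadDir and Readlink
		}
		counts[target]++
	}
	for target, n := range counts {
		fmt.Printf("%6d %s\n", n, target)
	}
}

Against an agent in the state described above, this should show the same alloc logs directory counted tens of thousands of times.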

@dadgar
Contributor

dadgar commented Nov 23, 2016

Hey, could you post the open file descriptors in a gist?

A couple of questions:

  1. Are you running raw_exec jobs?
  2. Can you grab the memory usage of all nomad processes? (A tree view would be great.)
  3. Do you have enable_debug set? If so, can you grab a memory profile of the agent by hitting:

go tool pprof -output=mem.profile http://127.0.0.1:4646/debug/pprof/heap
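
If go tool pprof isn't convenient on the client box, a minimal Go sketch that just saves the raw profile for offline analysis is below. It assumes enable_debug is set and the agent's HTTP API is on the default 127.0.0.1:4646.

// saveheap: download the agent's heap profile for later analysis with
// `go tool pprof mem.profile`.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("mem.profile")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}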

@sheldonkwok
Contributor Author

  1. Not running raw_exec jobs.
  2/3) Will do next time we run into the issue.

@sheldonkwok
Contributor Author

[htop screenshot]

Here's a glimpse of htop. The process just ran out of memory and rebooted.
The logs are filled with "dropping update to alloc"

Unfortunately, this agent did not have enable_debug set.

@sheldonkwok
Contributor Author

[memory usage graph]

Here's a graph of memory usage from our alerting system.

@dadgar
Copy link
Contributor

dadgar commented Nov 26, 2016

@sheldonkwok can you attach the logs and the open file descriptors?

Does this happen regularly? The memory profile will help a lot!

@sheldonkwok
Contributor Author

Using the docker driver. Debug is on now, so I'll have to catch it in the bad state before it restarts.

@sheldonkwok
Contributor Author

Bad news: we're getting net/http: timeout awaiting response headers when trying to run go tool pprof. Notably, we were running the logging command before this, and the 60,000 open file descriptors look like:

nomad   2113 root *012r      DIR  202,80     4096 8456000 /var/lib/ssi/nomad/alloc/ddf27f57-9179-2fbf-9c44-69f54a28f17b/alloc/logs
nomad   2113 root *013r      DIR  202,80     4096 8456000 /var/lib/ssi/nomad/alloc/ddf27f57-9179-2fbf-9c44-69f54a28f17b/alloc/logs
nomad   2113 root *014r      DIR  202,80     4096 8456000 /var/lib/ssi/nomad/alloc/ddf27f57-9179-2fbf-9c44-69f54a28f17b/alloc/logs

repeated many, many times.

@jippi
Contributor

jippi commented Nov 30, 2016

Got anything in the client or server logs relevant to the issue, e.g. repeated actions?

@sheldonkwok
Contributor Author

sheldonkwok commented Nov 30, 2016

A lot of
client: alloc "00e0bb67-624c-5221-087b-ba31f5740b43" in terminal status, waiting for destroy
in the client logs, but that seems more a product of the client running out of memory.

@sheldonkwok
Contributor Author

sheldonkwok commented Dec 1, 2016

We encountered this issue again after hitting the logging api for one of the allocs. It's crashing multiple production nomad clients.

@dadgar
Contributor

dadgar commented Dec 1, 2016

@sheldonkwok Can you show the nomad fs output for that folder? What does the task's logging look like? What is the log config?

@sheldonkwok
Contributor Author

Mode        Size    Modified Time          Name
-rw-r--r--  31 KiB  12/01/16 21:25:59 UTC  service.stderr.0
-rw-r--r--  0 B     12/01/16 06:31:26 UTC  service.stdout.0
      LogConfig: {
        MaxFileSizeMB: 10,
        MaxFiles: 10,
      },

Nothing too crazy, it seems.

@dadgar
Contributor

dadgar commented Dec 1, 2016

Does this reproduce every time you fetch logs, or just occasionally? What log command are you running? Are the file descriptors open on the Nomad client or the executor?

@sheldonkwok
Contributor Author

sheldonkwok commented Dec 1, 2016

/v1/client/fs/logs/<Allocation-ID>
follow=true,origin=start,offset=0

We're not sure if it's every time or just occasionally. The descriptors are open on the client.
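
For context, a minimal Go sketch of the kind of client that hits that route is below. The alloc ID, task name, and address are placeholders, and the task/type query parameters are assumptions beyond the follow/origin/offset noted above; the key detail is that the response body is closed when the caller stops following.

// taillogs: stream one task's logs from the client fs API.
package main

import (
	"context"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

func main() {
	q := url.Values{}
	q.Set("task", "service") // placeholder task name (assumed parameter)
	q.Set("type", "stderr")  // assumed parameter
	q.Set("follow", "true")
	q.Set("origin", "start")
	q.Set("offset", "0")

	// Stop following after 10 minutes so the streaming connection is released.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	allocID := "14e62a40-8598-2fed-405e-ca237bc940c6" // placeholder
	u := "http://127.0.0.1:4646/v1/client/fs/logs/" + allocID + "?" + q.Encode()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close() // always release the stream when done following

	io.Copy(os.Stdout, resp.Body) // raw copy of the streamed frames, for brevity
}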

@dadgar
Contributor

dadgar commented Dec 1, 2016

@sheldonkwok I cannot reproduce this on 0.4.1, 0.5.0, or master. It would be great if you could put together a Vagrant box or some reproduction steps so I can debug this. As of now it is a bit hard, since you can't even attach a profiler.

@sheldonkwok
Contributor Author

We're still working on reproduction steps because nothing we do crashes the system immediately; we're only alerted of issues later.

@sheldonkwok
Contributor Author

Managed to get the client into a state where memory usage is low but the number of open file descriptors is still high. Here's the memory profile.

http://pastebin.com/XbPH9qhw

@dadgar
Contributor

dadgar commented Dec 2, 2016 via email

@justinwalz
Contributor

justinwalz commented Dec 2, 2016

@sheldonkwok
Contributor Author

We upgraded to Nomad 0.5 and we're still experiencing this issue when we make multiple requests to an allocation's log route simultaneously. Our use case is that many of our devs want to see logs for their service at the same time.

@dadgar
Contributor

dadgar commented Dec 6, 2016

@sheldonkwok Can you get me this: curl http://127.0.0.1:4646/debug/pprof/goroutine?debug=2

How are you retrieving the logs? Programmatically?
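
A minimal Go sketch that saves that dump and counts the goroutine stacks is below; it should make it easier to correlate the goroutine count with the descriptor count. Same assumption as with the heap profile earlier: enable_debug is set and the agent listens on 127.0.0.1:4646.

// goroutines: fetch the full goroutine dump and count stacks.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	dump, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("goroutines.txt", dump, 0o644); err != nil {
		panic(err)
	}

	// In a debug=2 dump, every stack begins with a "goroutine N [state]:" header line.
	count := 0
	for _, line := range strings.Split(string(dump), "\n") {
		if strings.HasPrefix(line, "goroutine ") {
			count++
		}
	}
	fmt.Println("goroutines:", count)
}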

@sheldonkwok
Contributor Author

Will do next time it happens.
We are using the route /v1/client/fs/logs/<Allocation-ID> when one of the devs asks for logs.

@dadgar
Contributor

dadgar commented Dec 6, 2016

My current thought is that you may never be closing the connection, which is causing this. The goroutine dump should help correlate the number of descriptors with the number of times the route has been called (goroutines handling the request).

@sheldonkwok
Contributor Author

I was considering that, but the number of file descriptors jumped from dozens to tens of thousands pretty quickly when we only made a handful of simultaneous requests.

@leonoff

leonoff commented Dec 7, 2016

I reproduced this behavior with the nomad logs command, when a job contains more than one task and you try to see logs from them, e.g.:
nomad logs -tail -f -stderr -job job_name

@sheldonkwok
Contributor Author

sheldonkwok commented Dec 7, 2016

We unfortunately had to stop using nomad logs and the logs route because of this issue. We are instead ingesting the logs through Filebeat/Logstash/Elasticsearch, though that doesn't handle streaming as well. We will continue to try to reproduce.

@dadgar dadgar added theme/api HTTP API and SDK issues type/bug labels Dec 7, 2016
@leonoff

leonoff commented Dec 12, 2016

@diptanu @dadgar Tested the new binaries; the issue is fixed for me.

@dadgar
Contributor

dadgar commented Dec 12, 2016

@leonoff Awesome! Thank you!

@justinwalz
Contributor

@dadgar @diptanu Please see #2089.

Had no issues today with logging, but did have many clients panic.

@sheldonkwok
Contributor Author

The file descriptor issue just crept back onto our system today running Nomad 0.5.2... Will try to reproduce. Just letting everyone know.

@justinwalz
Contributor

This has still been happening a lot for us lately. Today there was an actual panic - I opened a separate ticket here, but there was heavy utilization of logs during the time of the crash. I still have a feeling there's an edge case in here somewhere...

@dadgar
Contributor

dadgar commented Jan 23, 2017

@justinwalz Do you have a list of the file descriptors that were open?

@justinwalz
Contributor

I got to the machine after the nomad process had already died, but I will get a dump next time it occurs live.

@justinwalz
Contributor

justinwalz commented Jan 26, 2017

Hi, it occurred again.

Nomad is using a ton of RAM and has a high count of open file descriptors.

ps aux | grep nomad

root       2449  649 39.0 59051896 49124000 ?   Ssl  19:17 868:09 /usr/local/bin/nomad agent -config /etc/nomad

Truncated output from lsof -p 2449:

nomad   2449 root *810r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *811r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *812r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *813r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *814r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *815r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *816r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *817r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *818r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *819r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *820r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *821r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *822r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *823r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *824r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *825r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *826r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *827r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *828r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *829r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *830r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *831r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *832r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *833r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *834r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *835r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *836r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *837r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *838r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *839r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *840r      DIR             202,81     4096 2621596 /var/lib/ssi/nomad/alloc/32b3a6a3-7fde-418b-2006-7c65b8ec1660/alloc/logs
nomad   2449 root *841r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *842r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad   2449 root *843r      DIR             202,81     4096 2621588 /var/lib/ssi/nomad/alloc/bd1fa749-7fa1-da62-faf1-8a3187ecc508/alloc/logs
nomad -v
Nomad v0.5.2

@justinwalz
Contributor

To add more context, I've attached the client logs. They include some errors about the logs API and "too many open files".

For what it's worth, nomad healed itself: memory usage came back under control and the file descriptors returned to normal without needing a restart.

client-logs.tar.gz

@dadgar
Contributor

dadgar commented Jan 28, 2017

@justinwalz Okay re-opening this. Do you have any reproduction steps?

@dadgar dadgar reopened this Jan 28, 2017
@dadgar dadgar added this to the v0.5.4 milestone Jan 28, 2017
@sheldonkwok
Contributor Author

We're still running into this issue every week. It makes the server unresponsive and forces reallocations for all of the jobs on it. The issue happens across all of our jobs intermittently when we attempt to fetch logs. It's difficult to reproduce, but it seems guaranteed to pop up with enough logging requests over time on a job. The shared logging config is:

      LogConfig: {
        MaxFileSizeMB: 10, 
        MaxFiles: 10, 
      },  

@dadgar
Contributor

dadgar commented Mar 14, 2017

@sheldonkwok I will make time to play with this soon. I was never able to reproduce it, however. Are you running both server and client on one machine?

@sheldonkwok
Contributor Author

Thanks, Alex. We are running 5 servers on their own machines, and the clients run separately as well.

@dadgar
Contributor

dadgar commented Apr 20, 2017

Hey, I just tried the following and couldn't reproduce it.

I ran this job:

job "example" {
  datacenters = ["us-west-2"]
  type = "service"

  group "cache" {
    count = 20

    restart {
      attempts = 60
      interval = "1m"
      delay = "1s"
      mode = "delay"
    }

    ephemeral_disk {
      size = 300
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        command = "/bin/bash"
        args = ["-c", "base64 /dev/urandom | head -c 10000000"]
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}

The job produces 10 MB of logs every few seconds and I streamed the logs of all 20 allocations to 3 different machines. I did this for 30+ minutes and transferred many GBs of logs from the client.

@sheldonkwok I did look at your logs, and it looks like a streaming connection is held for 50+ minutes but is still streaming stdout.0/stderr.0? Are the apps just not logging much?

I am going to try that as well and will report back.
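
For anyone else trying to reproduce this, a minimal Go load-generator sketch along the same lines is below: it opens many concurrent follow streams against one client. The alloc IDs and task name are placeholders, the address is the default, and the task/type parameters are assumptions on top of the follow/origin/offset parameters quoted earlier in the thread.

// logload: open many concurrent follow streams, roughly mimicking several
// people tailing several services at once.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	allocs := []string{
		"32b3a6a3-7fde-418b-2006-7c65b8ec1660", // placeholders from earlier output
		"bd1fa749-7fa1-da62-faf1-8a3187ecc508",
	}
	const streamsPerAlloc = 10 // several "viewers" per allocation

	var wg sync.WaitGroup
	for _, id := range allocs {
		for i := 0; i < streamsPerAlloc; i++ {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				u := "http://127.0.0.1:4646/v1/client/fs/logs/" + id +
					"?task=service&type=stderr&follow=true&origin=start&offset=0"
				resp, err := http.Get(u)
				if err != nil {
					fmt.Println("stream error:", err)
					return
				}
				defer resp.Body.Close()
				io.Copy(io.Discard, resp.Body) // drain until the stream ends
			}(id)
		}
	}
	wg.Wait()
}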

@dadgar
Contributor

dadgar commented Apr 20, 2017

Just tailed logs for over an hour and saw nothing as well.

@sheldonkwok any more detail on the logging pattern would be great.

@sheldonkwok
Contributor Author

Hey @dadgar, I really appreciate the effort you've spent trying to reproduce it! A normal logging pattern for us is multiple people tailing the logs of multiple services at the same time for many hours. Some of them may be tailing the same service as well.

@dadgar
Contributor

dadgar commented Apr 21, 2017

@sheldonkwok Hm, unfortunately I did try that.

So I had 3 nodes, each running ~7 allocations. I then followed the logs of all 20 allocations on 3 different machines for over an hour and a half.

I repeated this for two different workloads:

  1. A small trickle of logs output by each allocation.
  2. 10 MB of logs every few seconds.

Neither reproduced it. Further, when I stopped the logging, not a single extra FD was left in use (as in, there was no leaking that would eventually cause a problem).

@sheldonkwok
Contributor Author

Unfortunately, we cannot reproduce it effectively either. It generally happens once a week. We have thousands of allocations being created and running daily across hundreds of machines. In addition, users can be tailing thousands of them at once. Thanks again, @dadgar.

@dadgar
Contributor

dadgar commented Apr 24, 2017

@sheldonkwok Do you retain Nomad log files? Would you be willing to run a debug build that adds some logging that may help get to the bottom of this?

@sheldonkwok
Contributor Author

Sure, what do we set and how much extra logging does it generate?

@dadgar
Contributor

dadgar commented Apr 24, 2017

@sheldonkwok You will set log_level = "DEBUG". I don't know how much extra logging it will generate yet, as I need to spend some time on the debug build. I may get to it this week. Will let you know.

@schmichael schmichael removed this from the v0.5.4 milestone Jul 31, 2017
@nickethier
Member

@sheldonkwok I'm doing some cleanup of old issues and came across this one. The Nomad logging infrastructure has undergone many changes since this was originally opened. If you're still experiencing any problems, would you please open a new issue?

Thanks!

@github-actions

github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022