
High number of log streaming go-routines: Panic: out of memory #2840

Closed
mlehner616 opened this issue Jul 16, 2017 · 12 comments

Comments

@mlehner616

mlehner616 commented Jul 16, 2017

A nomad client node crashed with an out-of-memory error despite having reserved resources configured. The instances have just shy of 4 GB of memory available, and memory utilization at the time of the crash (estimated from the relocated allocations) was about 650 MB. Given how few allocations there were at the time, the client nodes were barely utilized, which leads me to think the nomad client on this particular node hit some kind of memory leak until it crashed. I suppose it's also possible my ulimit settings are misconfigured; I will look into this.

{
    "log_level": "INFO",
    "data_dir": "/var/lib/nomad",
    "datacenter": "qa",
    "client": {
        "enabled": true,
        "options": {
            "driver.raw_exec.enable": "1",
            "docker.auth.config": "/root/.docker/config.json"
        },
        "reserved": {
            "cpu": "200",
            "memory": "128",
            "disk": "1024",
            "reserved_ports": "22,80,443,8500-8600"
        }
    }
}

Nomad version

Nomad v0.6.0-rc1 (5187fcf+CHANGES)

Operating system and Environment details

Amazon Linux kernel: 4.9.32-15.41.amzn1.x86_64

Issue

nomad client crashed with the following error:
fatal error: runtime: out of memory

Reproduction steps

I was deploying updates to a job that included additions of invalid template stanzas.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

https://gist.github.com/mlehner616/20d82b2df37054e62fc9041a96498ce7

First example allocation attempt (Invalid Template stanza)

failed to build task's template manager: failed to read template: open /var/lib/nomad/alloc/25b5c177-cfe9-1e90-8458-85fb24fb0fd4/example1/config/default.json.tpl: no such file or directory

Second example allocation attempt (Invalid App config)

Docker container exited with non-zero exit code: 1

Job file (the consul keys were not escaped correctly; a corrected template snippet follows the job file)

job "example" {
  datacenters = ["qa"]
  type = "service"
  constraint {
    attribute = "${node.class}"
    value     = "qa-infrastructure-cluster"
  }
  update {
    stagger = "5s"
    max_parallel = 1
  }
  meta {
    CI_COMMIT_REF_SLUG = "branch-name"
  }
  group "example-group" {
    count = 6
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "example" {
      driver = "docker"
      env {
        NODE_ENV = "qa"
      }
      template {
        data = <<EOH
        {
          "server": {
            "port": 3000
          },
          "api": {
            "signing": {
              "enabled": true,
              "secret": "\{\{ key "some/otherkey" \}\}",
              "timeout": 300000,
            },
            "endpoints": {
              "external": "http://external-service.com/"
            },
            "postBodySizeLimit": "20mb"
          },
          "provider": {
            "apiKey": "\{\{ key "some/key" \}\}"
          },
          "defaults": {
            "from": "example@example.com",
            "fromname": "Not Real"
          }
        }
        EOH

        destination = "config/default.json"
      }
      config {
        image = "123456789012.dkr.ecr.us-west-2.amazonaws.com/example:latest"
        # command = ""
        port_map {
          web = 3000
        }
      }
      resources {
        cpu    = 100 # 100 MHz
        memory = 256 # 256MB
        network {
          mbits = 1
          port "web" {}
        }
      }
      service {
        name = "example"
        tags = ["qa", "example-group"]
        port = "web"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
        check {
            type     = "http"
            port     = "web"
            path     = "/health"
            interval = "5s"
            timeout  = "3s"
        }
      }
    }
  }
}
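
For reference, Nomad's template stanza renders its data with consul-template, whose actions are written with unescaped braces; escaping them as above makes the literal \{\{ ... \}\} text pass through into the rendered file (which is also invalid JSON) instead of performing the key lookup. A corrected sketch of the two affected values, keeping the original key paths:

      template {
        data = <<EOH
        {
          "api": {
            "signing": {
              "secret": "{{ key "some/otherkey" }}"
            }
          },
          "provider": {
            "apiKey": "{{ key "some/key" }}"
          }
        }
        EOH

        destination = "config/default.json"
      }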
@dadgar
Contributor

dadgar commented Jul 17, 2017

@mlehner616 Thanks for the great bug reports!

@dadgar
Contributor

dadgar commented Jul 17, 2017

@mlehner616 Do you by chance have the full panic?

@mlehner616
Author

mlehner616 commented Jul 17, 2017

@dadgar The entire log for this crash is ~327MB, do you have a preferred method for me to get this to you?

@mlehner616
Author

@dadgar This is the only panic block I see (same as the gist). The rest of the log (>4 million lines) consists of repeated goroutine stack traces. I saved these logs off the server and zipped them down to an impressive 10 MB, so if you want the whole thing I'd be happy to get it over to you somehow.

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x1317a26, 0x16)
	/opt/go/src/runtime/panic.go:596 +0x95
runtime.sysMap(0xc4ffed0000, 0x100000, 0xc4f7e8a800, 0x1b87bd8)
	/opt/go/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0x1b6e540, 0x100000, 0x2)
	/opt/go/src/runtime/malloc.go:428 +0x374
runtime.(*mheap).grow(0x1b6e540, 0x8, 0x0)
	/opt/go/src/runtime/mheap.go:774 +0x62
runtime.(*mheap).allocSpanLocked(0x1b6e540, 0x8, 0x42f5e4)
	/opt/go/src/runtime/mheap.go:678 +0x44f
runtime.(*mheap).alloc_m(0x1b6e540, 0x8, 0x7f0100000000, 0x424958)
	/opt/go/src/runtime/mheap.go:562 +0xe2
runtime.(*mheap).alloc.func1()
	/opt/go/src/runtime/mheap.go:627 +0x4b
runtime.systemstack(0x7fce61ffadd8)
	/opt/go/src/runtime/asm_amd64.s:343 +0xab
runtime.(*mheap).alloc(0x1b6e540, 0x8, 0x10100000000, 0x7fce70066870)
	/opt/go/src/runtime/mheap.go:628 +0xa0
runtime.largeAlloc(0x10000, 0x113a101, 0x113a1c0)
	/opt/go/src/runtime/malloc.go:795 +0x93
runtime.mallocgc.func1()
	/opt/go/src/runtime/malloc.go:690 +0x3e
runtime.systemstack(0xc42001a000)
	/opt/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
	/opt/go/src/runtime/proc.go:1132
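
For a dump of this size, a rough tally of goroutines by their top stack frame can be made with standard shell tools; a minimal sketch, assuming the log is saved as nomad.err:

# total number of goroutines in the dump
grep -c '^goroutine ' nomad.err

# group goroutines by the function at the top of each stack
# (strip the hex argument lists so identical frames aggregate)
grep -A1 '^goroutine ' nomad.err | grep -v -e '^goroutine ' -e '^--$' | sed 's/(.*$//' | sort | uniq -c | sort -rn | head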

@dadgar
Contributor

dadgar commented Jul 18, 2017

@mlehner616 Yeah, if you can upload it here that would be great; if not, send it to alex@hashicorp.com

@mlehner616
Author

mlehner616 commented Jul 18, 2017

@dadgar nomad.err.zip

@schmichael
Member

@mlehner616 The panic contains ~600k log-streaming-related goroutines, so I'm guessing we're leaking them somewhere. Are you using the log API (either via HTTP or the nomad logs command)? I doubt you're opening 600k connections at once, but more details may help debug where the leak is.

Thanks!
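
Both interfaces schmichael mentions go through the same client endpoint. A sketch of each, with the allocation ID, task name, and client address as placeholders:

# follow a task's logs from the CLI; each invocation holds a streaming connection open
nomad logs -f -tail <alloc-id> <task-name>

# the underlying HTTP endpoint, served by the client node (default port 4646)
curl "http://<client-address>:4646/v1/client/fs/logs/<alloc-id>?task=<task-name>&type=stdout&follow=true&origin=end"

Every active follow holds goroutines on the client for the life of the connection, which is presumably why a leak here shows up as runaway goroutine counts.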

@mlehner616
Author

@dadgar @schmichael Yes, I may have been experimenting with the nomad logs CLI command at least a few minutes before it crashed.

I was definitely looking at logs through the hashi-ui, and I imagine that uses the API.

I was deploying job updates with changes mostly to the template stanza, trying to figure out how to get my particular desired consul integration right. While I was doing this, I was checking both the logs and the /config directory to verify my template rendered correctly.

Is it odd, though, that only one node spun out of control here? I suppose I could have gotten lucky and was checking logs for a number of allocs that were all on the same server, although I'm not familiar with how nomad bin packs.
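
Whether the tailed allocations all landed on one node can be read back from the CLI; a sketch using the 0.6-era command names, with IDs as placeholders:

# show an allocation's placement (see the Node ID field)
nomad alloc-status <alloc-id>

# show a client node's details, including its allocations
nomad node-status <node-id>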

@dadgar
Contributor

dadgar commented Jul 19, 2017

Going to change the title, as this is (very likely) unrelated to the release-candidate changes.

@dadgar changed the title from "0.6.0-rc1: Panic: out of memory" to "High number of log streaming go-routines: Panic: out of memory" on Jul 19, 2017
@schmichael
Member

@mlehner616 Any chance you've tried to reproduce this on Nomad 0.7.1? It contained what seems like a relevant fix, except that it really should have only affected Windows client nodes.

@schmichael
Member

I've been unable to reproduce this and log streaming has changed substantially since 0.6, so I'm going to close this. Sorry we couldn't find a direct resolution, but please don't hesitate to reopen if you run into this issue again!
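
For anyone investigating similar goroutine growth on a live agent: if the agent runs with debug endpoints enabled, the goroutine profile can be sampled over HTTP without waiting for a panic. A sketch, assuming the enable_debug agent option and the default HTTP port of 4646:

# agent config: expose Go's pprof endpoints (avoid on untrusted networks)
enable_debug = true

# sample the live goroutine profile and skim the top entries
curl -s "http://localhost:4646/debug/pprof/goroutine?debug=1" | head -n 20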

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 30, 2022