nomad allocates all RAM on server and crashes the box #1817

Closed
jippi opened this issue Oct 15, 2016 · 6 comments

jippi commented Oct 15, 2016

Hi,

Running 0.4.1, I'm seeing some weird behavior from Nomad. It happens roughly daily on our web servers.

The result is that the server becomes unresponsive for ~10 minutes while things OOM, and then slowly recovers. No allocations or changes are made during these outages; the one in the logs happened on a Saturday, with no one working or logged into the systems.

I'm also observing that the node is in the ready state, but system jobs will not actually restart on the node (Job + Allocation). Restarting Nomad makes the allocation succeed again: https://gist.github.com/jippi/046840d5c6c65b4e0e1ea32ea2424242

Log (with debug on) https://gist.github.com/jippi/95b88ef66fd592206406ba9d312ca228

Interestingly, the two clients that show this behavior are physical servers, whereas the $x other clients in the cluster, which run inside KVM, don't act up like this.

They are provisioned identically with Puppet. Their only major differences are physical vs. virtual machine, and that the web boxes (which see this issue) also have active Docker jobs running, whereas the other servers have Docker running but nothing allocated on it.

Allocation executor logs

https://gist.github.com/jippi/83a32fce9d409a32fa6175b5793d7c2c

config.hcl

bind_addr = "0.0.0.0"
datacenter = "production"
region = "global"
data_dir = "/opt/nomad/data"
log_level = "DEBUG"

advertise {
  http = "???.???.91.111:4646"
  rpc = "???.???.91.111:4647"
  serf = "???.???.91.111:4648"
}

addresses {
  http = "0.0.0.0"
  rpc = "0.0.0.0"
  serf = "0.0.0.0"
}

client {
  enabled = true
  servers = ["nomad.service.bownty:4647"]

  options = {
    "driver.raw_exec.enable" = "1"
  }

  node_class = "web"

  meta {
    "web" = "1"
  }
}

consul {
  address               = "127.0.0.1:8500"

  server_service_name   = "nomad"
  server_auto_join      = true

  client_service_name   = "nomad-client"
  client_auto_join      = true
}

http_api_response_headers {
  Access-Control-Allow-Origin   = "*"
  Access-Control-Expose-Headers = "x-nomad-index"
  Access-Control-Allow-Methods  = "GET, POST, OPTIONS"
}

nomad agent-info

-> nomad agent-info
client
  heartbeat_ttl = 12.593495744s
  known_servers = 3
  last_heartbeat = 10.843486073s
  node_id = 9ff0ea83-ede6-9143-adca-aaed5c3e6553
  num_allocations = 7
runtime
  arch = amd64
  cpu_count = 8
  goroutines = 85
  kernel.name = linux
  max_procs = 5
  version = go1.7

node as seen from /v1/node/:id

{
  "ID": "9ff0ea83-ede6-9143-adca-aaed5c3e6553",
  "Datacenter": "production",
  "Name": "web02",
  "HTTPAddr": "xxx.zzz.91.111:4646",
  "Attributes": {
    "unique.storage.volume": "/dev/disk/by-uuid/0cfc07c4-4b8f-4709-aaad-2ee1a1854762",
    "unique.network.ip-address": "xxx.yyy.91.111",
    "cpu.totalcompute": "27992",
    "driver.java.version": "1.8.0_72",
    "cpu.modelname": "Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz",
    "driver.exec": "1",
    "os.version": "7.9",
    "unique.cgroup.mountpoint": "/sys/fs/cgroup",
    "driver.java.runtime": "Java(TM) SE Runtime Environment (build 1.8.0_72-b15)",
    "driver.java.vm": "Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)",
    "driver.docker": "1",
    "unique.storage.bytestotal": "1876063666176",
    "driver.raw_exec": "1",
    "driver.java": "1",
    "unique.hostname": "web02",
    "kernel.name": "linux",
    "arch": "amd64",
    "cpu.numcores": "8",
    "kernel.version": "4.7.6",
    "nomad.revision": "'8fdc55e16b54f176a711c115966ba234e8bb7879+CHANGES'",
    "cpu.frequency": "3499",
    "os.name": "debian",
    "unique.storage.bytesfree": "1763363778560",
    "nomad.version": "0.4.1",
    "driver.docker.version": "1.12.1",
    "memory.totalbytes": "33569710080"
  },
  "Resources": {
    "CPU": 27992,
    "MemoryMB": 32014,
    "DiskMB": 1681674,
    "IOPS": 0,
    "Networks": [
      {
        "Device": "eth0",
        "CIDR": "xxx.zzz.91.111/32",
        "IP": "xxx.zzz.91.111",
        "MBits": 1000,
        "ReservedPorts": null,
        "DynamicPorts": null
      }
    ]
  },
  "Reserved": {
    "CPU": 0,
    "MemoryMB": 0,
    "DiskMB": 0,
    "IOPS": 0,
    "Networks": null
  },
  "Links": {

  },
  "Meta": {
    "web": "1"
  },
  "NodeClass": "web",
  "ComputedClass": "v1:1445644767665653020",
  "Drain": false,
  "Status": "ready",
  "StatusDescription": "",
  "StatusUpdatedAt": 1476548513,
  "CreateIndex": 22,
  "ModifyIndex": 12451
}

Example allocation from the server

{
  "ID": "b38a1355-f949-fa7b-1271-06fff182e6c2",
  "EvalID": "e1b5e417-f82e-0347-1d8a-e5344eb5d80e",
  "Name": "insights-web.php-fpm[0]",
  "NodeID": "9ff0ea83-ede6-9143-adca-aaed5c3e6553",
  "JobID": "insights-web",
  "Job": {
    "Region": "global",
    "ID": "insights-web",
    "ParentID": "",
    "Name": "insights-web",
    "Type": "system",
    "Priority": 50,
    "AllAtOnce": false,
    "Datacenters": [
      "production"
    ],
    "Constraints": [
      {
        "LTarget": "${meta.web}",
        "RTarget": "1",
        "Operand": "="
      },
      {
        "LTarget": "",
        "RTarget": "",
        "Operand": "distinct_hosts"
      }
    ],
    "TaskGroups": [
      {
        "Name": "php-fpm",
        "Count": 1,
        "Constraints": null,
        "RestartPolicy": {
          "Attempts": 2,
          "Interval": 60000000000,
          "Delay": 15000000000,
          "Mode": "delay"
        },
        "Tasks": [
          {
            "Name": "server",
            "Driver": "raw_exec",
            "User": "www-data",
            "Config": {
              "args": [
                "--fpm-config=/etc/bownty/insights/php-fpm/manager.conf"
              ],
              "command": "/usr/sbin/php-fpm7.0"
            },
            "Env": null,
            "Services": [
              {
                "Name": "insights-web-php-fpm-server",
                "PortLabel": "",
                "Tags": null,
                "Checks": null
              }
            ],
            "Constraints": null,
            "Resources": {
              "CPU": 500,
              "MemoryMB": 128,
              "DiskMB": 300,
              "IOPS": 0,
              "Networks": null
            },
            "Meta": null,
            "KillTimeout": 5000000000,
            "LogConfig": {
              "MaxFiles": 10,
              "MaxFileSizeMB": 10
            },
            "Artifacts": null
          }
        ],
        "Meta": null
      }
    ],
    "Update": {
      "Stagger": 10000000000,
      "MaxParallel": 1
    },
    "Periodic": null,
    "Meta": null,
    "Status": "running",
    "StatusDescription": "",
    "CreateIndex": 91,
    "ModifyIndex": 99,
    "JobModifyIndex": 91
  },
  "TaskGroup": "php-fpm",
  "Resources": {
    "CPU": 500,
    "MemoryMB": 128,
    "DiskMB": 300,
    "IOPS": 0,
    "Networks": null
  },
  "TaskResources": {
    "server": {
      "CPU": 500,
      "MemoryMB": 128,
      "DiskMB": 300,
      "IOPS": 0,
      "Networks": null
    }
  },
  "Metrics": {
    "NodesEvaluated": 1,
    "NodesFiltered": 0,
    "NodesAvailable": {
      "production": 6
    },
    "ClassFiltered": null,
    "ConstraintFiltered": null,
    "NodesExhausted": 0,
    "ClassExhausted": null,
    "DimensionExhausted": null,
    "Scores": {
      "9ff0ea83-ede6-9143-adca-aaed5c3e6553.binpack": 0.479497471987969
    },
    "AllocationTime": 48246,
    "CoalescedFailures": 0
  },
  "DesiredStatus": "stop",
  "DesiredDescription": "alloc is lost since its node is down",
  "ClientStatus": "failed",
  "ClientDescription": "",
  "TaskStates": {
    "server": {
      "State": "dead",
      "Events": [
        {
          "Type": "Received",
          "Time": 1476434000544104192,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Started",
          "Time": 1476434000557040481,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Terminated",
          "Time": 1476547483098828050,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "unexpected EOF",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Restarting",
          "Time": 1476547541518148797,
          "RestartReason": "Restart within policy",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 17869710516,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Driver Failure",
          "Time": 1476547619247235399,
          "RestartReason": "",
          "DriverError": "failed to start task 'server' for alloc 'b38a1355-f949-fa7b-1271-06fff182e6c2': unable to dispense the executor plugin: EOF",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Not Restarting",
          "Time": 1476547619247389681,
          "RestartReason": "Error was unrecoverable",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        }
      ]
    }
  },
  "CreateIndex": 5301,
  "ModifyIndex": 12460,
  "AllocModifyIndex": 12418,
  "CreateTime": 1476434000485948951
}
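The `Time` values in the task events above are nanoseconds since the Unix epoch, which makes the sequence hard to read at a glance. The following is a minimal sketch, not part of the original report, that converts the six timestamps copied from the allocation into UTC times and shows the gap between consecutive events. The helper names `to_utc` and `timeline` are illustrative only.

```python
from datetime import datetime, timezone

# (Type, Time in nanoseconds since the Unix epoch), copied from the
# TaskStates events of the example allocation.
EVENTS = [
    ("Received", 1476434000544104192),
    ("Started", 1476434000557040481),
    ("Terminated", 1476547483098828050),
    ("Restarting", 1476547541518148797),
    ("Driver Failure", 1476547619247235399),
    ("Not Restarting", 1476547619247389681),
]

NS_PER_SECOND = 1_000_000_000


def to_utc(ns: int) -> datetime:
    """Convert nanoseconds since the epoch to a UTC datetime (second precision)."""
    return datetime.fromtimestamp(ns // NS_PER_SECOND, tz=timezone.utc)


def timeline(events):
    """Return (type, UTC time, seconds since the previous event) tuples."""
    rows = []
    previous = None
    for kind, ns in events:
        delta = 0.0 if previous is None else (ns - previous) / NS_PER_SECOND
        rows.append((kind, to_utc(ns), delta))
        previous = ns
    return rows


if __name__ == "__main__":
    for kind, when, delta in timeline(EVENTS):
        print(f"{when:%Y-%m-%d %H:%M:%S} UTC  +{delta:>12.3f}s  {kind}")
```

Read this way, the task ran for roughly 31.5 hours before the "unexpected EOF" termination, and the unrecoverable driver failure followed a little over two minutes later.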

[Image: observed from Datadog]

[Image: observed from NewRelic (1)]

[Image: observed from NewRelic (2)]

The NewRelic data includes both the Nomad agent and the various Nomad executor instances; I'm unable to split them apart.
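One way to separate agent memory from executor memory, independent of the monitoring tool, is to read resident set sizes straight from `/proc` on the affected Linux box. The sketch below is an illustration rather than anything used in this thread. It assumes the binary is named `nomad` and that executor processes are started as `nomad executor ...`; that command-line shape is an assumption and should be checked against `ps` output on the host. The function names are hypothetical.

```python
import os
from collections import defaultdict


def classify(cmdline: bytes) -> str:
    """Label a process from its NUL-separated /proc/<pid>/cmdline contents."""
    args = [a.decode(errors="replace") for a in cmdline.split(b"\0") if a]
    if not args or os.path.basename(args[0]) != "nomad":
        return "other"
    # Assumption: executors are launched as `nomad executor ...`.
    if len(args) > 1 and args[1] == "executor":
        return "executor"
    return "agent"


def rss_kb(status_text: str) -> int:
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status; 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def nomad_rss_by_kind(proc_root: str = "/proc") -> dict:
    """Sum resident memory (kB) of nomad processes, grouped by kind."""
    totals = defaultdict(int)
    if not os.path.isdir(proc_root):
        return {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                kind = classify(f.read())
            if kind == "other":
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                totals[kind] += rss_kb(f.read())
        except OSError:
            continue  # the process exited or is not readable
    return dict(totals)


if __name__ == "__main__":
    for kind, kb in sorted(nomad_rss_by_kind().items()):
        print(f"{kind}: {kb / 1024:.1f} MiB")
```

Running this periodically (for example from cron) and logging the two totals would show whether the growth comes from the agent or from the executors.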

jippi changed the title from "nomad allocate all RAM on server and crashes" to "nomad allocates all RAM on server and crashes the box" on Oct 15, 2016

dadgar commented Oct 15, 2016

@jippi Fairly positive you hit this bug: #1762.

A short-term fix would be to use exec instead of raw_exec. I also suggest you reserve some CPU and memory on the nodes; otherwise you are allowing Nomad to allocate the whole machine's memory.
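To make the reservation point concrete: the node JSON earlier in this issue reports `Reserved` as all zeros, so every megabyte in `Resources` is available for placements. The sketch below is a simplified illustration, assuming the scheduler treats `Reserved` as unavailable and simply subtracts it from `Resources`. The `allocatable` helper is hypothetical, and the second node uses the 2.5GHz / 512MB reservation mentioned later in the thread, expressed as 2500 MHz and 512 MB.

```python
# Values copied from the "Resources" and "Reserved" sections of the node JSON.
NODE = {
    "Resources": {"CPU": 27992, "MemoryMB": 32014, "DiskMB": 1681674},
    "Reserved": {"CPU": 0, "MemoryMB": 0, "DiskMB": 0},
}

# Same node with a hypothetical reservation of 2500 MHz CPU and 512 MB RAM.
NODE_WITH_RESERVATION = {
    "Resources": NODE["Resources"],
    "Reserved": {"CPU": 2500, "MemoryMB": 512, "DiskMB": 0},
}


def allocatable(node: dict) -> dict:
    """Return what is left for placements after subtracting Reserved."""
    reserved = node.get("Reserved") or {}
    return {
        key: total - reserved.get(key, 0)
        for key, total in node["Resources"].items()
    }


if __name__ == "__main__":
    print("as reported:     ", allocatable(NODE))
    print("with reservation:", allocatable(NODE_WITH_RESERVATION))
```

Note that a reservation only limits what the scheduler will place on the node. It does not cap the memory of the Nomad agent or of raw_exec tasks themselves.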


jippi commented Oct 15, 2016

@dadgar okay, I've reserved some CPU / RAM for Nomad now (2.5GHz and 512MB).

Any ETA for a release containing that fix? Also, any suggestion on how I could verify that it's indeed that issue?


dadgar commented Oct 18, 2016

@jippi Did you end up verifying? Hopefully a release will be out in 1-2 weeks.


jippi commented Oct 18, 2016

@dadgar the error happened again today, even though I set an allocation limit on Nomad.

I'm honestly not good enough at Go to be confident that a custom build would be production grade. If you have time, it would be amazing to get an amd64 Linux build with the cherry-picked commit so I can test it out, or guidance on how to make a production-grade build for amd64 :)

The super odd thing is that only two out of 7 boxes have the issue. Same kernel version and everything; the only difference is physical hardware vs. virtual KVM server.


jippi commented Oct 21, 2016

@dadgar since I cherry-picked the commit you suggested from #1762, I've not observed the issue! :)

jippi closed this as completed on Oct 21, 2016