Nomad doesn't cleanup some allocs that must be garbage collected #4287

Closed
tantra35 opened this issue May 11, 2018 · 12 comments

@tantra35
Contributor

tantra35 commented May 11, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

Issue

Some of our jobs have allocations that are never cleaned up, and we can't remove them even by forcing a GC.

For example, for the job githubproxy-branches:

$ nomad status githubproxy-branches
ID            = githubproxy-branches
Name          = githubproxy-branches
Submit Date   = 2018-05-11T11:03:26+03:00
Type          = service
Priority      = 50
Datacenters   = ptz,msc,ivn,klg,spb,kv,krv,rsv
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group            Queued  Starting  Running  Failed  Complete  Lost
githubproxy-branches  0       0         8        9       45        1

Latest Deployment
ID          = 9e267371
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group            Desired  Placed  Healthy  Unhealthy
githubproxy-branches  8        8       7        1

Allocations
ID        Node ID   Task Group            Version  Desired  Status   Created     Modified
a2416fbc  f632bcd6  githubproxy-branches  15       run      running  1h48m ago   1h47m ago
1cbd7372  b57b72c4  githubproxy-branches  15       run      running  1h48m ago   1h47m ago
2a6578ac  397f61c5  githubproxy-branches  15       run      running  1h48m ago   1h48m ago
437bd5a5  e432b04f  githubproxy-branches  15       run      running  1h48m ago   1h47m ago
61a44a61  b6f0e522  githubproxy-branches  15       run      running  1h48m ago   1h41m ago
df95a7f0  063f8506  githubproxy-branches  15       run      running  1h48m ago   1h47m ago
d7ac710e  83cb2016  githubproxy-branches  15       run      running  1h48m ago   1h48m ago
be666d9c  696fcaee  githubproxy-branches  15       run      running  1h48m ago   1h47m ago
d036c82e  83cb2016  githubproxy-branches  11       stop     failed   20h46m ago  1h49m ago
1325925c  063f8506  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
9013e267  f632bcd6  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
8e04c7dd  e432b04f  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
860ac7a5  b57b72c4  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
8544ab5c  83cb2016  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
f1a884c4  696fcaee  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago
f4ed77f6  397f61c5  githubproxy-branches  0        stop     failed   6d18h ago   1h49m ago

As you can see, there are 8 allocations in the stop state that should be garbage collected, but they are not. Sometimes their modify time gets updated, so the listing above shows only 1h49m ago in the Modified column, but these allocations have been in the stop state for more than a day. Our GC threshold is the default (4 hours).
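To pick out the affected rows without reading the table by eye, the short `nomad status` listing can be filtered with a small script. This is only a sketch: the function name `stuck_allocs` is made up for illustration, and it assumes the eight-column layout shown above with columns separated by two or more spaces.

```python
import re


def stuck_allocs(status_output):
    """Return (alloc_id, node_id, modified) for rows with Desired=stop and Status=failed.

    Assumes the short `nomad status <job>` layout from this report:
    ID, Node ID, Task Group, Version, Desired, Status, Created, Modified.
    """
    stuck = []
    in_allocs = False
    for line in status_output.splitlines():
        if line.strip() == "Allocations":
            in_allocs = True
            continue
        if not in_allocs or not line.strip():
            continue
        # Columns are padded with at least two spaces; "1h49m ago" has only one.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) != 8 or cols[0] == "ID":
            continue
        alloc_id, node_id, _group, _version, desired, status, _created, modified = cols
        if desired == "stop" and status == "failed":
            stuck.append((alloc_id, node_id, modified))
    return stuck
```

The input can be the captured output of `nomad status githubproxy-branches`; for the listing above it would return the eight stop/failed rows.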

If we run GC manually

curl -XPUT -H"X-Nomad-Token:<token>" nomad.service.consul:4646/v1/system/gc

absolutely nothing changes, and the stopped allocations are still present in nomad status githubproxy-branches. When we run nomad status for those allocations, we get an error (for example, for allocation d036c82e):

$ nomad status d036c82e
Error querying deployment "211efaaf-0bdb-179c-239a-8c87dae6764f": Unexpected response code: 404 (deployment not found)

It is also strange that on a few of the Nomad agents that hosted the stopped allocations, the allocs dir no longer contains any of the allocations that should be GC'ed. So I conclude that GC did run on those nodes, but the server side did not register it.
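The 404 above suggests the allocation still references a deployment that the servers no longer have. A rough way to check this for any allocation is sketched below. It assumes the server address from the curl command, the `X-Nomad-Token` header used there, and the `/v1/allocation/<id>` and `/v1/deployment/<id>` read endpoints; the helper names `fetch_json` and `dangling_deployment` are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Address taken from the curl command in this report; adjust for your cluster.
NOMAD_ADDR = "http://nomad.service.consul:4646"


def fetch_json(path, token, method="GET"):
    """Call the Nomad HTTP API and return the decoded JSON body (None if empty)."""
    req = urllib.request.Request(
        NOMAD_ADDR + path, method=method, headers={"X-Nomad-Token": token}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def dangling_deployment(alloc_id, token, fetch=fetch_json):
    """Return the alloc's DeploymentID if the server answers 404 for it, else None."""
    alloc = fetch(f"/v1/allocation/{alloc_id}", token)
    dep_id = alloc.get("DeploymentID")
    if not dep_id:
        return None
    try:
        fetch(f"/v1/deployment/{dep_id}", token)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return dep_id
        raise  # any other HTTP error is not what we are looking for
    return None
```

The same `fetch_json` helper can trigger the manual GC with `fetch_json("/v1/system/gc", token, method="PUT")`, which mirrors the curl call above. The `fetch` parameter exists only so the check can be exercised without a live cluster.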

tantra35 changed the title from "Nomad doesn't cleanup some gc allocs" to "Nomad doesn't cleanup some allocs that must be garbage collected" on May 11, 2018
@preetapan
Contributor

@tantra35 We did make some changes to the garbage collection logic in 0.8 to make sure that failed allocations that are not yet replaced are not GCed. To help us debug this issue, could you also share your job specification file, and the output of /v1/allocation/<allocation_id> of one of the allocs that did not get GCed?

@tantra35
Contributor Author

tantra35 commented May 11, 2018

@preetapan

Here is our job file with the problem allocations:

job "githubproxy-branches"
{

  datacenters = [
    "ptz",
    "msc",
    "ivn",
    "klg",
    "spb",
    "kv",
    "krv",
    "rsv"
  ]
  priority = 50

  constraint
  {
    distinct_hosts = true
  }

  constraint {
    attribute = "${attr.kernel.name}"
    value = "linux"
  }

  constraint {
    attribute = "${node.class}"
    value = "branch"
    distinct_hosts = true
  }

  update
  {
    stagger = "10s"
    max_parallel = 1
  }

  group "githubproxy-branches"
  {
    count = 8

    task "githubproxy-branches"
    {
      driver = "docker"

      artifact
      {
        source = "http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz"

        options
        {
          archive = false
        }
      }

      config
      {
        image = "playrix/githubproxy-lighttpd:p07"
        load = "playrix-githubproxy-lighttpd-p07.tar.gz"
        network_mode = "host"
        command = "/init.sh"

        volumes = [
          "/srv/git:/srv/git",
        ]

        port_map
        {
          lighttpd = 80
          lighttpd2 = 8080
        }
      }

      vault { policies = ["service_gitproxy"] }

      template
      {
        data = <<EOH
{{with secret "secrets/service/local/gitproxy"}}
{{.Data.value }}
{{end}}
        EOH
        destination = "secrets/github"
      }


      logs {
        max_files = 3
        max_file_size = 10
      }

      resources {

        network {
          port "lighttpd2"
          {
            static = "8080"
          }
          port "lighttpd"
          {
            static = "80"
          }
        }
      }
    }
  }
}

$ nomad job status -verbose githubproxy-branches
ID            = githubproxy-branches
Name          = githubproxy-branches
Submit Date   = 2018-05-11T11:03:26+03:00
Type          = service
Priority      = 50
Datacenters   = ptz,msc,ivn,klg,spb,kv,krv,rsv
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group            Queued  Starting  Running  Failed  Complete  Lost
githubproxy-branches  0       0         8        9       45        1

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
314e6043-a409-3419-8dbe-81f8321dd1c8  50        job-register  complete  false
110d45c3-ff3a-6f41-4485-32e4b0c1900d  50        job-register  complete  false
6f2260e4-84ea-37bd-cfd3-073442431be7  50        job-register  complete  false

Latest Deployment
ID          = 9e267371-c1e6-6ae8-9a7c-0e2eeadc4b49
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group            Desired  Placed  Healthy  Unhealthy
githubproxy-branches  8        8       7        1

Allocations
ID                                    Eval ID                               Node ID                               Task Group            Version  Desired  Status   Created                    Modified
a2416fbc-8fa8-a5d8-0328-96b5289ac224  314e6043-a409-3419-8dbe-81f8321dd1c8  f632bcd6-3e61-0d00-076d-d944590c61f2  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:04:18+03:00
1cbd7372-b506-7d6f-f80c-caf6a28677fa  314e6043-a409-3419-8dbe-81f8321dd1c8  b57b72c4-8589-84a3-4690-b3d5fae86b34  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:04:43+03:00
2a6578ac-3d97-8b8b-377b-5ced0706b79e  314e6043-a409-3419-8dbe-81f8321dd1c8  397f61c5-be14-7855-f781-07dc1fcf7967  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:04:05+03:00
437bd5a5-041f-47a6-02d1-349dcf182f8d  314e6043-a409-3419-8dbe-81f8321dd1c8  e432b04f-7315-f2d4-07ca-4e2282f04b59  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:04:30+03:00
61a44a61-fa41-de10-d0c7-d0f52830aa49  314e6043-a409-3419-8dbe-81f8321dd1c8  b6f0e522-58f1-64ba-7697-a8107bd4f93a  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:10:21+03:00
df95a7f0-dd07-b16f-a8df-806505a6e1e2  314e6043-a409-3419-8dbe-81f8321dd1c8  063f8506-6642-3490-cca1-48c61e4d3cec  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:05:08+03:00
d7ac710e-6886-b41d-7a82-38c0615c8a97  314e6043-a409-3419-8dbe-81f8321dd1c8  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:04:11+03:00
be666d9c-76e7-3a28-ed7b-1b0fbba50dcd  314e6043-a409-3419-8dbe-81f8321dd1c8  696fcaee-8e9c-7ec8-7de0-f16f7fc90c08  githubproxy-branches  15       run      running  2018-05-11T11:03:26+03:00  2018-05-11T11:05:03+03:00
d036c82e-433b-84f1-fd79-c7d52845e402  110d45c3-ff3a-6f41-4485-32e4b0c1900d  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  11       stop     failed   2018-05-10T16:06:00+03:00  2018-05-11T11:02:27+03:00
1325925c-8483-3d58-5978-c28cb44e101e  6f2260e4-84ea-37bd-cfd3-073442431be7  063f8506-6642-3490-cca1-48c61e4d3cec  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
9013e267-c78b-daf6-cf9e-7e635946a06d  6f2260e4-84ea-37bd-cfd3-073442431be7  f632bcd6-3e61-0d00-076d-d944590c61f2  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
8e04c7dd-7106-5d82-b947-c01249c4479a  6f2260e4-84ea-37bd-cfd3-073442431be7  e432b04f-7315-f2d4-07ca-4e2282f04b59  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
860ac7a5-3528-778f-2413-223e783b63da  6f2260e4-84ea-37bd-cfd3-073442431be7  b57b72c4-8589-84a3-4690-b3d5fae86b34  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
8544ab5c-9a67-14c4-5694-73383fb1822b  6f2260e4-84ea-37bd-cfd3-073442431be7  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
f1a884c4-08f3-6f27-eace-87c72c7ae4ab  6f2260e4-84ea-37bd-cfd3-073442431be7  696fcaee-8e9c-7ec8-7de0-f16f7fc90c08  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00
f4ed77f6-e198-30cd-7047-6f2df5092a95  6f2260e4-84ea-37bd-cfd3-073442431be7  397f61c5-be14-7855-f781-07dc1fcf7967  githubproxy-branches  0        stop     failed   2018-05-04T18:15:41+03:00  2018-05-11T11:02:27+03:00

Result of the curl call:

curl -H"X-Nomad-Token:<token>" nomad.service.consul:4646/v1/allocation/d036c82e-433b-84f1-fd79-c7d52845e402


{
  "AllocModifyIndex": 3797345,
  "ClientDescription": "",
  "ClientStatus": "failed",
  "CreateIndex": 3784155,
  "CreateTime": 1525957560710384600,
  "DeploymentID": "211efaaf-0bdb-179c-239a-8c87dae6764f",
  "DeploymentStatus": {
    "Healthy": false,
    "ModifyIndex": 3784319
  },
  "DesiredDescription": "alloc not needed due to job update",
  "DesiredStatus": "stop",
  "DesiredTransition": {
    "Migrate": null
  },
  "EvalID": "110d45c3-ff3a-6f41-4485-32e4b0c1900d",
  "FollowupEvalID": "",
  "ID": "d036c82e-433b-84f1-fd79-c7d52845e402",
  "Job": {
    "AllAtOnce": false,
    "Constraints": [
      {
        "LTarget": "",
        "Operand": "distinct_hosts",
        "RTarget": ""
      },
      {
        "LTarget": "${attr.kernel.name}",
        "Operand": "=",
        "RTarget": "linux"
      },
      {
        "LTarget": "${node.class}",
        "Operand": "distinct_hosts",
        "RTarget": "branch"
      }
    ],
    "CreateIndex": 3700809,
    "Datacenters": [
      "ptz",
      "msc",
      "ivn",
      "klg",
      "spb",
      "kv",
      "krv",
      "rsv"
    ],
    "ID": "githubproxy-branches",
    "JobModifyIndex": 3784153,
    "Meta": null,
    "ModifyIndex": 3784154,
    "Name": "githubproxy-branches",
    "Namespace": "default",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Stable": false,
    "Status": "pending",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1525957560448694800,
    "TaskGroups": [
      {
        "Constraints": [
          {
            "LTarget": "${attr.vault.version}",
            "Operand": "version",
            "RTarget": ">= 0.6.1"
          }
        ],
        "Count": 8,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Migrate": {
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000
        },
        "Name": "githubproxy-branches",
        "ReschedulePolicy": {
          "Attempts": 0,
          "Delay": 30000000000,
          "DelayFunction": "exponential",
          "Interval": 0,
          "MaxDelay": 3600000000000,
          "Unlimited": true
        },
        "RestartPolicy": {
          "Attempts": 2,
          "Delay": 15000000000,
          "Interval": 1800000000000,
          "Mode": "fail"
        },
        "Tasks": [
          {
            "Artifacts": [
              {
                "GetterMode": "any",
                "GetterOptions": {
                  "archive": "0"
                },
                "GetterSource": "http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz",
                "RelativeDest": "local/"
              }
            ],
            "Config": {
              "network_mode": "host",
              "port_map": [
                {
                  "lighttpd": 80,
                  "lighttpd2": 8080
                }
              ],
              "volumes": [
                "/srv/git:/srv/git"
              ],
              "command": "/init.sh",
              "image": "playrix/githubproxy-lighttpd:p07",
              "load": "playrix-githubproxy-lighttpd-p07.tar.gz"
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "docker",
            "Env": null,
            "KillSignal": "",
            "KillTimeout": 5000000000,
            "Leader": false,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 3
            },
            "Meta": null,
            "Name": "githubproxy-branches",
            "Resources": {
              "CPU": 3000,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 3000,
              "Networks": [
                {
                  "CIDR": "",
                  "Device": "",
                  "DynamicPorts": null,
                  "IP": "",
                  "MBits": 10,
                  "ReservedPorts": [
                    {
                      "Label": "lighttpd2",
                      "Value": 8080
                    },
                    {
                      "Label": "lighttpd",
                      "Value": 80
                    }
                  ]
                }
              ]
            },
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": [
              {
                "ChangeMode": "restart",
                "ChangeSignal": "",
                "DestPath": "secrets/github",
                "EmbeddedTmpl": "{{with secret \"secrets/service/local/gitproxy\"}}\n{{.Data.value }}\n{{end}}\n        ",
                "Envvars": false,
                "LeftDelim": "{{",
                "Perms": "0644",
                "RightDelim": "}}",
                "SourcePath": "",
                "Splay": 5000000000,
                "VaultGrace": 15000000000
              }
            ],
            "User": "",
            "Vault": {
              "ChangeMode": "restart",
              "ChangeSignal": "SIGHUP",
              "Env": true,
              "Policies": [
                "service_gitproxy"
              ]
            }
          }
        ],
        "Update": {
          "AutoRevert": false,
          "Canary": 0,
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000,
          "Stagger": 10000000000
        }
      }
    ],
    "Type": "service",
    "Update": {
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 1,
      "MinHealthyTime": 0,
      "Stagger": 10000000000
    },
    "VaultToken": "",
    "Version": 11
  },
  "JobID": "githubproxy-branches",
  "Metrics": {
    "AllocationTime": 105546,
    "ClassExhausted": null,
    "ClassFiltered": null,
    "CoalescedFailures": 0,
    "ConstraintFiltered": null,
    "DimensionExhausted": null,
    "NodesAvailable": {
      "ivn": 1,
      "kv": 1,
      "spb": 1,
      "ptz": 1,
      "rsv": 1,
      "klg": 1,
      "krv": 1,
      "msc": 1
    },
    "NodesEvaluated": 3,
    "NodesExhausted": 0,
    "NodesFiltered": 0,
    "QuotaExhausted": null,
    "Scores": {
      "696fcaee-8e9c-7ec8-7de0-f16f7fc90c08.binpack": 15.938406874563274,
      "83cb2016-6427-592d-d659-c3fc907f4808.binpack": 16.475762863388052,
      "397f61c5-be14-7855-f781-07dc1fcf7967.binpack": 15.762788669670627
    }
  },
  "ModifyIndex": 3797345,
  "ModifyTime": 1526025747119507500,
  "Name": "githubproxy-branches.githubproxy-branches[0]",
  "Namespace": "default",
  "NextAllocation": "",
  "NodeID": "83cb2016-6427-592d-d659-c3fc907f4808",
  "PreviousAllocation": "",
  "RescheduleTracker": null,
  "Resources": {
    "CPU": 3000,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 3000,
    "Networks": [
      {
        "CIDR": "",
        "Device": "enp0s10f0",
        "DynamicPorts": null,
        "IP": "172.16.80.22",
        "MBits": 10,
        "ReservedPorts": [
          {
            "Label": "lighttpd2",
            "Value": 8080
          },
          {
            "Label": "lighttpd",
            "Value": 80
          }
        ]
      }
    ]
  },
  "SharedResources": {
    "CPU": 0,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 0,
    "Networks": null
  },
  "TaskGroup": "githubproxy-branches",
  "TaskResources": {
    "githubproxy-branches": {
      "CPU": 3000,
      "DiskMB": 0,
      "IOPS": 0,
      "MemoryMB": 3000,
      "Networks": [
        {
          "CIDR": "",
          "Device": "enp0s10f0",
          "DynamicPorts": null,
          "IP": "172.16.80.22",
          "MBits": 10,
          "ReservedPorts": [
            {
              "Label": "lighttpd2",
              "Value": 8080
            },
            {
              "Label": "lighttpd",
              "Value": 80
            }
          ]
        }
      ]
    }
  },
  "TaskStates": {
    "githubproxy-branches": {
      "Events": [
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Client is downloading artifacts",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957564110092800,
          "Type": "Downloading Artifacts",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "message": "Task not running by deadline"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Task not running by deadline",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "Task not running by deadline",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957860996673000,
          "Type": "Alloc Unhealthy",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "download_error": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": read tcp 172.16.80.22:59810->172.16.9.35:80: read: connection timed out"
          },
          "DiskLimit": 0,
          "DisplayMessage": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": read tcp 172.16.80.22:59810->172.16.9.35:80: read: connection timed out",
          "DownloadError": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": read tcp 172.16.80.22:59810->172.16.9.35:80: read: connection timed out",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957871817091300,
          "Type": "Failed Artifact Download",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "start_delay": "18553794506",
            "restart_reason": "Restart within policy"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Task restarting in 18.553794506s",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "Restart within policy",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 18553794506,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957871817129700,
          "Type": "Restarting",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Client is downloading artifacts",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957890371278800,
          "Type": "Downloading Artifacts",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "download_error": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution"
          },
          "DiskLimit": 0,
          "DisplayMessage": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution",
          "DownloadError": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957910388891100,
          "Type": "Failed Artifact Download",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "restart_reason": "Restart within policy",
            "start_delay": "17896057074"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Task restarting in 17.896057074s",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "Restart within policy",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 17896057074,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957910388940800,
          "Type": "Restarting",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Client is downloading artifacts",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957928285346600,
          "Type": "Downloading Artifacts",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "download_error": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution"
          },
          "DiskLimit": 0,
          "DisplayMessage": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution",
          "DownloadError": "failed to download artifact \"http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz\": Get http://docker.playrix.local/playrix-githubproxy-lighttpd-p07.tar.gz: dial tcp: lookup docker.playrix.local: Temporary failure in name resolution",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957948301453800,
          "Type": "Failed Artifact Download",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "restart_reason": "Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"",
            "fails_task": "true"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": true,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1525957948301501400,
          "Type": "Not Restarting",
          "ValidationError": "",
          "VaultError": ""
        }
      ],
      "Failed": true,
      "FinishedAt": "2018-05-10T13:12:28.30151046Z",
      "LastRestart": "2018-05-10T16:11:50.388940847+03:00",
      "Restarts": 2,
      "StartedAt": "0001-01-01T00:00:00Z",
      "State": "dead"
    }
  }
}
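The `CreateTime` and `ModifyTime` fields in this response are nanoseconds since the Unix epoch, which makes them hard to compare against the 4-hour GC threshold by eye. The sketch below converts them and summarizes the fields that seem relevant here. The function names are illustrative, and the summary is only a reading aid, not Nomad's actual GC eligibility rule.

```python
from datetime import datetime, timedelta, timezone

# Default GC threshold mentioned earlier in this report (an assumption for other setups).
GC_THRESHOLD = timedelta(hours=4)


def ns_to_utc(ns):
    """Convert a nanosecond Unix timestamp to an aware UTC datetime (second precision)."""
    return datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)


def summarize(alloc, now):
    """Summarize an allocation dict as returned by /v1/allocation/<id>."""
    modified = ns_to_utc(alloc["ModifyTime"])
    return {
        "created": ns_to_utc(alloc["CreateTime"]).isoformat(),
        "modified": modified.isoformat(),
        "desired": alloc["DesiredStatus"],
        "client": alloc["ClientStatus"],
        # An empty NextAllocation means no replacement alloc is recorded.
        "replaced": bool(alloc["NextAllocation"]),
        # True once the last modification is older than the threshold.
        "past_threshold": now - modified > GC_THRESHOLD,
    }
```

For the response above, `created` comes out as 2018-05-10T13:06:00+00:00 and `modified` as 2018-05-11T08:02:27+00:00, which match the Created and Modified columns of the verbose job status (shown there in +03:00).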

@tantra35
Contributor Author

It seems that in our case none of the allocations in the failed state are subject to GC.

@preetapan
Contributor

preetapan commented May 21, 2018

@tantra35 just got back to investigating this after a week's break. I tried using a somewhat modified version of your job spec and so far I haven't been able to reproduce.

Could you also provide us the exact steps you took to get nomad into this state where the failed allocs don't GC? Please provide as much detail as possible, especially about specific commands run before the GC.

Also helpful to debug would be the output of nomad eval status for the eval ID in the EvalID field of the allocation that failed and was never GCed. In the above example alloc it would be 110d45c3-ff3a-6f41-4485-32e4b0c1900d

@tantra35
Contributor Author

tantra35 commented May 21, 2018

@preetapan for now we have stopped the job that had the failed, un-GC'ed allocations and launched a new one. Before launching it we ran a GC to make sure everything was cleared out, then waited for the deployment to fully complete. The allocations now look like this:

ruslan@ruslan:~$ nomad job status -verbose githubproxy-branches
ID            = githubproxy-branches
Name          = githubproxy-branches
Submit Date   = 2018-05-14T15:08:57+03:00
Type          = service
Priority      = 50
Datacenters   = ptz,msc,ivn,klg,spb,kv,krv,rsv
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group            Queued  Starting  Running  Failed  Complete  Lost
githubproxy-branches  0       0         8        96      17        1

Evaluations
ID                                    Priority  Triggered By   Status    Placement Failures
bc549b15-ebb1-7cbe-0db6-c01816eb0e79  50        alloc-failure  complete  false
63018331-f3a1-c455-1c99-0dc4b89c9838  50        alloc-failure  complete  false
e0fb0074-4328-3fdf-1417-dd395d6d7853  50        alloc-failure  complete  false
8c7ebeef-d455-0c3c-9769-559901c4c650  50        alloc-failure  complete  false
120e1feb-9c15-06ad-f39d-fa083c34d3d2  50        alloc-failure  complete  false
9f1cb25f-038d-a8dd-229b-f01a99970f11  50        alloc-failure  complete  false
89e07d3d-33e0-c939-7856-a68a8b3e7c4f  50        node-update    complete  false
f3385086-31b8-2032-3c04-2cabacd6589d  50        node-update    complete  false
f50dc68d-9050-4dbf-162c-f417dc1a2929  50        job-register   complete  false

Latest Deployment
ID          = 2e862f0e-01e1-fb9c-8999-371de7dc23c0
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group            Desired  Placed  Healthy  Unhealthy
githubproxy-branches  8        8       5        3

Allocations
ID                                    Eval ID                               Node ID                               Task Group            Version  Desired  Status   Created                    Modified
06203c42-3bd8-6382-3fb7-ffa1cd13349c  bc549b15-ebb1-7cbe-0db6-c01816eb0e79  063f8506-6642-3490-cca1-48c61e4d3cec  githubproxy-branches  0        run      running  2018-05-21T15:29:17+03:00  2018-05-21T15:31:04+03:00
a7338298-efb8-559c-6495-47a6fc24159d  63018331-f3a1-c455-1c99-0dc4b89c9838  397f61c5-be14-7855-f781-07dc1fcf7967  githubproxy-branches  0        run      running  2018-05-21T13:56:32+03:00  2018-05-21T13:57:24+03:00
3a7d2ab7-3ff3-74c9-7a79-9a5d71bb2ae0  e0fb0074-4328-3fdf-1417-dd395d6d7853  696fcaee-8e9c-7ec8-7de0-f16f7fc90c08  githubproxy-branches  0        run      running  2018-05-21T09:10:02+03:00  2018-05-21T09:11:53+03:00
3bb92d33-f573-85c2-f832-22c7e0655619  8c7ebeef-d455-0c3c-9769-559901c4c650  b6f0e522-58f1-64ba-7697-a8107bd4f93a  githubproxy-branches  0        run      running  2018-05-21T07:23:29+03:00  2018-05-21T07:28:25+03:00
a07bdc8e-3f96-6f90-5f88-1b17e0b1cebb  120e1feb-9c15-06ad-f39d-fa083c34d3d2  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  0        run      running  2018-05-20T09:48:23+03:00  2018-05-20T09:49:02+03:00
b7e6501c-1a11-1553-268b-f5701a03a92b  9f1cb25f-038d-a8dd-229b-f01a99970f11  f632bcd6-3e61-0d00-076d-d944590c61f2  githubproxy-branches  0        run      running  2018-05-18T18:11:04+03:00  2018-05-18T18:11:45+03:00
8e51a8a7-25f2-58b7-4d81-8fb01b64b59c  89e07d3d-33e0-c939-7856-a68a8b3e7c4f  b6f0e522-58f1-64ba-7697-a8107bd4f93a  githubproxy-branches  0        stop     failed   2018-05-18T00:25:35+03:00  2018-05-18T00:44:09+03:00
171a3f44-5e4f-c660-8bcd-105b4f1ff678  f3385086-31b8-2032-3c04-2cabacd6589d  b57b72c4-8589-84a3-4690-b3d5fae86b34  githubproxy-branches  0        run      running  2018-05-15T23:10:20+03:00  2018-05-15T23:11:24+03:00
bf117538-c6e6-823b-d8c0-8cd0ba902e9c  f50dc68d-9050-4dbf-162c-f417dc1a2929  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  0        stop     failed   2018-05-14T15:08:57+03:00  2018-05-16T17:36:38+03:00
f5a8a7e3-4a83-eb37-c6e2-99a5ac89d14d  f50dc68d-9050-4dbf-162c-f417dc1a2929  e432b04f-7315-f2d4-07ca-4e2282f04b59  githubproxy-branches  0        run      running  2018-05-14T15:08:57+03:00  2018-05-14T15:10:00+03:00
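
For anyone triaging the same symptom, the following is a minimal Python sketch (not part of Nomad) that picks out the allocations that are `failed` with desired state `stop` from verbose `nomad job status` output, together with their Eval IDs and how long ago they were last modified. The function name `stale_failed_allocs`, the three-day threshold, and the `NOW` timestamp are assumptions made for illustration; `NOW` is only an approximation of when the output above was captured. The sample rows are a subset copied from the table above.

```python
from datetime import datetime, timedelta

# Subset of the allocation rows from the verbose output above.
SAMPLE = """\
06203c42-3bd8-6382-3fb7-ffa1cd13349c  bc549b15-ebb1-7cbe-0db6-c01816eb0e79  063f8506-6642-3490-cca1-48c61e4d3cec  githubproxy-branches  0        run      running  2018-05-21T15:29:17+03:00  2018-05-21T15:31:04+03:00
8e51a8a7-25f2-58b7-4d81-8fb01b64b59c  89e07d3d-33e0-c939-7856-a68a8b3e7c4f  b6f0e522-58f1-64ba-7697-a8107bd4f93a  githubproxy-branches  0        stop     failed   2018-05-18T00:25:35+03:00  2018-05-18T00:44:09+03:00
bf117538-c6e6-823b-d8c0-8cd0ba902e9c  f50dc68d-9050-4dbf-162c-f417dc1a2929  83cb2016-6427-592d-d659-c3fc907f4808  githubproxy-branches  0        stop     failed   2018-05-14T15:08:57+03:00  2018-05-16T17:36:38+03:00
"""

# Assumption: approximate time at which the status output was captured.
NOW = datetime.fromisoformat("2018-05-21T21:45:00+03:00")


def stale_failed_allocs(table, now, min_age):
    """Return (alloc_id, eval_id, age) for failed allocations with
    desired state 'stop' whose Modified time is at least min_age old."""
    found = []
    for line in table.splitlines():
        cols = line.split()
        # Verbose allocation rows have exactly nine whitespace-separated
        # columns when the task group name contains no spaces; the header
        # and blank lines do not, so they are skipped.
        if len(cols) != 9:
            continue
        alloc_id, eval_id, _node, _group, _ver, desired, status, _created, modified = cols
        if desired != "stop" or status != "failed":
            continue
        age = now - datetime.fromisoformat(modified)
        if age >= min_age:
            found.append((alloc_id, eval_id, age))
    return found


if __name__ == "__main__":
    for alloc_id, eval_id, age in stale_failed_allocs(SAMPLE, NOW, timedelta(days=3)):
        print(alloc_id, eval_id, age)
```

The Eval IDs it returns are the ones to pass to `nomad eval status`.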

As you can see, we have 2 allocations in the failed state that have lived for more than 3 days (here is the short output of nomad job status, where the modify time is displayed more readably):

ruslan@ruslan:~$ nomad job status githubproxy-branches
ID            = githubproxy-branches
Name          = githubproxy-branches
Submit Date   = 2018-05-14T15:08:57+03:00
Type          = service
Priority      = 50
Datacenters   = ptz,msc,ivn,klg,spb,kv,krv,rsv
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group            Queued  Starting  Running  Failed  Complete  Lost
githubproxy-branches  0       0         8        96      17        1

Latest Deployment
ID          = 2e862f0e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group            Desired  Placed  Healthy  Unhealthy
githubproxy-branches  8        8       5        3

Allocations
ID        Node ID   Task Group            Version  Desired  Status   Created     Modified
06203c42  063f8506  githubproxy-branches  0        run      running  6h16m ago   6h14m ago
a7338298  397f61c5  githubproxy-branches  0        run      running  7h49m ago   7h48m ago
3a7d2ab7  696fcaee  githubproxy-branches  0        run      running  12h35m ago  12h33m ago
3bb92d33  b6f0e522  githubproxy-branches  0        run      running  14h22m ago  14h17m ago
a07bdc8e  83cb2016  githubproxy-branches  0        run      running  1d11h ago   1d11h ago
b7e6501c  f632bcd6  githubproxy-branches  0        run      running  3d3h ago    3d3h ago
8e51a8a7  b6f0e522  githubproxy-branches  0        stop     failed   3d21h ago   3d21h ago
171a3f44  b57b72c4  githubproxy-branches  0        run      running  5d22h ago   5d22h ago
bf117538  83cb2016  githubproxy-branches  0        stop     failed   7d6h ago    5d4h ago
f5a8a7e3  e432b04f  githubproxy-branches  0        run      running  7d6h ago    7d6h ago

In my opinion, eval status for them shows nothing interesting:

ruslan@ruslan:~$ nomad eval status 89e07d3d-33e0-c939-7856-a68a8b3e7c4f
ID                 = 89e07d3d
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = node-update
Priority           = 50
Placement Failures = false

and for the second:

ruslan@ruslan:~$ nomad eval status f50dc68d-9050-4dbf-162c-f417dc1a2929
ID                 = f50dc68d
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = githubproxy-branches
Priority           = 50
Placement Failures = false

In our case the network is sometimes very unstable, and we have many connectivity problems between DCs.

@tantra35
Contributor Author

tantra35 commented May 21, 2018

But we also have jobs with failed allocations that are not GCed in a more stable AWS network environment, for example:

ruslan@ruslan:~$ nomad status -region=atf01 teamcity-build-monitor
ID            = teamcity-build-monitor
Name          = teamcity-build-monitor
Submit Date   = 2018-05-16T14:20:44+03:00
Type          = service
Priority      = 80
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group              Queued  Starting  Running  Failed  Complete  Lost
teamcity-build-monitor  0       0         1        1       0         0

Latest Deployment
ID          = 0c97a80e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group              Desired  Placed  Healthy  Unhealthy
teamcity-build-monitor  1        1       1        0

Allocations
ID        Node ID   Task Group              Version  Desired  Status   Created     Modified
95ecf8ef  fa553b73  teamcity-build-monitor  64       run      running  5d7h ago    5d7h ago
2905720b  bcd00076  teamcity-build-monitor  36       stop     failed   28d11h ago  5d7h ago

and the verbose output:

ruslan@ruslan:~$ nomad job status -verbose -region=atf01 teamcity-build-monitor
ID            = teamcity-build-monitor
Name          = teamcity-build-monitor
Submit Date   = 2018-05-16T14:20:44+03:00
Type          = service
Priority      = 80
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group              Queued  Starting  Running  Failed  Complete  Lost
teamcity-build-monitor  0       0         1        1       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
e283d1ea-180e-c553-cb48-181c2025787b  80        job-register  complete  false
cd35b264-23fe-18fe-6e52-75d8f0ca8477  80        job-register  complete  false

Latest Deployment
ID          = 0c97a80e-02dc-b3d8-88df-81532fa8d298
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group              Desired  Placed  Healthy  Unhealthy
teamcity-build-monitor  1        1       1        0

Allocations
ID                                    Eval ID                               Node ID                               Task Group              Version  Desired  Status   Created                    Modified
95ecf8ef-07d1-54db-2fbd-ac46cd3e8503  e283d1ea-180e-c553-cb48-181c2025787b  fa553b73-e869-d72a-8831-604ac2ea4fbd  teamcity-build-monitor  64       run      running  2018-05-16T14:20:45+03:00  2018-05-16T14:20:55+03:00
2905720b-887a-a4f0-6d11-6c730d081005  cd35b264-23fe-18fe-6e52-75d8f0ca8477  bcd00076-347f-b571-85fb-5947ca3c2476  teamcity-build-monitor  36       stop     failed   2018-04-23T10:29:09+03:00  2018-05-16T14:20:42+03:00

with the following eval status for the failed allocation:

ruslan@ruslan:~$ nomad eval status -verbose -region=atf01 cd35b264-23fe-18fe-6e52-75d8f0ca8477
ID                 = cd35b264-23fe-18fe-6e52-75d8f0ca8477
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = teamcity-build-monitor
Priority           = 80
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

@qkate qkate assigned preetapan and qkate and unassigned qkate May 21, 2018
@tantra35
Contributor Author

tantra35 commented May 21, 2018

So we don't do anything special, just the standard job workflow. It's strange that a manual GC doesn't help and that nothing interesting appears in the Nomad server leader logs (although we have DEBUG verbosity).

@preetapan
Contributor

@tantra35 I was able to reproduce this and PR #4313 should fix it. Would you be willing to try it out if I provide you a test binary tomorrow?

@tantra35
Contributor Author

@preetapan Great work, of course we can try this binary. But I'm confused, because in our case the job version hasn't changed:

ruslan@ruslan:~$ nomad job history githubproxy-branches
Version     = 0
Stable      = true
Submit Date = 2018-05-14T15:08:57+03:00

@preetapan
Contributor

@tantra35 That is expected if you ran nomad stop -purge githubproxy-branches or used the stop API with purge set. When the job is run again the version is reset to zero, but any previous allocs that were not GCed properly are still associated with the same job ID.
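
One consequence of the explanation above can be sketched in Python. If a purge resets the job version to zero while older allocations stay attached to the same job ID, then any allocation whose Version column is higher than the job's current version cannot belong to the current incarnation of the job. The helper name `leftover_allocs` and the sample data are hypothetical and chosen purely for illustration. The check is also one-directional: a leftover allocation with a version at or below the current one would not be detected this way.

```python
def leftover_allocs(alloc_versions, current_job_version):
    """Return the IDs of allocations whose job version is higher than
    the job's current version, sorted for stable output.

    alloc_versions maps an allocation ID to the value of its Version
    column; current_job_version is the latest version reported by
    `nomad job history`.
    """
    return sorted(
        alloc_id
        for alloc_id, version in alloc_versions.items()
        if version > current_job_version
    )


# Hypothetical data: two allocations left over from before a purge and
# one allocation from the re-registered job at version 0.
EXAMPLE_ALLOCS = {
    "alloc-old-1": 15,
    "alloc-old-2": 14,
    "alloc-new-1": 0,
}

if __name__ == "__main__":
    print(leftover_allocs(EXAMPLE_ALLOCS, 0))
```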

@tantra35
Contributor Author

@preetapan Hm, that sounds logical, but I'm quite sure that after the stop and re-launch there weren't any allocations in the failed state, only 8 healthy allocations. Also, if I understand everything correctly, the allocation IDs should not have changed, so I would expect to find them in the first message of this issue, but they aren't there.

Anyway, let's try it.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022