
Client not starting a placed allocation #3227

Closed
nugend opened this issue Sep 15, 2017 · 29 comments

nugend commented Sep 15, 2017

Nomad version

0.6.0

I had an issue recently where a number of batch jobs were pending, but they didn't seem to be blocked on any resource. While tailing the logs, nomad seemed to be in a loop trying to allocate a service job whose sole task was starting and exiting extremely quickly (it had a bad configuration and was using the default restart stanza). I did try stopping it to see if that would cause the pending batch jobs to get allocations, but due to what appears to be an unrelated bug, nomad continued to try and allocate the service job despite the desired state being dead (#3217 maybe?).
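
For reference, an explicit restart stanza like the sketch below (values are illustrative) would make such a task fail after a couple of attempts instead of restarting in a loop:

    restart {
      attempts = 2
      interval = "5m"
      delay    = "15s"
      mode     = "fail"  # fail once attempts are exhausted, rather than delay and retry indefinitely
    }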

My question is: Is the intended behavior that nomad continue to try to allocate the service job and ignore the batch jobs for allocation placement? I haven't adjusted any priority values in my current job specifications, so I would presume they are all at the default of 50. Normally, I would actually prefer that the service jobs have a higher priority, but it did seem a little funny that nomad wasn't trying to schedule anything else.


dadgar commented Sep 25, 2017

@nugend It would be hard to diagnose what happened with the current information. I would need to see the logs, evaluations, and allocations to get a clear picture, but I can describe Nomad's behavior in this scenario.

  1. The batch jobs not starting

There must have been something preventing the jobs from being scheduled. The scheduler will attempt a placement if there is room and all constraints are met. If those conditions are not met, Nomad creates a follow-up evaluation that blocks until more resources are freed or a new type of node is added that may satisfy the job's constraints. This information can be seen by running nomad job status <job> and is also available by querying the job's evaluations (see the sketch after the second item).

  2. Batch/Service interplay and priority.

Nomad uses the priority to give jobs first access to the schedulers, but it will not block batch placements just because a service job is trying to schedule. In future versions of Nomad, the priority will also be used for preemption.
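
To make that concrete, here is a sketch of how to inspect a queued job's evaluations (the job name and evaluation ID are placeholders; the HTTP path is the standard v1 job evaluations endpoint):

    nomad job status example-batch        # shows Placement Failures per evaluation
    nomad eval-status -verbose <eval-id>  # details for one evaluation, including any blocked eval
    curl http://127.0.0.1:4646/v1/job/example-batch/evaluations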

Hope that helps! If this happens again, please grab the logs, evals, and allocations for the affected jobs! 👍

@dadgar dadgar closed this as completed Sep 25, 2017

nugend commented Dec 9, 2017

@dadgar Pretty sure I saw this again. The logs contained absolutely nothing of interest except that the jobs that were in a crash loop were being restarted.

We had the same configuration problem in two environments this time and the "issue" seemed to only show up in one of the environments, so I'm guessing it's a bug or something.

If I can figure out how to reproduce the issue consistently, I'll ask you to reopen.


nugend commented Jan 9, 2018

This may be a bad title for this issue. I've seen this again in two different environments this week.

There is really nothing interesting or revealing in the logs. I'm running a 3-node setup where all clients are also masters. The master node is executing allocations; the client nodes are not, and the allocations assigned to those nodes are stuck pending.

Still, I've only noticed this when there's an infinite crash loop occurring on an unrelated process.

I was trying to determine if this had something to do with leader election, but I don't see any leader elections in the last 24 hours of log files, and unless I'm reading the documentation wrong, I don't think there's a way to see when the last leader election was.

I am at a total loss as to what's causing this issue.

Here's an example of the nomad job status for one job:

bash-4.2$ nomad job status -verbose REDACTED
ID            = REDACTED
Name          = REDACTED
Submit Date   = 01/09/18 13:30:41 EST
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
REDACTED  0       1         0        0       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
3a925517-f196-5c7b-41b6-018b46a54017  50        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group  Version  Desired  Status   Created                Modified
1e076522-3be9-4e81-58d6-bd8f3ef93a07  3a925517-f196-5c7b-41b6-018b46a54017  239f4156-410c-c04c-db00-e637b58392ac  REDACTED  0        run      pending  01/09/18 13:30:41 EST  01/09/18 13:30:41 EST

As of executing this status command, the local time is 17:26 EST. There doesn't appear to be an issue with actually placing the job, just with getting it out of the pending state.


dadgar commented Jan 9, 2018

What is the output of nomad alloc-status 1e076522-3be9-4e81-58d6-bd8f3ef93a07?


nugend commented Jan 10, 2018

Had to restart the cluster. The next time I see this show up, I'll give that a shot. Hopefully it'll be a while.


nugend commented Jan 11, 2018

Think I'm seeing another similar situation:

bash-4.2$ nomad job status -verbose REDACTED
ID            = REDACTED
Name          = REDACTED
Submit Date   = 01/11/18 15:20:28 EST
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
REDACTED    0       1         0        0       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
54572f7d-0b3f-d15c-2b48-2a6c95cab28b  50        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group  Version  Desired  Status   Created                Modified
ffe4b31c-c6b4-b6fc-983b-71206353d089  54572f7d-0b3f-d15c-2b48-2a6c95cab28b  a9ebba54-ed27-d1aa-0041-1d6b15743b9f  REDACTED    0        run      pending  01/11/18 15:20:28 EST  01/11/18 15:20:28 EST
bash-4.2$ nomad alloc-status -verbose ffe4b31c-c6b4-b6fc-983b-71206353d089
ID                  = ffe4b31c
Eval ID             = 54572f7d
Name                = REDACTED.REDACTED[0]
Node ID             = a9ebba54
Job ID              = REDACTED
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h8m ago
Modified            = 1h8m ago

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "ffe4b31c-c6b4-b6fc-983b-71206353d089")

Placement Metrics
  * Score "8e788f15-43d0-47bd-7a38-9564df668812.binpack" = 3.474508
  * Score "a9ebba54-ed27-d1aa-0041-1d6b15743b9f.binpack" = 5.357172

I'm pretty sure the Client.Advertise.HTTP property is set, since I can access the HTTP API on this and all the other nodes, so I'm not sure what that error message is telling me. The agent-info command doesn't report that value, though.
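
For reference, the stanza in question looks like this in the agent config (a sketch with a placeholder address):

    advertise {
      http = "10.0.0.1:4646"  # address other agents and the CLI use to reach this node's HTTP API
    }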

Additionally, in case it's pertinent, the alloc-status command produces identical results from all nodes in the cluster.

Here's the node status in case that helps:

bash-4.2$ nomad node-status -verbose
ID                                    DC   Name          Class   Version  Drain  Status
8e788f15-43d0-47bd-7a38-9564df668812  dc1  REDACTED      <none>  0.7.1    false  ready
95e6b64a-fa55-1408-b81c-40df588374ac  dc1  REDACTED      <none>  0.7.1    false  ready
a9ebba54-ed27-d1aa-0041-1d6b15743b9f  dc1  REDACTED      <none>  0.7.1    false  ready


nugend commented Jan 18, 2018

bash-4.2$ nomad job status -verbose REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707
ID            = REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707
Name          = REDACTED
Submit Date   = 01/18/18 05:12:52 EST
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
REDACTED-task  0       1         0        0       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
3b00e719-55ee-4f69-3012-99bf973354a6  50        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group     Version  Desired  Status   Created                Modified
5af0891e-4a0c-36d1-98e5-a920aed564aa  3b00e719-55ee-4f69-3012-99bf973354a6  3106b225-dba4-85c2-dd59-24c2112d8906  REDACTED-task  0        run      pending  01/18/18 05:12:52 EST  01/18/18 05:12:52 EST
bash-4.2$ nomad alloc-status -verbose 5af0891e-4a0c-36d1-98e5-a920aed564aa
ID                  = 5af0891e-4a0c-36d1-98e5-a920aed564aa
Eval ID             = 3b00e719-55ee-4f69-3012-99bf973354a6
Name                = REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707.REDACTED-task[0]
Node ID             = 3106b225-dba4-85c2-dd59-24c2112d8906
Job ID              = REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 01/18/18 05:12:52 EST
Modified            = 01/18/18 05:12:52 EST
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 1.175832ms
Failures            = 0

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "5af0891e-4a0c-36d1-98e5-a920aed564aa")

Placement Metrics
  * Score "c2349873-9bbe-4642-9263-4ace73b24d23.binpack" = 2.998036
  * Score "3106b225-dba4-85c2-dd59-24c2112d8906.binpack" = 3.328022
bash-4.2$ nomad eval-status -verbose 3b00e719-55ee-4f69-3012-99bf973354a6
ID                 = 3b00e719-55ee-4f69-3012-99bf973354a6
Status             = complete
Status Description = complete
Type               = batch
TriggeredBy        = job-register
Job ID             = REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

Can you just confirm to me that this shouldn't be happening? That is, that this isn't due to resource exhaustion?


dadgar commented Jan 18, 2018

@nugend Can you output curl http://127.0.0.1:4646/v1/allocation/5af0891e-4a0c-36d1-98e5-a920aed564aa or replace the allocation ID with another exhibiting this behavior?

Can you also grab and post the client logs, including from a little before the allocation was created, so we can see the behavior from start to finish in the logs?


nugend commented Jan 18, 2018

bash-4.2$ curl -v http://REDACTED:4646/v1/allocation/5af0891e-4a0c-36d1-98e5-a920aed564aa | jq
* About to connect() to REDACTED port 4646 (#0)
*   Trying REDACTED...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to REDACTED (REDACTED) port 4646 (#0)
> GET /v1/allocation/5af0891e-4a0c-36d1-98e5-a920aed564aa HTTP/1.1
> User-Agent: curl/7.29.0
> Host: REDACTED:4646
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Vary: Accept-Encoding
< X-Nomad-Index: 594538
< X-Nomad-Knownleader: true
< X-Nomad-Lastcontact: 0
< Date: Thu, 18 Jan 2018 18:00:47 GMT
< Transfer-Encoding: chunked
<
{ [data not shown]
100  3131    0  3131    0     0  1204k      0 --:--:-- --:--:-- --:--:-- 1528k
* Connection #0 to host REDACTED left intact
{
  "ID": "5af0891e-4a0c-36d1-98e5-a920aed564aa",
  "Namespace": "default",
  "EvalID": "3b00e719-55ee-4f69-3012-99bf973354a6",
  "Name": "REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707.REDACTED[0]",
  "NodeID": "3106b225-dba4-85c2-dd59-24c2112d8906",
  "JobID": "REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707",
  "Job": {
    "Stop": false,
    "Region": "global",
    "Namespace": "default",
    "ID": "REDACTED-periodic-REDACTED-5562d24d-35dd-42c2-ab41-6e30c0aa5707",
    "ParentID": "REDACTED-periodic-REDACTED/periodic-1515888000",
    "Name": "REDACTED",
    "Type": "batch",
    "Priority": 50,
    "AllAtOnce": false,
    "Datacenters": [
      "dc1"
    ],
    "Constraints": null,
    "TaskGroups": [
      {
        "Name": "REDACTED",
        "Count": 1,
        "Update": null,
        "Constraints": null,
        "RestartPolicy": {
          "Attempts": 1,
          "Interval": 604800000000000,
          "Delay": 15000000000,
          "Mode": "fail"
        },
        "Tasks": [
          {
            "Name": "REDACTED",
            "Driver": "raw_exec",
            "User": "",
            "Config": {
              "command": "${meta.deploy_path}/REDACTED/runtrap",
              "args": [
                "REDACTED"
              ]
            },
            "Env": {
              "REDACTED": "REDACTED",
              "REDACTED": "REDACTED",
              "REDACTED": "REDACTED"
            },
            "Services": null,
            "Vault": null,
            "Templates": null,
            "Constraints": null,
            "Resources": {
              "CPU": 100,
              "MemoryMB": 256,
              "DiskMB": 0,
              "IOPS": 0,
              "Networks": null
            },
            "DispatchPayload": null,
            "Meta": null,
            "KillTimeout": 5000000000,
            "LogConfig": {
              "MaxFiles": 10,
              "MaxFileSizeMB": 10
            },
            "Artifacts": null,
            "Leader": false,
            "ShutdownDelay": 0,
            "KillSignal": ""
          }
        ],
        "EphemeralDisk": {
          "Sticky": false,
          "SizeMB": 300,
          "Migrate": false
        },
        "Meta": null
      }
    ],
    "Update": {
      "Stagger": 0,
      "MaxParallel": 0,
      "HealthCheck": "",
      "MinHealthyTime": 0,
      "HealthyDeadline": 0,
      "AutoRevert": false,
      "Canary": 0
    },
    "Periodic": null,
    "ParameterizedJob": null,
    "Payload": null,
    "Meta": null,
    "VaultToken": "",
    "Status": "pending",
    "StatusDescription": "",
    "Stable": false,
    "Version": 0,
    "SubmitTime": 1516270372104751400,
    "CreateIndex": 594536,
    "ModifyIndex": 594536,
    "JobModifyIndex": 594536
  },
  "TaskGroup": "REDACTED",
  "Resources": {
    "CPU": 100,
    "MemoryMB": 256,
    "DiskMB": 300,
    "IOPS": 0,
    "Networks": null
  },
  "SharedResources": {
    "CPU": 0,
    "MemoryMB": 0,
    "DiskMB": 300,
    "IOPS": 0,
    "Networks": null
  },
  "TaskResources": {
    "REDACTED": {
      "CPU": 100,
      "MemoryMB": 256,
      "DiskMB": 0,
      "IOPS": 0,
      "Networks": null
    }
  },
  "Metrics": {
    "NodesEvaluated": 2,
    "NodesFiltered": 0,
    "NodesAvailable": {
      "dc1": 3
    },
    "ClassFiltered": null,
    "ConstraintFiltered": null,
    "NodesExhausted": 0,
    "ClassExhausted": null,
    "DimensionExhausted": null,
    "QuotaExhausted": null,
    "Scores": {
      "3106b225-dba4-85c2-dd59-24c2112d8906.binpack": 3.328021522915815,
      "c2349873-9bbe-4642-9263-4ace73b24d23.binpack": 2.9980360779389414
    },
    "AllocationTime": 1175832,
    "CoalescedFailures": 0
  },
  "DesiredStatus": "run",
  "DesiredDescription": "",
  "ClientStatus": "pending",
  "ClientDescription": "",
  "TaskStates": null,
  "PreviousAllocation": "",
  "DeploymentID": "",
  "DeploymentStatus": null,
  "CreateIndex": 594538,
  "ModifyIndex": 594538,
  "AllocModifyIndex": 594538,
  "CreateTime": 1516270372211145700,
  "ModifyTime": 1516270372211145700
}


dadgar commented Jan 18, 2018

@nugend What version of Nomad?


nugend commented Jan 18, 2018

Nomad v0.7.1 (0b295d3)


dadgar commented Jan 18, 2018

Can you share the client logs?


nugend commented Jan 18, 2018

That's a bit harder. There's a lot of info I need to redact, and I don't know if we've kept logs far enough back to be useful at the moment. I'm working on my end to clear up the issues preventing me from discussing this more freely.

The "good" news is that this seems to happen regularly enough that we'll almost certainly be able to investigate it again at a later point.


dadgar commented Jan 18, 2018

@nugend Do you have reproduction steps that I could use?

@dadgar dadgar changed the title [question] Scheduling priority? Client not starting a placed allocation Jan 18, 2018
@dadgar dadgar reopened this Jan 18, 2018

nugend commented Jan 18, 2018

Nope. Haven't been able to drop this down to a simple case yet.

At the moment, my working theory is "start a metric crapload of one-off jobs" over a few weeks.

@vincenthuynh

We've also seen our service job allocations become stuck in the pending status when deploying on one of our clusters. They sit in this state for no obvious reason (resources are idle, debug logs are clean, nothing is downloading) before they eventually move to the running state, which takes anywhere from 5 to 30 minutes.

Nomad v0.7.0

When running nomad status on the stuck allocation:

$ nomad status ba4f318a
ID                  = ba4f318a
Eval ID             = 9ab96049
Name                = REDACTED[0]
Node ID             = e43b2742
Job ID              = REDACTED
Job Version         = 1033
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 03/12/18 17:27:15 EDT
Deployment ID       = f0b28203
Deployment Health   = unset

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "ba4f318a-6939-479e-2c4c-84f18c946390")

The results of curl http://127.0.0.1:4646/v1/allocation/ba4f318a-6939-479e-2c4c-84f18c946390 are:

{
...
...
...
  "Metrics": {
    "NodesEvaluated": 9,
    "NodesFiltered": 0,
    "NodesAvailable": {
      "development": 9
    },
    "ClassFiltered": null,
    "ConstraintFiltered": null,
    "NodesExhausted": 5,
    "ClassExhausted": null,
    "DimensionExhausted": {
      "memory exhausted": 2,
      "network: reserved port collision": 3
    },
    "QuotaExhausted": null,
    "Scores": {
      "e43b2742-9ecf-3cd3-1e65-cfb1de5dedc7.binpack": 16.708819396983444,
      "4207c53e-3413-767e-7723-4483e6c9f694.binpack": 8.331593517214175,
      "f6922742-7a4d-82d7-4314-e9b67c1cd4b1.binpack": 14.81238224370183,
      "47a0a478-d3a1-a0a9-cf64-63af7a8dc642.binpack": 10.4377954574019
    },
    "AllocationTime": 569492,
    "CoalescedFailures": 0
  },
  "DesiredStatus": "run",
  "DesiredDescription": "",
  "ClientStatus": "pending",
  "ClientDescription": "",
  "TaskStates": null,
  "PreviousAllocation": "",
  "DeploymentID": "f0b28203-ce5a-52e9-44b1-1aec494c7009",
  "DeploymentStatus": null,
  "CreateIndex": 4096384,
  "ModifyIndex": 4096384,
  "AllocModifyIndex": 4096384,
  "CreateTime": 1520890035377744600
}

And here are the client logs from when the job was submitted:

Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577341 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin918260731
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577370 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin918260731]}
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577378 [DEBUG] driver.docker: using 536870912 bytes memory for logshipper
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577384 [DEBUG] driver.docker: using 100 cpu shares for logshipper
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577408 [DEBUG] driver.docker: binding directories []string{"/var/lib/nomad/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/alloc:/alloc", "/var/lib/nomad/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/logshipper/local:/local", "/var
/lib/nomad/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/logshipper/secrets:/secrets", "/var/lib/nomad/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/logshipper/local/data:/data", "/var/lib/nomad/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/logshipper/filebeat.yml:/filebeat.yml", "/var/lib/noma
d/alloc/83b2b238-14fc-3e82-48f8-fc5c007bc1ba/logshipper/local/logstash-beats.crt:/logstash-beats.crt", "/var/lib/nomad/alloc:/nomad:ro"} for logshipper
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577415 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577420 [DEBUG] driver.docker: No network interfaces are available
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.577437 [DEBUG] driver.docker: setting container name to: logshipper-83b2b238-14fc-3e82-48f8-fc5c007bc1ba
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.517148] aufs au_opts_verify:1570:dockerd[2826]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.542344] aufs au_opts_verify:1570:dockerd[2826]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:26:17 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:17.635868 [INFO] driver.docker: created container 8aa8294c7cd72e7aabc40ff6b532bd2d2a8182774bcfae422d90b779a8d9f56e
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.574328] aufs au_opts_verify:1570:dockerd[952]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.590307] device veth4665811 entered promiscuous mode
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.590348] IPv6: ADDRCONF(NETDEV_UP): veth4665811: link is not ready
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.590352] docker0: port 9(veth4665811) entered forwarding state
Mar 12 17:26:17 nmd-dev-plm-06 kernel: [278062.590356] docker0: port 9(veth4665811) entered forwarding state
Mar 12 17:26:18 nmd-dev-plm-06 kernel: [278062.980555] IPv6: ADDRCONF(NETDEV_CHANGE): veth4665811: link becomes ready
Mar 12 17:26:18 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:18.097566 [INFO] driver.docker: started container 8aa8294c7cd72e7aabc40ff6b532bd2d2a8182774bcfae422d90b779a8d9f56e
Mar 12 17:26:18 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:18.112532 [WARN] client: error fetching stats of task logshipper: stats collection hasn't started yet
Mar 12 17:26:21 nmd-dev-plm-06 ntpd[608]: Listen normally on 675 veth4665811 fe80::9403:9eff:fe77:8008 UDP 123
Mar 12 17:26:21 nmd-dev-plm-06 ntpd[608]: Deleting interface #674 vethe2970b8, fe80::58a6:79ff:fea3:c998#123, interface stats: received=0, sent=0, dropped=0, active_time=5 secs
Mar 12 17:26:21 nmd-dev-plm-06 ntpd[608]: peers refreshed
Mar 12 17:26:24 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:24.944258 [DEBUG] http: Request /v1/agent/health?type=client (269.918µs)
Mar 12 17:26:32 nmd-dev-plm-06 kernel: [278077.618020] docker0: port 9(veth4665811) entered forwarding state
Mar 12 17:26:34 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:34.946639 [DEBUG] http: Request /v1/agent/health?type=client (291.941µs)
Mar 12 17:26:44 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:44.948167 [DEBUG] http: Request /v1/agent/health?type=client (301.169µs)
Mar 12 17:26:54 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:26:54.951984 [DEBUG] http: Request /v1/agent/health?type=client (297.461µs)
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'ff3a250f24f3144630eec17b6b248de2178296c1'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '0661e1c6f3fb9854780d33665b389d331fd88a8d'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '00849d6791950ac6f2e1323fb2f02f52fb73a26e'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '8926cb2cf90793666cda8c89313df9cef6e11215'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'cf409ba91fe298b56d0a1d4482dd80d9d9892523'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'fa8b616a880b907054764fc4f1fffe93c09a7942'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '306b71e4c8d818d97c31da1b5c48ae7b19e9fdcc'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '920fe576a0b1a1077af7b274de8d2509d8a28edf'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '03c5d9215842a65d5e87823dd04553f6c3837785'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '6beb15962a6d953c4a1e440723bbfe9c036813c2'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '2c0c36107151704e42f262ff3bdc85e81c0fcb47'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '30e9c5ec34ca8cb6e8ff4570d4cc3ace5e026b34'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '4e64d239d399924994e4dc3863428c17c129cc1e'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '6792c39c95236e29439ad7da8620c860d99c38d1'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '8e411f64d7b605a0fdd7606e2f3c7bccc46ba81a'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '848748024fc1752f611716ff16df75b89d83d246'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'f3f0561e6825a45aa10938ca4a96a6e2e84b78a7'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'fdceac1de9a145133bcd0190f25a16c840c9cfed'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'f22b3112dc5bbb5020ba562cfcd4b1d0f37daef4'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '64d26403968da7c0cf50119ddcfaf8f84b6d2ca2'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '8c49dfafddcb8f6eb0109e7fd8f9d61a92a9e34c'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'fad3562a06b71fc7d6c91ea4987b5642f6fed13d'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '4476c4bcfaa3a0bd8903e1f507e7d656b4eabce2'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '885fdbc5331dfb16447fd6d49b8cb64bc297d202'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '48561aa62bf843a9c05b0de2e8202e35ca88d5d1'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check 'efcebd2c41322ac83a745876717e2a194a03ef05'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '4f734fbc2c21eba5fdd98a0e4572ae7064c2d964'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '070003ae57fa6c322ca1dbac88299d1daf46ab34'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '153cb7f92f311f3df78552fa82c571ee9ed31582'
Mar 12 17:27:01 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:01 [INFO] agent: Synced check '2f9df0f65e401ce322137d364e00ceb89c3e5962'
Mar 12 17:27:04 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:04.953554 [DEBUG] http: Request /v1/agent/health?type=client (329.857µs)
Mar 12 17:27:14 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:14.954773 [DEBUG] http: Request /v1/agent/health?type=client (316.115µs)
Mar 12 17:27:24 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:24.955947 [DEBUG] http: Request /v1/agent/health?type=client (305.462µs)
Mar 12 17:27:34 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:34.957064 [DEBUG] http: Request /v1/agent/health?type=client (259.226µs)
Mar 12 17:27:44 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:44.959056 [DEBUG] http: Request /v1/agent/health?type=client (617.744µs)
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) receiving dependency health.service(redis|passing)
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [INFO] (runner) initiating run
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) checking template 08b5fbc7335407f902456c66901ad6ef
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/redis-commander/local/configuration"
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) diffing and updating dependencies
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) health.service(redis|passing) is still needed
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) watching 1 dependencies
Mar 12 17:27:50 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:50 [DEBUG] (runner) all templates rendered
Mar 12 17:27:54 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:54.960573 [DEBUG] http: Request /v1/agent/health?type=client (291.272µs)
Mar 12 17:27:55 nmd-dev-plm-06 dockerd[616]: time="2018-03-12T17:27:55.623913603-04:00" level=info msg="Container 37697d8360d3466306e1a6395386803defc9951b82ca2950a2e33cfbd3ab3859 failed to exit within 5 seconds of signal 15 - using the force"
Mar 12 17:27:55 nmd-dev-plm-06 kernel: [278160.687079] docker0: port 1(veth88c4200) entered disabled state
Mar 12 17:27:55 nmd-dev-plm-06 kernel: [278160.756806] docker0: port 1(veth88c4200) entered disabled state
Mar 12 17:27:55 nmd-dev-plm-06 kernel: [278160.757995] device veth88c4200 left promiscuous mode
Mar 12 17:27:55 nmd-dev-plm-06 kernel: [278160.758002] docker0: port 1(veth88c4200) entered disabled state
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:55.950533 [INFO] driver.docker: stopped container 37697d8360d3466306e1a6395386803defc9951b82ca2950a2e33cfbd3ab3859
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:55.950743 [DEBUG] driver.docker: error collecting stats from container 37697d8360d3466306e1a6395386803defc9951b82ca2950a2e33cfbd3ab3859: io: read/write on closed pipe
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018-03-12T17:27:55.952-0400 [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
Mar 12 17:27:55 nmd-dev-plm-06 dockerd[616]: time="2018-03-12T17:27:55.953340898-04:00" level=error msg="Handler for POST /containers/37697d8360d3466306e1a6395386803defc9951b82ca2950a2e33cfbd3ab3859/stop returned error: Container 37697d8360d3466306e1a6395386803defc9951b82ca2950a2e33cfbd3ab3859 is already stopped"
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:55.962246 [INFO] client: Restarting task "redis-commander" for alloc "56df8b65-e77b-951a-2bf4-906b5799477f" in 0s
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:55.962827 [DEBUG] client: driver event for alloc "56df8b65-e77b-951a-2bf4-906b5799477f": Downloading image rediscommander/redis-commander:latest
Mar 12 17:27:55 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:55 [INFO] agent: Deregistered service '_nomad-executor-56df8b65-e77b-951a-2bf4-906b5799477f-redis-commander-redis-commander-urlprivate-ha-redis-commander.query.devconsul/'
Mar 12 17:27:55 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:55 [INFO] agent: Deregistered check '153cb7f92f311f3df78552fa82c571ee9ed31582'
Mar 12 17:27:55 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:55 [INFO] agent: Deregistered check '153cb7f92f311f3df78552fa82c571ee9ed31582'
Mar 12 17:27:55 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:55.989764 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 1 services, 1 checks
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.388720 [DEBUG] driver.docker: docker pull rediscommander/redis-commander:latest succeeded
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018-03-12T17:27:56.390-0400 [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/redis-commander/executor.out","LogLevel":"DEBUG"}]"
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018-03-12T17:27:56.390-0400 [DEBUG] plugin: waiting for RPC address: path=/usr/local/bin/nomad
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018-03-12T17:27:56.416-0400 [DEBUG] plugin.nomad: plugin address: timestamp=2018-03-12T17:27:56.415-0400 network=unix address=/tmp/plugin245284078
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420548 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin902679413
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420580 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin902679413]}
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420588 [DEBUG] driver.docker: using 268435456 bytes memory for redis-commander
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420595 [DEBUG] driver.docker: using 100 cpu shares for redis-commander
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420616 [DEBUG] driver.docker: binding directories []string{"/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/alloc:/alloc", "/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/redis-commander/local:/local", "/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/redis-commander/secrets:/secrets", "/var/lib/nomad/alloc/56df8b65-e77b-951a-2bf4-906b5799477f/redis-commander/local/configuration:/home/redis-commander/.redis-commander"} for redis-commander
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420624 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420634 [DEBUG] driver.docker: allocated port 192.168.80.207:22742 -> 8081 (mapped)
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420642 [DEBUG] driver.docker: exposed port 8081
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.420661 [DEBUG] driver.docker: setting container name to: redis-commander-56df8b65-e77b-951a-2bf4-906b5799477f
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.324169] aufs au_opts_verify:1570:dockerd[945]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.414559] aufs au_opts_verify:1570:dockerd[945]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.563364 [INFO] driver.docker: created container 40ba95073a62ff38a4f87f9f2f8dbc4b4b3edd6820c9593dfb95ac7b62403b38
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.464155] aufs au_opts_verify:1570:dockerd[950]: dirperm1 breaks the protection by the permission bits on the lower branch
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.482682] device veth252edf2 entered promiscuous mode
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.482732] IPv6: ADDRCONF(NETDEV_UP): veth252edf2: link is not ready
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.482735] docker0: port 1(veth252edf2) entered forwarding state
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.482739] docker0: port 1(veth252edf2) entered forwarding state
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.714155] docker0: port 1(veth252edf2) entered disabled state
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.732651] IPv6: ADDRCONF(NETDEV_CHANGE): veth252edf2: link becomes ready
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.732683] docker0: port 1(veth252edf2) entered forwarding state
Mar 12 17:27:56 nmd-dev-plm-06 kernel: [278161.732687] docker0: port 1(veth252edf2) entered forwarding state
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.922248 [INFO] driver.docker: started container 40ba95073a62ff38a4f87f9f2f8dbc4b4b3edd6820c9593dfb95ac7b62403b38
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.940891 [WARN] client: error fetching stats of task redis-commander: stats collection hasn't started yet
Mar 12 17:27:56 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:56 [INFO] agent: Synced service '_nomad-executor-56df8b65-e77b-951a-2bf4-906b5799477f-redis-commander-redis-commander-urlprivate-ha-redis-commander.query.devconsul/'
Mar 12 17:27:56 nmd-dev-plm-06 consul[581]: 2018/03/12 17:27:56 [INFO] agent: Synced check '153cb7f92f311f3df78552fa82c571ee9ed31582'
Mar 12 17:27:56 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:27:56.971721 [DEBUG] consul.sync: registered 1 services, 1 checks; deregistered 0 services, 0 checks
Mar 12 17:28:00 nmd-dev-plm-06 ntpd[608]: Listen normally on 676 veth252edf2 fe80::7cc0:e0ff:fedc:a23f UDP 123
Mar 12 17:28:00 nmd-dev-plm-06 ntpd[608]: Deleting interface #673 veth88c4200, fe80::c0b9:bbff:fe20:57d1#123, interface stats: received=0, sent=0, dropped=0, active_time=636 secs
Mar 12 17:28:00 nmd-dev-plm-06 ntpd[608]: peers refreshed
Mar 12 17:28:04 nmd-dev-plm-06 nomad[582]: 2018/03/12 17:28:04.961877 [DEBUG] http: Request /v1/agent/health?type=client (292.911µs)
Mar 12 17:28:05 nmd-dev-plm-06 consul[581]: 2018/03/12 17:28:05 [INFO] agent: Synced check '4e64d239d399924994e4dc3863428c17c129cc1e'
Mar 12 17:28:05 nmd-dev-plm-06 consul[581]: 2018/03/12 17:28:05 [INFO] agent: Synced check '6792c39c95236e29439ad7da8620c860d99c38d1'

Let me know if there is any additional logging I can enable to shed light on why the allocations become stuck.

@vincenthuynh

I think I found the source of the issue for our cluster. One of our servers' logs was out of sync with the rest of the nodes (3-server quorum). We found it by comparing their last_log_index values:

nomad agent-info | grep last_log_index

The fix was to stop nomad, delete /var/lib/nomad/server, and start the service again. Once the log entries were in sync, all the issues went away. Hope that helps!
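
Concretely, the recovery looked roughly like this (a sketch assuming a systemd-managed agent and a data_dir of /var/lib/nomad; adjust to your install):

    systemctl stop nomad
    rm -rf /var/lib/nomad/server   # wipe this server's local raft state
    systemctl start nomad          # the node re-syncs the log from its peers on rejoin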


nugend commented Mar 13, 2018

@vincenthuynh I'll look out for it the next time we see this issue.


nugend commented Mar 14, 2018

@vincenthuynh Nope. We're seeing the issue today, and the last_log_index values are all the same in our env.

We'll try your fix anyway though.

@vincenthuynh

Are you seeing the stuck allocations on a particular client node? We treat our client nodes as 'cattle'/stateless, so we just created a new client node and kicked out the bad one, but the issue seemed to come back anyway. Perhaps you can try that?


nugend commented Mar 14, 2018

Yeah. The one thing I'm noticing in the logs is that after the issue occurs, there are basically no more log messages from watchAllocations.

The healthy nodes are emitting periodic messages with varying counts like the following:

    2018/03/14 14:03:08.191428 [DEBUG] client: updated allocations at index 6871995 (total 101) (pulled 0) (filtered 101)

The unhealthy node shows this as the last instance of that log message right about when the issue manifested:

    2018/03/14 09:05:27.509992 [DEBUG] client: updated allocations at index 6940587 (total 4021) (pulled 14) (filtered 4007)

Of course, that doesn't exactly tell me if it's a client or server issue. (Like, is the client blocking for an update the server never sends? Or is the server sending updates, but the client's goroutine has crashed?)
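
One way to narrow that down is a blocking query against a server using the stuck client's last-seen index (a sketch; the node ID is a placeholder and index/wait are the standard blocking-query parameters):

    curl -D - "http://127.0.0.1:4646/v1/node/<node-id>/allocations?index=6940587&wait=60s"
    # If X-Nomad-Index advances when allocations change, the servers are publishing
    # updates and the client's watch goroutine is the likelier culprit.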


nugend commented Mar 14, 2018

@vincenthuynh We tried your suggested remediation and it sort of worked. The pending allocations began running, but the agent also tried to start new instances of long-running allocations that were still executing.

@vincenthuynh

Sounds like the scheduler lost track of the long-running allocation and/or got into an inconsistent state... Hopefully someone familiar with the scheduler can help shed some light on this?


dadgar commented Mar 22, 2018

@nugend Next time this happens, would you be willing to send the particular nomad agent that isn't starting the allocations a SIGQUIT (signal 3) so we can get a stack dump?
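
For example (the Go runtime prints all goroutine stacks to stderr on SIGQUIT; if the agent runs under systemd, the dump lands in the journal):

    kill -QUIT <agent-pid>   # target the agent process itself, not the executor processes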

Are the allocations that get stuck always using the Docker driver? What else do they have in common: are they all using templates, Vault, service definitions, etc.? Could you potentially share the job file of a few that have gotten stuck?


nugend commented Mar 23, 2018

For us, they're always raw_exec because that's the only driver we use. There is no consistency in their provenance except that it happens across a given node that's in a "bad state". That is, usually it's only one node; on occasion it's been two. We haven't seen all three nodes in our setup have this problem yet.

I will be sure to capture a stack dump next time. Sorry, I would've done so on the last occurrence had I known.


rickardrosen commented Apr 4, 2018

Seeing this as well, right now actually.
What triggered it was a flurry of batch jobs dispatched in a short period of time (~1600 parameterized jobs).
We could see a dip in connectivity to the Consul agents running on the Docker hosts, and consequently a loss of multiple services for a time.

I'll see if I can get a dump over here. @dadgar: do you want it uploaded somewhere?

@atillamas

Not sure if this is related or if I should start a new issue.
We bring up some instances during specific hours, and nomad fails to place one specific task from a group of 3 different tasks.
Some days it works, and some days I just have to rerun the jobspec to have nomad allocate the task. This does not happen in any of our other environments.
Running Nomad 0.8.6


stale bot commented May 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed May 10, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022