
intermittent error "reserved port collision" when running new job #9506

Closed
habibiefaried opened this issue Dec 3, 2020 · 10 comments · Fixed by #16401

Comments


habibiefaried commented Dec 3, 2020

Nomad version

Nomad v1.0.0-beta3 (fcb32ef)

Operating system and Environment details

Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

Issue

Sometimes, when I update or run a new job, I get a max-plan-attempts failed error in the UI.

Reproduction steps

Just run a job file

Job file (if appropriate)

Taken from: https://www.hashicorp.com/blog/consul-connect-native-tasks-in-hashicorp-nomad-0-12

job "cn-demo" {
      datacenters = ["dc1"]

      group "generator" {
        network {
          port "api" {}
        }

        service {
          name = "uuid-api"
          port = "${NOMAD_PORT_api}"

          connect {
            native = true
          }
        }

        task "generate" {
          driver = "docker"

          config {
            image        = "hashicorpnomad/uuid-api:v3"
            network_mode = "host"
          }

          env {
            BIND = "0.0.0.0"
            PORT = "${NOMAD_PORT_api}"
          }
        }
      }

      group "frontend" {
        network {
          port "http" {
            static = 9800
          }
        }

        service {
          name = "uuid-fe"
          port = "9800"

          connect {
            native = true
          }
        }

        task "frontend" {
          driver = "docker"

          config {
            image        = "hashicorpnomad/uuid-fe:v3"
            network_mode = "host"
          }

          env {
            UPSTREAM = "uuid-api"
            BIND     = "0.0.0.0"
            PORT     = "9800"
          }
        }
      }
    }

Nomad Client logs (if appropriate)

2020-12-03T05:32:38.511+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=442.672µs
    2020-12-03T05:32:48.516+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.008414ms
    2020-12-03T05:32:58.534+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=900.343µs

Nomad Server logs (if appropriate)

2020-12-03T05:33:01.954+0100 [DEBUG] worker: submitted plan for evaluation: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823
    2020-12-03T05:33:01.954+0100 [DEBUG] worker: refreshing state: refresh_index=345 eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823
    2020-12-03T05:33:01.954+0100 [DEBUG] worker.service_sched: refresh forced: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823 job_id=cn-demo namespace=default
    2020-12-03T05:33:01.954+0100 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823 job_id=cn-demo namespace=default results="Total changes: (place 2) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "generator": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "frontend": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
    2020-12-03T05:33:01.987+0100 [DEBUG] nomad: plan for node rejected: node_id=9fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5

Nomad Server Config

bind_addr = "0.0.0.0"
log_level = "DEBUG"
data_dir = "/etc/nomad"
name = "vmi483776.contaboserver.net"
server {
  enabled = true
  bootstrap_expect = 1
  raft_protocol = 3
}
advertise {
  http = "161.97.158.38"
  rpc = "161.97.158.38"
  serf = "161.97.158.38"
}
consul {
  address = "127.0.0.1:8500"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 2500
  server_stabilization_time = "10s"
  enable_redundancy_zones   = false
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}

Nomad Client config

bind_addr = "0.0.0.0"
log_level = "DEBUG"
data_dir = "/etc/nomad"
name = "vmi483778.contaboserver.net"
client {
  enabled = true
  reserved {
    reserved_ports = "22,53,68,111,8301,8500,8600,32768-60999"
  }
}
advertise {
  http = "161.97.158.40"
  rpc = "161.97.158.40"
  serf = "161.97.158.40"
}
consul {
  address = "127.0.0.1:8500"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}

Notes

By the way, you can visit this URL: http://161.97.158.38:4646/ui/jobs/cn-demo. It's my testbed server for this Nomad beta version. It's open and unauthenticated, so you can see it live.

Thanks


rgl commented Dec 3, 2020

I'm also observing this, but my job doesn't even expose any port, so I'm a bit confused by the error message:

job "damon" {
  datacenters = ["dc1"]
  group "damon" {
    count = 1
    task "damon" {
      driver = "raw_exec"
      artifact {
        source = "https://github.com/rgl/damon/releases/download/v0.1.2/damon.zip"
      }
      template {
        destination = "local/example.ps1"
        data = <<-EOD
...
        EOD
      }
      config {
        command = "local/damon.exe"
        args = [
          "PowerShell.exe",
          "-File",
          "local/example.ps1"
        ]
      }
      resources {
        cpu = 100
        memory = 350
      }
    }
  }
}

Also, is there a better way to gather these error messages besides monitoring the client logs from the Nomad UI (e.g. http://localhost:4646/ui/clients/c9d4d1c5-51cc-7276-0949-638d0072061e/monitor?level=debug)?

It would be pretty nice to see these errors from e.g. http://localhost:4646/ui/jobs/damon/evaluations. For my particular problem, it only shows something like max-plan-attempts | failed | False.

I'm not sure if it makes any difference, but I'm running Nomad on Windows Server 2019.

@habibiefaried (Author)

Yes, I agree. This kind of error should be visible in the UI for the failed job.

By the way, I have a weird workaround for this bug: just restart all Nomad servers and eventually the job will run successfully again. I've done it on my cluster multiple times.

However, I'm not sure if I'll have to do this periodically -__-'

@habibiefaried (Author)

By the way, I have a solid workaround for this. These are the steps I took:

  1. Restart all Nomad servers
  2. Stop the pending task
  3. Re-run the same task

Well, at least I have this workaround; hoping it will be useful for fellow users as well.
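
For reference, the steps above map roughly to the commands below. This is only a sketch: it assumes the agents are managed by systemd and that the job file is the cn-demo example from this issue (both are assumptions).

# 1. Restart every Nomad server agent (run on each server host; assumes systemd)
sudo systemctl restart nomad

# 2. Stop the pending job (cn-demo is the example job from this issue)
nomad job stop cn-demo

# 3. Re-run the same job from its job file (the file name is an assumption)
nomad job run cn-demo.nomad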


tgross commented Dec 8, 2020

Hi @habibiefaried, it looks like you're running into another example of #9492. The port reservation is only honored for static ports, not dynamic ports.

We intend to fix this (see #8186). There is a community-provided PR #8478, but it doesn't seem like it's going to make it over the finish line, so we're going to have to circle back to it.

@tgross added the theme/networking and stage/accepted labels on Dec 8, 2020

exFalso commented Dec 22, 2020

I'm getting this as well on 0.12, but the workaround doesn't work. The error seems persistent; I cannot run any job on the cluster at the moment, they all result in reserved port collision. I've tried restarting servers and clients, and stopping/restarting all jobs.


exFalso commented Dec 22, 2020

OK, what worked:

  1. Stop all servers
  2. Manually wipe the server data folders
  3. Restart the servers
  4. Restart the clients

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
schmichael added a commit that referenced this issue Oct 31, 2021
As we have continued to see reports of #9506 we need to elevate this log
line as it is the only way to detect when plans are being *erroneously*
rejected.

Users who see this log line repeatedly should drain and restart the node
in the log line. This seems to work around the issue.

Please post any details on #9506!
@schmichael (Member)

We've seen this come up more and are actively investigating it. One workaround we've seen work every time is restarting the client agent (node) specified in the log message. It seems the node's state on one or more servers can get corrupted, and restarting it fixes the corruption.

Fixing this is a top priority of the Nomad team! Sorry for the delay, it's a tricky one.
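
For anyone scripting that workaround, a minimal sketch follows. The node ID is the one from the log excerpt earlier in this issue, the drain step follows the advice in the commit messages above, and the agent being managed by systemd is an assumption.

# Identify the node from the "plan for node rejected" log line, then drain it
nomad node drain -enable -yes 9fa84370-c713-b914-d329-f6485951cddc

# Restart the Nomad client agent on that host (assumes systemd)
sudo systemctl restart nomad

# Re-enable scheduling on the node once the agent is healthy again
nomad node drain -disable 9fa84370-c713-b914-d329-f6485951cddc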


badalex commented Dec 30, 2021

For what it's worth, in my case I tracked this down to a bad Puppet config: 80,8081 turned into reserved_ports: 808081, which blocked all allocations on affected nodes.
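
A quick way to spot this kind of malformed value is to grep the client configuration for it; the /etc/nomad.d path is only an assumption about where the config lives.

# Valid values are comma-separated ports and ranges, e.g. "80,8081" or "22,32768-60999"
grep -rn "reserved_ports" /etc/nomad.d/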

schmichael added a commit that referenced this issue Jan 5, 2022
#9506 (comment)
makes for an easy repro

schmichael commented Jan 5, 2022

Thanks @badalex! While unfortunately that isn't the root cause for everyone affected, it does make for a nice easy reproduction!

As a Go test: https://github.com/hashicorp/nomad/compare/b-invalid-reserved-ports#diff-b501e4f2415a7da37e504a9ab69b251588909e41ffd809d0b88fe3c1b87d6751R2345-R2422

Outputs:

job_endpoint_test.go:2411:
                Error Trace:    job_endpoint_test.go:2411
                Error:          "map[]" should have 1 item(s), but has 0
                Test:           TestJobEndpoint_Register_BadNodeNetwork
                Messages:       &{_struct:false ID:f5d53f73-3d33-1c27-6e2d-f5bd7a5da1ea Namespace:default Priority:50 Type:service TriggeredBy:job-register JobID:mock-service-9d725069-2884-307e-53da-9d46aa40edcc JobModifyIndex:7 NodeID: NodeModifyIndex:0 DeploymentID: Status:failed StatusDescription:maximum attempts reached (5) Wait:0s WaitUntil:0001-01-01 00:00:00 +0000 UTC NextEval: PreviousEval: BlockedEval:caed821f-5c85-7bc3-8ffc-76eaf513ae05 FailedTGAllocs:map[] ClassEligibility:map[] QuotaLimitReached: EscapedComputedClass:false AnnotatePlan:false QueuedAllocations:map[web:1] LeaderACL: SnapshotIndex:7 CreateIndex:7 ModifyIndex:9 CreateTime:1641343586422081104 ModifyTime:1641343586428807197}

Note the StatusDescription:maximum attempts reached (5) and empty FailedTGAllocs. If the scheduler is unable to satisfy the network requirements for a job the eval should have:

FailedTGAllocs.<task_group>.DimensionExhausted = "network: dynamic port selection failed": 1
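
To check this on a live cluster rather than in the test, the evaluation's FailedTGAllocs can be inspected from the CLI. A sketch, assuming jq is installed; the eval ID is the one from the log excerpt earlier in this issue:

# Dump the evaluation as JSON and pull out the per-task-group failure metrics
nomad eval status -json 098a520b-4cd9-78a3-6ba2-613d94b33823 | jq '.FailedTGAllocs'

# On an affected eval this is empty (or null) even though the status is
# "maximum attempts reached (5)"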

Dev Agent

You can also reproduce this with a dev agent by using the following reserved.hcl agent configuration and running any job (such as produced by nomad init -short).

cat > reserved.hcl << EOF
client {
  reserved {
    reserved_ports = "808081"
  }
}
EOF

nomad agent -dev -config reserved.hcl

# Now run a job in another terminal or the UI

Fixes

I intend to add 2 fixes for this:

  1. Validate reserved_ports on client agent startup and refuse to start if the reservation is malformed. This is what Nomad client agents are intended to do if misconfigured, as we don't want to simply ignore malformed inputs and assume you're ok with the defaults instead!
  2. In the scheduler, check the return value of netIdx.SetNode(node) and skip the node if it returns true -- that indicates the node is misconfigured and the plan applier will reject any plan destined for it.

Note that @rgl's case is tricky in that there's no reason to even check the node's networking information if the job being scheduled doesn't use the network! I think we can cover that as well, but this brings me to my last point:

Limitations

I don't think there's just one bug here. netIdx.AddAllocs is another call the scheduler performs without checking the return value, while the plan applier does check it. This could lead to the exact same problem where the scheduler keeps trying to make an invalid plan and the plan applier just keeps rejecting it.

I think we'll have to tackle this class of bugs one at a time, and SetNode/ReservedPorts is a great start.

schmichael added a commit that referenced this issue Jan 12, 2022
Goal is to fix at least one of the causes that can cause a node to be
ineligible to receive work:
#9506 (comment)
schmichael added a commit that referenced this issue May 12, 2022
The shortlink /s/port-plan-failure is logged when a plan for a node is
rejected to help users debug and mitigate repeated `plan for node
rejected` failures.

The current link to #9506 is... less than useful. It is not clear to
users what steps they should take to either fix their cluster or
contribute to the issue.

While .../monitoring-nomad#progess isn't as comprehensive as it could
be, it's a much more gentle introduction to the class of bug than the
original issue.
@mmcquillan mmcquillan added this to the 1.3.x milestone May 17, 2022
@mmcquillan mmcquillan removed this from the 1.3.x milestone Aug 31, 2022
@tgross tgross self-assigned this Jan 24, 2023