
intermittent error "reserved port collision" when running new job #9506

Closed
habibiefaried opened this issue Dec 3, 2020 · 10 comments · Fixed by #16401

Comments


habibiefaried commented Dec 3, 2020

Nomad version

Nomad v1.0.0-beta3 (fcb32ef)

Operating system and Environment details

Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

Issue

Sometimes, when I update or run a new job, I get a max-plan-attempts failed error in the UI.

Reproduction steps

Just run a job file

Job file (if appropriate)

Taken from: https://www.hashicorp.com/blog/consul-connect-native-tasks-in-hashicorp-nomad-0-12

job "cn-demo" {
      datacenters = ["dc1"]

      group "generator" {
        network {
          port "api" {}
        }

        service {
          name = "uuid-api"
          port = "${NOMAD_PORT_api}"

          connect {
            native = true
          }
        }

        task "generate" {
          driver = "docker"

          config {
            image        = "hashicorpnomad/uuid-api:v3"
            network_mode = "host"
          }

          env {
            BIND = "0.0.0.0"
            PORT = "${NOMAD_PORT_api}"
          }
        }
      }

      group "frontend" {
        network {
          port "http" {
            static = 9800
          }
        }

        service {
          name = "uuid-fe"
          port = "9800"

          connect {
            native = true
          }
        }

        task "frontend" {
          driver = "docker"

          config {
            image        = "hashicorpnomad/uuid-fe:v3"
            network_mode = "host"
          }

          env {
            UPSTREAM = "uuid-api"
            BIND     = "0.0.0.0"
            PORT     = "9800"
          }
        }
      }
    }

Nomad Client logs (if appropriate)

2020-12-03T05:32:38.511+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=442.672µs
    2020-12-03T05:32:48.516+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.008414ms
    2020-12-03T05:32:58.534+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=900.343µs

Nomad Server logs (if appropriate)

2020-12-03T05:33:01.954+0100 [DEBUG] worker: submitted plan for evaluation: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823
    2020-12-03T05:33:01.954+0100 [DEBUG] worker: refreshing state: refresh_index=345 eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823
    2020-12-03T05:33:01.954+0100 [DEBUG] worker.service_sched: refresh forced: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823 job_id=cn-demo namespace=default
    2020-12-03T05:33:01.954+0100 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=098a520b-4cd9-78a3-6ba2-613d94b33823 job_id=cn-demo namespace=default results="Total changes: (place 2) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "generator": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "frontend": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
    2020-12-03T05:33:01.987+0100 [DEBUG] nomad: plan for node rejected: node_id=9fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5

Nomad Server Config

bind_addr = "0.0.0.0"
log_level = "DEBUG"
data_dir = "/etc/nomad"
name = "vmi483776.contaboserver.net"
server {
  enabled = true
  bootstrap_expect = 1
  raft_protocol = 3
}
advertise {
  http = "161.97.158.38"
  rpc = "161.97.158.38"
  serf = "161.97.158.38"
}
consul {
  address = "127.0.0.1:8500"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 2500
  server_stabilization_time = "10s"
  enable_redundancy_zones   = false
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}

Nomad Client config

bind_addr = "0.0.0.0"
log_level = "DEBUG"
data_dir = "/etc/nomad"
name = "vmi483778.contaboserver.net"
client {
  enabled = true
  reserved {
    reserved_ports = "22,53,68,111,8301,8500,8600,32768-60999"
  }
}
advertise {
  http = "161.97.158.40"
  rpc = "161.97.158.40"
  serf = "161.97.158.40"
}
consul {
  address = "127.0.0.1:8500"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}

Notes

By the way, you can visit this URL: http://161.97.158.38:4646/ui/jobs/cn-demo. It's my testbed server for this Nomad beta version. It's open and unauthenticated, so you can see it live.

Thanks


rgl commented Dec 3, 2020

I'm also observing this, but my job doesn't even expose any port, so I'm a bit confused by the error message:

job "damon" {
  datacenters = ["dc1"]
  group "damon" {
    count = 1
    task "damon" {
      driver = "raw_exec"
      artifact {
        source = "https://github.com/rgl/damon/releases/download/v0.1.2/damon.zip"
      }
      template {
        destination = "local/example.ps1"
        data = <<-EOD
...
        EOD
      }
      config {
        command = "local/damon.exe"
        args = [
          "PowerShell.exe",
          "-File",
          "local/example.ps1"
        ]
      }
      resources {
        cpu = 100
        memory = 350
      }
    }
  }
}

Also, is there a better way to gather these error messages besides monitoring the client logs from the Nomad UI (e.g. http://localhost:4646/ui/clients/c9d4d1c5-51cc-7276-0949-638d0072061e/monitor?level=debug)?

It would be pretty nice to see these errors from e.g. http://localhost:4646/ui/jobs/damon/evaluations. For my particular problem, it only shows something like max-plan-attempts | failed | False.

I'm not sure if it makes any difference, but I'm running Nomad on Windows Server 2019.

@habibiefaried (Author)

Yes, I agree. This kind of error should be visible in the UI for the failed job.

By the way, I have a weird workaround for this bug: just restart all Nomad servers and eventually the job will run successfully again. I've done it on my cluster multiple times.

However, I'm not sure if I'll have to do this periodically -__-'

@habibiefaried (Author)

By the way, I have a solid workaround for this. These are the steps I took:

  1. Restart all Nomad servers
  2. Stop the pending task
  3. Re-run the same task

Well, at least I have this workaround; hoping it will be useful for fellow users as well.
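
For reference, the steps above map roughly to the commands below. This is only a sketch: it assumes the agents are managed by systemd and that the job file is the cn-demo example from this issue (both are assumptions).

# 1. Restart every Nomad server agent (run on each server host; assumes systemd)
sudo systemctl restart nomad

# 2. Stop the pending job (cn-demo is the example job from this issue)
nomad job stop cn-demo

# 3. Re-run the same job from its job file (the file name is an assumption)
nomad job run cn-demo.nomad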


tgross commented Dec 8, 2020

Hi @habibiefaried, it looks like you're running into another example of #9492. The port reservation is only honored for static ports, not dynamic ports.

We intend to fix this (see #8186). There is a community-provided PR #8478, but it doesn't seem like it's going to make it over the finish line, so we're going to have to circle back to it.

@tgross added the theme/networking and stage/accepted labels on Dec 8, 2020

exFalso commented Dec 22, 2020

I'm getting this as well on 0.12, but the workaround doesn't work. The error seems persistent; I cannot run any job on the cluster at the moment, they all result in reserved port collision. I've tried restarting servers and clients, and stopping/restarting all jobs.


exFalso commented Dec 22, 2020

OK, what worked:

  1. Stop all servers
  2. Manually wipe the server data folders
  3. Restart the servers
  4. Restart the clients

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
schmichael added a commit that referenced this issue Oct 31, 2021
As we have continued to see reports of #9506 we need to elevate this log
line as it is the only way to detect when plans are being *erroneously*
rejected.

Users who see this log line repeatedly should drain and restart the node
in the log line. This seems to work around the issue.

Please post any details on #9506!
@schmichael (Member)

We've seen this come up more and are actively investigating it. One workaround we've seen work every time is restarting the client agent (node) specified in the log message. It seems the node's state on one or more servers can get corrupted, and restarting it fixes the corruption.

Fixing this is a top priority of the Nomad team! Sorry for the delay, it's a tricky one.
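
For anyone scripting that workaround, a minimal sketch follows. The node ID is the one from the log excerpt earlier in this issue, the drain step follows the advice in the commit messages above, and the agent being managed by systemd is an assumption.

# Identify the node from the "plan for node rejected" log line, then drain it
nomad node drain -enable -yes 9fa84370-c713-b914-d329-f6485951cddc

# Restart the Nomad client agent on that host (assumes systemd)
sudo systemctl restart nomad

# Re-enable scheduling on the node once the agent is healthy again
nomad node drain -disable 9fa84370-c713-b914-d329-f6485951cddc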


badalex commented Dec 30, 2021

For what it's worth, in my case I tracked this down to a bad Puppet config: 80,8081 turned into reserved_ports: 808081, which blocked all allocations on affected nodes.
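
A quick way to spot this kind of malformed value is to grep the client configuration for it; the /etc/nomad.d path is only an assumption about where the config lives.

# Valid values are comma-separated ports and ranges, e.g. "80,8081" or "22,32768-60999"
grep -rn "reserved_ports" /etc/nomad.d/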

schmichael added a commit that referenced this issue Jan 5, 2022
#9506 (comment)
makes for an easy repro

schmichael commented Jan 5, 2022

Thanks @badalex! While unfortunately that isn't the root cause for everyone affected, it does make for a nice easy reproduction!

As a Go test: https://github.com/hashicorp/nomad/compare/b-invalid-reserved-ports#diff-b501e4f2415a7da37e504a9ab69b251588909e41ffd809d0b88fe3c1b87d6751R2345-R2422

Outputs:

job_endpoint_test.go:2411:
                Error Trace:    job_endpoint_test.go:2411
                Error:          "map[]" should have 1 item(s), but has 0
                Test:           TestJobEndpoint_Register_BadNodeNetwork
                Messages:       &{_struct:false ID:f5d53f73-3d33-1c27-6e2d-f5bd7a5da1ea Namespace:default Priority:50 Type:service TriggeredBy:job-register JobID:mock-service-9d725069-2884-307e-53da-9d46aa40edcc JobModifyIndex:7 NodeID: NodeModifyIndex:0 DeploymentID: Status:failed StatusDescription:maximum attempts reached (5) Wait:0s WaitUntil:0001-01-01 00:00:00 +0000 UTC NextEval: PreviousEval: BlockedEval:caed821f-5c85-7bc3-8ffc-76eaf513ae05 FailedTGAllocs:map[] ClassEligibility:map[] QuotaLimitReached: EscapedComputedClass:false AnnotatePlan:false QueuedAllocations:map[web:1] LeaderACL: SnapshotIndex:7 CreateIndex:7 ModifyIndex:9 CreateTime:1641343586422081104 ModifyTime:1641343586428807197}

Note the StatusDescription:maximum attempts reached (5) and empty FailedTGAllocs. If the scheduler is unable to satisfy the network requirements for a job the eval should have:

FailedTGAllocs.<task_group>.DimensionExhausted = "network: dynamic port selection failed": 1
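
To check this on a live cluster rather than in the test, the evaluation's FailedTGAllocs can be inspected from the CLI. A sketch, assuming jq is installed; the eval ID is the one from the log excerpt earlier in this issue:

# Dump the evaluation as JSON and pull out the per-task-group failure metrics
nomad eval status -json 098a520b-4cd9-78a3-6ba2-613d94b33823 | jq '.FailedTGAllocs'

# On an affected eval this is empty (or null) even though the status is
# "maximum attempts reached (5)"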

Dev Agent

You can also reproduce this with a dev agent by using the following reserved.hcl agent configuration and running any job (such as produced by nomad init -short).

cat > reserved.hcl << EOF
client {
  reserved {
    reserved_ports = "808081"
  }
}
EOF

nomad agent -dev -config reserved.hcl

# Now run a job in another terminal or the UI

Fixes

I intend to add 2 fixes for this:

  1. Validate reserved_ports on client agent startup and refuse to start if the reservation is malformed. This is what Nomad client agents are intended to do if misconfigured, as we don't want to simply ignore malformed inputs and assume you're ok with the defaults instead!
  2. In the scheduler, check the return value of netIdx.SetNode(node) and skip the node if it returns true -- that indicates the node is misconfigured and the plan applier will reject any plan destined for it.

Note that @rgl's case is tricky in that there's no reason to even check the node's networking information if the job being scheduled doesn't use the network! I think we can cover that as well, but this brings me to my last point:

Limitations

I don't think there's just one bug here. netIdx.AddAllocs is another call the scheduler performs without checking the return value, while the plan applier does check it. This could lead to the exact same problem where the scheduler keeps trying to make an invalid plan and the plan applier just keeps rejecting it.

I think we'll have to tackle this class of bugs one at a time, and SetNode/ReservedPorts is a great start.

schmichael added a commit that referenced this issue Jan 12, 2022
Goal is to fix at least one of the causes that can cause a node to be
ineligible to receive work:
#9506 (comment)
schmichael added a commit that referenced this issue May 12, 2022
The shortlink /s/port-plan-failure is logged when a plan for a node is
rejected to help users debug and mitigate repeated `plan for node
rejected` failures.

The current link to #9506 is... less than useful. It is not clear to
users what steps they should take to either fix their cluster or
contribute to the issue.

While .../monitoring-nomad#progess isn't as comprehensive as it could
be, it's a much more gentle introduction to the class of bug than the
original issue.
@mmcquillan mmcquillan added this to the 1.3.x milestone May 17, 2022
@mmcquillan mmcquillan removed this from the 1.3.x milestone Aug 31, 2022
@tgross tgross self-assigned this Jan 24, 2023