Allocations should not overlap in execution #10440

Closed
prestonp opened this issue Apr 23, 2021 · 4 comments · Fixed by #10446
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/needs-discussion, type/bug

Comments

@prestonp

Nomad version

Nomad v1.1.0-dev (19d58ce)

Operating system and Environment details

Linux devbox 5.8.0-23-generic #24~20.04.1-Ubuntu SMP Sat Oct 10 04:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

I have a service deployed on Nomad that occasionally runs into bind: address already in use. Since I am using a static port, this suggests that multiple tasks are running concurrently.

My expectation is that nomad stop would mean the tasks have fully exited, so that subsequent nomad runs don't contend for shared system resources like ports.

The service I'm working with handles graceful shutdown, so sometimes there is a bit of a delay before it fully exits. Regardless, it seems that nomad stop doesn't block on tasks exiting, and replacement tasks can be scheduled right away.
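
A possible stopgap on the client side is to poll the job status and only resubmit once the old allocation has finished. A rough, unverified sketch, assuming nomad job status still lists the draining allocation as "running":

nomad stop bug
# wait until the old allocation is no longer reported as running
while nomad job status bug | grep -q running; do sleep 1; done
nomad run job.hcl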

Reproduction steps

Submit job

nomad run job.hcl

Confirm functionality

curl localhost:6900 # hello world

Then restart

nomad stop bug && nomad run job.hcl

Expected Result

I expected the second allocation to start after the first one exits.

Actual Result

Both allocations overlap in execution. The second alloc starts immediately even though the first allocation is still shutting down.

Checking the second allocation reveals a conflict:

$ nomad logs -stderr ef9696c9
2021/04/23 18:11:05 listening on addr 127.0.0.1:6900
2021/04/23 18:11:05 listen tcp 127.0.0.1:6900: bind: address already in use

Job

A simple Go HTTP server that handles graceful but slow shutdown:

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	var addr = flag.String("addr", "127.0.0.1:8080", "address")
	flag.Parse()

	// Serve HTTP in the background so main can block waiting for a signal.
	go func() {
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, "hello world")
		})
		log.Println("listening on addr", *addr)
		log.Fatal(http.ListenAndServe(*addr, nil))
	}()

	// Simulate a slow graceful shutdown: after SIGINT/SIGTERM, keep the
	// listener open for another 10 seconds before exiting.
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
	sig := <-c
	log.Println("got signal", sig)
	time.Sleep(10 * time.Second)
	log.Println("exiting now")
	os.Exit(0)
}

Build it

go build -o /tmp/server server.go
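
(Optional) To see the slow shutdown by hand before involving Nomad, start the server, hit it with curl, then send it SIGTERM; it should log the signal and exit roughly 10 seconds later:

/tmp/server -addr 127.0.0.1:6900 &
curl localhost:6900   # hello world
kill -TERM %1         # logs "got signal terminated", then "exiting now" ~10s later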

Set up job.hcl

job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {
    network {
      port "foo" { static = 6900 }
    }
    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      config {
        command = "/tmp/server"
        args = ["-addr", "${NOMAD_IP_foo}:${NOMAD_HOST_PORT_foo}"]
      }
    }
  }
}
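
Not required to reproduce the overlap, but if the full 10-second graceful shutdown should be allowed to finish, the task presumably also needs a kill_timeout longer than Nomad's default of 5 seconds, for example:

  kill_timeout = "15s"
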
@notnoop added the stage/accepted and stage/needs-discussion labels Apr 23, 2021
@notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 23, 2021
@notnoop moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Apr 23, 2021
@notnoop
Contributor

notnoop commented Apr 23, 2021

Hi @prestonp! Thanks for reporting this issue and the detailed notes. I agree that this behavior isn't ideal. Looking at the code, it doesn't seem like a simple change I can make quickly. I will raise it with the team and consider it for our backlog.

@notnoop
Contributor

notnoop commented Apr 25, 2021

Upon looking into it again, I opened a draft PR: #10446. As it touches the scheduler logic, we will need to examine and test it more rigorously. If you can give it a try, we'd appreciate your feedback.

@notnoop self-assigned this Apr 25, 2021
@prestonp
Author

Thank you @notnoop! With my repro, the new evaluation now fails with a clear message; this is great!

$ nomad stop bug && nomad run fixed.hcl
==> Monitoring evaluation "490de6ef"
    Evaluation triggered by job "bug"
==> Monitoring evaluation "490de6ef"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "490de6ef" finished with status "complete"
==> Monitoring evaluation "b71f39e7"
    Evaluation triggered by job "bug"
==> Monitoring evaluation "b71f39e7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b71f39e7" finished with status "complete" but failed to place all allocations:
    Task Group "bug" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision foo=6900" exhausted on 1 nodes
    Evaluation "d11a2537" waiting for additional capacity to place remainder

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Sep 13, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Jan 12, 2023