Allocations should not overlap in execution #10440

Closed
prestonp opened this issue Apr 23, 2021 · 4 comments · Fixed by #10446
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/needs-discussion, type/bug

Comments

@prestonp

Nomad version

Nomad v1.1.0-dev (19d58ce)

Operating system and Environment details

Linux devbox 5.8.0-23-generic #24~20.04.1-Ubuntu SMP Sat Oct 10 04:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

I have a service deployed on Nomad that occasionally runs into bind: address already in use. Since I am using a static port, this suggests that multiple tasks are running concurrently.

My expectation is that nomad stop would mean the tasks have fully exited, so that subsequent nomad runs don't contend for shared system resources like ports.

The service I'm working with handles graceful shutdown, so sometimes there is a bit of a delay before it fully exits. Regardless, it seems that nomad stop doesn't block on tasks exiting, and replacement tasks can be scheduled right away.
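
A possible stopgap on the client side is to poll the job status and only resubmit once the old allocation has finished. A rough, unverified sketch, assuming nomad job status still lists the draining allocation as "running":

nomad stop bug
# wait until the old allocation is no longer reported as running
while nomad job status bug | grep -q running; do sleep 1; done
nomad run job.hcl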

Reproduction steps

Submit job

nomad run job.hcl

Confirm functionality

curl localhost:6900 # hello world

Then restart

nomad stop bug && nomad run job.hcl

Expected Result

I expected the second allocation to start after the first one exits.

Actual Result

Both allocations overlap in execution. The second alloc starts immediately even though the first allocation is still shutting down.

Checking the second allocation reveals a conflict:

$ nomad logs -stderr ef9696c9
2021/04/23 18:11:05 listening on addr 127.0.0.1:6900
2021/04/23 18:11:05 listen tcp 127.0.0.1:6900: bind: address already in use

Job

A simple Go HTTP server that handles graceful but slow shutdown:

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	var addr = flag.String("addr", "127.0.0.1:8080", "address")
	flag.Parse()

	// Serve HTTP in the background so main can block waiting for a signal.
	go func() {
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, "hello world")
		})
		log.Println("listening on addr", *addr)
		log.Fatal(http.ListenAndServe(*addr, nil))
	}()

	// Simulate a slow graceful shutdown: after SIGINT/SIGTERM, keep the
	// listener open for another 10 seconds before exiting.
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
	sig := <-c
	log.Println("got signal", sig)
	time.Sleep(10 * time.Second)
	log.Println("exiting now")
	os.Exit(0)
}

Build it

go build -o /tmp/server server.go
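
(Optional) To see the slow shutdown by hand before involving Nomad, start the server, hit it with curl, then send it SIGTERM; it should log the signal and exit roughly 10 seconds later:

/tmp/server -addr 127.0.0.1:6900 &
curl localhost:6900   # hello world
kill -TERM %1         # logs "got signal terminated", then "exiting now" ~10s later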

Set up job.hcl

job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {
    network {
      port "foo" { static = 6900 }
    }
    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      config {
        command = "/tmp/server"
        args = ["-addr", "${NOMAD_IP_foo}:${NOMAD_HOST_PORT_foo}"]
      }
    }
  }
}
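
Not required to reproduce the overlap, but if the full 10-second graceful shutdown should be allowed to finish, the task presumably also needs a kill_timeout longer than Nomad's default of 5 seconds, for example:

  kill_timeout = "15s"
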
@notnoop added the stage/accepted and stage/needs-discussion labels Apr 23, 2021
@notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 23, 2021
@notnoop moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Apr 23, 2021
@notnoop
Contributor

notnoop commented Apr 23, 2021

Hi @prestonp! Thanks for reporting this issue and the detailed notes. I agree that this behavior isn't ideal. Looking at the code, it doesn't seem like a simple change I can make quickly. I will raise it with the team and consider it for our backlog.

@notnoop
Contributor

notnoop commented Apr 25, 2021

Upon looking into it again, I opened a draft PR: #10446. As it touches the scheduler logic, we will need to examine and test it more rigorously. If you can give it a try, we'd appreciate your feedback.

@notnoop self-assigned this Apr 25, 2021
@prestonp
Author

Thank you @notnoop! With my repro, the new evaluation now fails with a clear message; this is great!

$ nomad stop bug && nomad run fixed.hcl
==> Monitoring evaluation "490de6ef"
    Evaluation triggered by job "bug"
==> Monitoring evaluation "490de6ef"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "490de6ef" finished with status "complete"
==> Monitoring evaluation "b71f39e7"
    Evaluation triggered by job "bug"
==> Monitoring evaluation "b71f39e7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b71f39e7" finished with status "complete" but failed to place all allocations:
    Task Group "bug" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision foo=6900" exhausted on 1 nodes
    Evaluation "d11a2537" waiting for additional capacity to place remainder

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Sep 13, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Jan 12, 2023