Nomad segfaults when trying to preempt a docker-based job with lower priority #11342

Closed
aneutron opened this issue Oct 18, 2021 · 3 comments · Fixed by #11346

@aneutron

Nomad version

Nomad v1.1.6 (b83d623fb5ff475d5e40df21e9e7a61834071078)

Operating system and Environment details

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
# cat /proc/cpuinfo | grep EPYC | uniq
model name      : AMD EPYC 7763 64-Core Processor
# cat /proc/meminfo | grep -i Memtot
MemTotal:       527815668 kB
# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)

Issue

Hi,

First of all, thanks for the amazing product that's Nomad. I'm currently in the process of PoC-ing Nomad for a use case at our company, and it involves running jobs that use GPUs.

As it is a PoC, I'm only running Nomad in dev mode, using the default scheduler with preemption enabled for all job types.

My test scenario was the following:

  • Create a job (with a single Docker task) that requires 4 GPUs (4 allocations of 1 GPU each)
  • Create a job (with a single Docker task) that requires 2 GPUs (1 allocation of 2 GPUs) and has a higher priority (delta = 20)
  • Observe that Nomad vacates 2 of the 4 allocations of Job 1 and schedules an allocation of Job 2.

Instead, as soon as I tried to run Job 2, the server/client segfaulted (due to a panic).

I successfully reproduced the error at least 5 times, using different GPU requirement configurations but with the same general idea (multiple single-GPU jobs, one multi-GPU job).

The jobs schedule fine on their own, but once I schedule the higher-priority job while the lower-priority job is already deployed, it crashes.

Reproduction steps

  • The server / client configuration:
datacenter = "dev"

log_file = "nomad.log"

client {
    enabled = true
    options {
        docker.cleanup.image = false
    }
}

server {
  default_scheduler_config {
    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true # New in Nomad 1.2
    }
  }
}


plugin "nvidia-gpu" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
  • The command line to run Nomad in dev mode:
    nomad agent -dev -bind 0.0.0.0 -plugin-dir=./plugins -config=./server-config.hcl -log-level=WARN

Then the steps to reproduce are as follows:

  • Enable preemption on all types
  • Deploy Job 1
  • Deploy Job 2

Expected Result

  • Job 1 (or at least some of its allocations) is vacated
  • Job 2 is deployed

Actual Result

The server/client segfaults.

Job file (if appropriate)

This is the file for Job 1:

job "jupyterlab" {
  datacenters = ["dev"]
  group "jupyter" {
    count = 4
    network {
      port "jupyter" {
        to = 8091
      }
    }
    task "jupyter-docker" {
      driver = "docker"
      config {
        # A custom cuda+jupyter image but anything will do
        image = "cuda-centos8-jupyter-pytorch"
        ports = ["jupyter"]
      }
      resources {
        cpu    = 500
        memory = 2048
         device "nvidia/gpu" {
          count = 1
         }
      }
    }
  }
}

The second job file is identical except for the job name and the counts (both the group count and the GPU device count).

Nomad Server logs (if appropriate)

The server crashed with the following stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x70 pc=0x1b09597]

goroutine 374 [running]:
github.com/hashicorp/nomad/scheduler.(*JobAntiAffinityIterator).Next(0xc003c1b5e0, 0x0)
        github.com/hashicorp/nomad/scheduler/rank.go:581 +0x1f7
github.com/hashicorp/nomad/scheduler.(*NodeReschedulingPenaltyIterator).Next(0xc001d1a870, 0x0)
        github.com/hashicorp/nomad/scheduler/rank.go:627 +0x38
github.com/hashicorp/nomad/scheduler.(*NodeAffinityIterator).Next(0xc003c1b630, 0x203000)
        github.com/hashicorp/nomad/scheduler/rank.go:699 +0x49
github.com/hashicorp/nomad/scheduler.(*SpreadIterator).Next(0xc0012a6180, 0xc0036eb240)
        github.com/hashicorp/nomad/scheduler/spread.go:112 +0x49
github.com/hashicorp/nomad/scheduler.(*PreemptionScoringIterator).Next(0xc00392b7c0, 0x60)
        github.com/hashicorp/nomad/scheduler/rank.go:794 +0x38
github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc00392b7e0, 0x265f6a0)
        github.com/hashicorp/nomad/scheduler/rank.go:758 +0x38
github.com/hashicorp/nomad/scheduler.(*LimitIterator).nextOption(0xc0012a6360, 0x265c940)
        github.com/hashicorp/nomad/scheduler/select.go:60 +0x34
github.com/hashicorp/nomad/scheduler.(*LimitIterator).Next(0xc0012a6360, 0xc0012a6e01)
        github.com/hashicorp/nomad/scheduler/select.go:39 +0x3d
github.com/hashicorp/nomad/scheduler.(*MaxScoreIterator).Next(0xc001d1ac30, 0xc0036e2600)
        github.com/hashicorp/nomad/scheduler/select.go:102 +0x4d
github.com/hashicorp/nomad/scheduler.(*GenericStack).Select(0xc0027a4000, 0xc0036e2600, 0xc0027b2ec0, 0x0)
        github.com/hashicorp/nomad/scheduler/stack.go:174 +0x7c6
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).selectNextOption(0xc000ad25a0, 0xc0036e2600, 0xc0027b2ec0, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:789 +0xe5
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc000ad25a0, 0x451dea8, 0x0, 0x0, 0xc002ff67c0, 0x1, 0x1, 0x2, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:552 +0x4fa
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc000ad25a0, 0xc000ac0000, 0xc003c1b4f0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:430 +0x1239
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc000ad25a0, 0x0, 0x0, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:257 +0x36f
github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc0036ebe00, 0xc0036ebdf0, 0x6, 0x30ec198)
        github.com/hashicorp/nomad/scheduler/util.go:275 +0x42
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc000ad25a0, 0xc0010fc480, 0x30ec198, 0xc001021b30)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:156 +0x2b7
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc000931180, 0xc00120fb90, 0xc0010fc480, 0xc00211a6c0, 0x24, 0x0, 0x0)
        github.com/hashicorp/nomad/nomad/worker.go:268 +0x42c
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc000931180)
        github.com/hashicorp/nomad/nomad/worker.go:129 +0x286
created by github.com/hashicorp/nomad/nomad.NewWorker
        github.com/hashicorp/nomad/nomad/worker.go:81 +0x152

Nomad Client logs (if appropriate)

(See above)

@notnoop self-assigned this Oct 18, 2021
@notnoop (Contributor) commented Oct 19, 2021

Hi @aneutron! Thanks for letting us know. I was able to reproduce the issue and have a fix. I will PR the fix soon.

@aneutron (Author)

Hey @notnoop! Thanks a lot for the swift action on your part. Looking forward to building it and continuing to test Nomad. Cheers!

notnoop pushed a commit that referenced this issue Oct 20, 2021
Fix a bug where the scheduler may panic when preemption is enabled. The conditions are a bit complicated:
a job with higher priority schedules multiple allocations that preempt multiple other allocations on the same node, due to port/network/device assignments.

The cause of the bug is incidental mutation of internal cached data. `RankedNode` computes and caches proposed allocations in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L42-L53. But the scheduler then mutates that list to remove preemptable allocs in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L293-L294, and `RemoveAllocs` mutates the cached slice and fills its tail with `nil`s, triggering a nil-pointer dereference.

I fixed the issue by avoiding the mutation in `RemoveAllocs`; the micro-optimization there doesn't seem necessary.

Fixes #11342
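
For illustration, here is a minimal Go sketch of the mutation pattern described in the commit message above. It is not Nomad's actual code: `Alloc`, `removeAllocsMutating`, and `removeAllocsCopy` are hypothetical stand-ins for the real structures and for `RemoveAllocs`, used only to show how niling out the tail of a shared, cached slice leaves `nil` entries that a later iteration can dereference, and how a copy-based removal avoids it.

package main

import "fmt"

// Alloc stands in for a scheduler allocation; only the field needed
// for the illustration is included. This is NOT Nomad's real struct.
type Alloc struct {
	ID string
}

// removeAllocsMutating mimics the problematic pattern: it compacts the
// slice in place and nils out the tail, so any other holder of the same
// backing array (e.g. a cached copy) now sees nil entries.
func removeAllocsMutating(allocs []*Alloc, remove map[string]bool) []*Alloc {
	n := 0
	for _, a := range allocs {
		if !remove[a.ID] {
			allocs[n] = a
			n++
		}
	}
	// The "micro-optimization": clear the tail so the GC can reclaim it.
	for i := n; i < len(allocs); i++ {
		allocs[i] = nil
	}
	return allocs[:n]
}

// removeAllocsCopy mimics the fix: build a fresh slice and leave the
// cached one untouched.
func removeAllocsCopy(allocs []*Alloc, remove map[string]bool) []*Alloc {
	out := make([]*Alloc, 0, len(allocs))
	for _, a := range allocs {
		if !remove[a.ID] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// cached stands in for the proposed allocations cached on a node.
	cached := []*Alloc{{ID: "a"}, {ID: "b"}, {ID: "c"}}

	// Filtering out preempted allocs with the mutating variant writes
	// nils into the shared backing array.
	_ = removeAllocsMutating(cached, map[string]bool{"b": true, "c": true})

	// Code that still iterates the cached slice now hits nil entries and
	// would panic on any field access such as a.ID.
	for _, a := range cached {
		if a == nil {
			fmt.Println("nil entry left behind in the cached slice")
			continue
		}
		fmt.Println("alloc", a.ID)
	}

	// With the copy-based variant the cached slice stays intact.
	cached2 := []*Alloc{{ID: "a"}, {ID: "b"}, {ID: "c"}}
	filtered := removeAllocsCopy(cached2, map[string]bool{"b": true})
	fmt.Println("copy-based removal: cached slice intact, filtered length =", len(filtered))
}
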
lgfa29 pushed a commit that referenced this issue Nov 15, 2021 (same commit message as above)
lgfa29 pushed a commit that referenced this issue Nov 15, 2021 (same commit message as above)
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022