allocrunner: prevent panic on network manager #16921

Merged 3 commits on Apr 18, 2023

lgfa29 commented Apr 18, 2023

Check the task group network length before trying to access the first element.
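
A minimal sketch of this kind of guard follows. The `Network` and `TaskGroup` types and the `groupHostname` helper are hypothetical stand-ins for illustration, not Nomad's actual structs or the code changed in this PR:

```go
package main

import "fmt"

// Network and TaskGroup are hypothetical, simplified stand-ins;
// they are NOT Nomad's real types.
type Network struct {
	Mode     string
	Hostname string
}

type TaskGroup struct {
	Networks []Network
}

// groupHostname returns the hostname of the first group-level network,
// or "" when the group defines no networks. Checking the length first
// is what keeps tg.Networks[0] from panicking on an empty slice.
func groupHostname(tg TaskGroup) string {
	if len(tg.Networks) == 0 {
		return ""
	}
	return tg.Networks[0].Hostname
}

func main() {
	// A group with no network block: the unguarded form would panic here.
	fmt.Printf("%q\n", groupHostname(TaskGroup{}))

	// A group with one bridge network and a hostname.
	withNet := TaskGroup{Networks: []Network{{Mode: "bridge", Hostname: "web"}}}
	fmt.Printf("%q\n", groupHostname(withNet))
}
```

Without the length check, indexing an empty slice is a runtime panic in Go ("index out of range [0] with length 0"), which matches the message in the linked issue.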

I haven't been able to reproduce the problem but the fix seems clear enough.

Closes #16863

lgfa29 requested review from tgross and jrasell April 18, 2023 01:53
lgfa29 added the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels Apr 18, 2023

jrasell left a comment

LGTM!

tgross left a comment

LGTM!

As far as reproduction goes, I have a strong suspicion that there are other cases of #16722 where the alloc runner and task runner aren't initializing state on all code paths, but I haven't yet figured out a way to validate that automatically without just sitting down and tracing every path 😀

lgfa29 commented Apr 18, 2023

Found a repro:

job "example" {
  group "sleep" {
    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/bash"
        args    = ["-c", "while true; do sleep 1; done"]
      }

      resources {
        network {
          mode = "bridge"
        }
      }
    }
  }
}

I missed this continue, which is taken when the network mode is host:

// netmode host should always work to support backwards compat
if taskNetMode == "host" {
	continue
}

So the panic happens when you have a bridge network defined at the task level with a driver that doesn't support MustInitiateNetwork.
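
The conditions can be modelled roughly as below. This is a simplified sketch under stated assumptions, not the actual alloc runner code: `Task`, `Group`, and `reachesUnguardedAccess` are hypothetical names, and `DriverInitiatesNetwork` merely stands in for the driver's MustInitiateNetwork capability.

```go
package main

import "fmt"

// Task and Group are hypothetical models, NOT Nomad's real types.
type Task struct {
	NetMode                string // effective task network mode, e.g. "host" or "bridge"
	DriverInitiatesNetwork bool   // stands in for the MustInitiateNetwork capability
}

type Group struct {
	NetworkCount int // number of group-level network blocks
	Tasks        []Task
}

// reachesUnguardedAccess reports whether any task would fall through to
// the code path that reads the first group-level network while the group
// has none. Host-mode tasks are skipped, mirroring the quoted continue.
func reachesUnguardedAccess(g Group) bool {
	for _, t := range g.Tasks {
		if t.NetMode == "host" {
			continue
		}
		if !t.DriverInitiatesNetwork && g.NetworkCount == 0 {
			return true
		}
	}
	return false
}

func main() {
	// Shape of the repro: task-level bridge network, no group-level
	// network, and a driver that does not initiate the network.
	repro := Group{Tasks: []Task{{NetMode: "bridge"}}}
	fmt.Println(reachesUnguardedAccess(repro))

	// Host mode never reaches the access, so it cannot panic.
	host := Group{Tasks: []Task{{NetMode: "host"}}}
	fmt.Println(reachesUnguardedAccess(host))
}
```

Under this model the repro job reports true and a host-mode task reports false, which is consistent with the earlier difficulty reproducing the panic using host networking.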

I added a test case for this scenario and updated the CHANGELOG to better describe the conditions under which the panic happens.

Successfully merging this pull request may close these issues.

Nomad crashes "runtime error: index out of range [0] with length 0"