network hook fails after client restart w/ non-Docker driver #9750

Closed
shishir-a412ed opened this issue Jan 7, 2021 · 9 comments · Fixed by #9757
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/exec theme/driver/java theme/networking type/bug

@shishir-a412ed
Contributor

Nomad version

Nomad v0.11.4+ent

Operating system and Environment details

Ubuntu 18.04.5 LTS (Bionic Beaver)

Issue

We see this error when the containerd-driver restarts and tries to reattach to the existing allocation.
Nomad is unable to reattach to the allocation and reports this error:

Recent Events:
Time                       Type           Description
2021-01-07T10:19:06-08:00  Killing        Sent interrupt. Waiting 5s before force killing
2021-01-07T10:19:05-08:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to create network for alloc: open /var/run/netns/e728e152-7304-c42c-b7e7-21bf63fc44c4: operation not permitted
2021-01-07T10:16:37-08:00  Started        Task started by client
2021-01-07T10:16:36-08:00  Task Setup     Building Task Directory
2021-01-07T10:16:36-08:00  Received       Task received by client

Nomad then starts a new allocation.

Reproduction steps

  1. Launch a nomad job using containerd-driver
  2. Wait for the job to get into the running state.
  3. SSH into the nomad client node, and restart nomad + containerd-driver:
     systemctl restart nomad
  4. nomad job status <job> should show a new allocation and a previously failed allocation.
  5. nomad alloc status <failed_alloc_id> should show the above error message.

Logs

Jan 07 18:58:31 ip-10-102-98-114 nomad[15339]:     2021-01-07T18:58:31.277Z [INFO]  client.driver_mgr.containerd-driver: HELLO HELLO: Recover Task: driver=containerd-driver @module=containerd-driver timestamp=2021-01-07T18:58:31.277Z
Jan 07 18:58:31 ip-10-102-98-114 nomad[15339]:     2021-01-07T18:58:31.277Z [INFO]  client: started client: node_id=2bec1ea8-0a91-76a5-2241-ce62e083d2b3
Jan 07 18:58:31 ip-10-102-98-114 nomad[15339]:     2021-01-07T18:58:31.279Z [ERROR] client.alloc_runner: prerun failed: alloc_id=6d449ba6-9190-741e-7ce3-b37c85640151 error="pre-run hook "network" failed: failed to create network for alloc: open /var/run/netns/6d449ba6-9190-741e-7ce3-b37c85640151: operation not permitted"
Jan 07 18:58:31 ip-10-102-98-114 nomad[15339]: client.alloc_runner: prerun failed: alloc_id=6d449ba6-9190-741e-7ce3-b37c85640151 error="pre-run hook "network" failed: failed to create network for alloc: open /var/run/netns/6d449ba6-9190-741e-7ce3-b37c85640151: operation not permitted"
@shishir-a412ed
Contributor Author

@tgross @notnoop Any ideas about this issue?

@tgross
Member

tgross commented Jan 7, 2021

Hi @shishir-a412ed!

The CreateNetwork call can either delegate to the task driver or use the Linux default. If you're using the containerd plugin I can see at https://github.com/Roblox/nomad-driver-containerd, it looks like it's being deferred to the Linux default. We define that in network_manager_linux.go#L92-L96 and it looks like the only place you can be getting the error is in the nsutil.NewNS call.

So I think that error is bubbling up either from os.MkdirAll(NetNSRunDir, 0755) or possibly os.Create(nsPath) as those are the two likely unwrapped errors I see in that function. Does your Nomad client have the appropriate permissions to the netns locations?

@shishir-a412ed
Contributor Author

shishir-a412ed commented Jan 8, 2021

@tgross Thank you for the quick response. Your analysis is spot on!

I added some fmt.Println() statements in the nomad codebase to validate.

Jan 07 23:34:09 ip-10-102-98-114 nomad[19401]: HELLO: ERROR IN CREATING NETWORK NAMESPACE FILE: /var/run/netns/e2839014-1bf8-1993-bee7-9b225bf5465f
Jan 07 23:34:09 ip-10-102-98-114 nomad[19401]: HELLO HELLO ERROR: open /var/run/netns/e2839014-1bf8-1993-bee7-9b225bf5465f: operation not permitted

The error is indeed coming from os.Create(nsPath)

The nomad client has no problem creating the nsPath file the first time it launches the job.
The file is created with 0444 permissions (not sure if that's related).

When nomad + containerd-driver restarts, it makes an os.Create(nsPath) call on the existing file, which fails with an operation not permitted error.

Does your Nomad client have the appropriate permissions to the netns locations?

This definitely seems like the reason: when the nomad client + containerd-driver restarts, the nomad client doesn't have the right permissions.

I checked the process, and it's running as root. Not sure why it is not able to re-create the /var/run/netns/e2839014-1bf8-1993-bee7-9b225bf5465f file?

Also,

If you're using the containerd plugin I can see at https://github.com/Roblox/nomad-driver-containerd, it looks like it's being deferred to the Linux default. 

Where do you see in the containerd-driver it's being deferred to the Linux default?

@tgross
Member

tgross commented Jan 8, 2021

Ok, so good news and bad news. The good news is that I was able to reproduce the behavior with the exec driver on the current HEAD so it's not a problem specific to the version of Nomad you're running or the containerd driver. The bad news is that I was able to reproduce the behavior with the exec driver. 😀

Jobspec:

job "execjob" {
  datacenters = ["dc1"]

  group "execgroup" {

    network {
      mode = "bridge"
      port "www" {
        to = "8000"
      }
    }

    task "exectask" {
      driver = "exec"

      config {
        command = "python"
        args    = ["-m", "SimpleHTTPServer"]
      }
    }
  }
}

Run the job, which works fine. Take a look at the permissions for that netns:

$ sudo ls -lah /var/run/netns/01f3d62d-82a0-9ab2-a09e-85d7f6e7a7e1
-r--r--r-- 1 root root 0 Jan  8 13:35 /var/run/netns/01f3d62d-82a0-9ab2-a09e-85d7f6e7a7e1

Restart the Nomad client, as root:

2021-01-08T13:36:36.424Z [ERROR] client.alloc_runner: prerun failed: alloc_id=01f3d62d-82a0-9ab2-a09e-85d7f6e7a7e1 error="pre-run hook "network" failed: failed to create network for alloc: open /var/run/netns/01f3d62d-82a0-9ab2-a09e-85d7f6e7a7e1: operation not permitted"

And this ends up causing a restart of the task.

Where do you see in the containerd-driver it's being deferred to the Linux default?

I might be missing it, but there's no implementation of CreateNetwork in the driver. So in that case the network_hook code falls back to the default in network_manager_linux. That's pretty typical; of the HashiCorp drivers only docker implements it, and that's why we see the same behavior in both containerd and exec drivers.
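The fallback described above can be pictured with a small sketch. This is an illustration of the idea only: the `Driver` and `NetworkCreator` interfaces, the two stand-in drivers, and `defaultCreateNetwork` are all hypothetical, and Nomad's actual selection logic is not shown in this thread and may work differently (for example, through declared driver capabilities rather than a type assertion).

```go
package main

import "fmt"

// Driver is a hypothetical stand-in for a task driver plugin.
type Driver interface {
	Name() string
}

// NetworkCreator is a hypothetical optional interface for drivers that
// manage the allocation network themselves.
type NetworkCreator interface {
	CreateNetwork(allocID string) (string, error)
}

// dockerLike implements CreateNetwork, so it handles networking itself.
type dockerLike struct{}

func (dockerLike) Name() string { return "docker-like" }
func (dockerLike) CreateNetwork(allocID string) (string, error) {
	return "driver-managed:" + allocID, nil
}

// execLike has no CreateNetwork, like the exec and containerd cases here.
type execLike struct{}

func (execLike) Name() string { return "exec-like" }

// defaultCreateNetwork stands in for the platform default path, which in
// this issue creates a file under /var/run/netns.
func defaultCreateNetwork(allocID string) (string, error) {
	return "/var/run/netns/" + allocID, nil
}

// createNetwork delegates to the driver when it can, else uses the default.
func createNetwork(d Driver, allocID string) (string, error) {
	if nc, ok := d.(NetworkCreator); ok {
		return nc.CreateNetwork(allocID)
	}
	return defaultCreateNetwork(allocID)
}

func main() {
	for _, d := range []Driver{dockerLike{}, execLike{}} {
		network, err := createNetwork(d, "example-alloc-id")
		fmt.Println(d.Name(), network, err)
	}
}
```

Under this model, every driver that lacks its own implementation shares the default path, which is consistent with the same failure showing up for both containerd and exec.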

I'm going to rename this bug, and we'll dig in further to figure out what's going on here.

Edit: interesting, it looks like way back in 0.10.0 I'd tried to avoid recreating network namespaces: e17901d. I suspect either there's a bug there we missed or a regression since then.

@tgross tgross changed the title from 'Issue: pre-run hook "network" failed: failed to create network for alloc' to 'network hook fails after client restart w/ non-Docker driver' Jan 8, 2021
@tgross tgross added theme/driver/exec stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug theme/driver/java and removed stage/waiting-reply labels Jan 8, 2021
@tgross
Member

tgross commented Jan 8, 2021

Ok I went thru #6315 and it looks like I introduced a fix for Docker (see e17901d#diff-13af1c2034f8a861c687bbeea321da745d2490f0110857c3a805fb385bcf0804R50-R59) but that the fix was missing what we needed for the default path.

I think when we create the file, if we get an error, we should then check for the existence of the file (which means it was previously created), and return nil, true, nil from CreateNetwork if it already exists. I'm not totally sure I understand what that file is doing other than acting as a sentinel value though (ah I see from the error we get that it's the namespace file)... I'll push up a PR with the fix and then ping one of my colleagues who knows that area of the code a bit better.

@tgross
Member

tgross commented Jan 8, 2021

I've opened #9757 with a patch for this.

@shishir-a412ed
Contributor Author

Hi @tgross! Thank you for taking a look and the quick response! This looks great. Looking forward to #9757.

@tgross
Member

tgross commented Jan 11, 2021

That PR is merged and the fix will ship in 1.0.2.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2022