Panic on nil pointer dereference using CNI with Calico #9647

Closed
radriaanse opened this issue Dec 16, 2020 · 4 comments · Fixed by #9648

@radriaanse

Nomad version

Nomad v1.0.0 (a480eed0815c54612856d9115a34bb1d1a773e8c+CHANGES)

Compiled for debugging with:

$ git checkout v1.0.0
$ git diff GNUmakefile
+               -gcflags="all=-N -l" \

Operating system and Environment details

LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 8.2.2004 (Core)
Release:        8.2.2004
Codename:       Core
Linux version 4.18.0-193.28.1.el8_2.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22 00:20:22 UTC 2020
Installed Packages
Name         : docker-ce
Epoch        : 3
Version      : 20.10.1
Release      : 3.el8
Architecture : x86_64
Size         : 115 M
Source       : docker-ce-20.10.1-3.el8.src.rpm
Repository   : @System
From repo    : docker-ce-stable

Name         : containerd.io
Version      : 1.4.3
Release      : 3.1.el8
Architecture : x86_64
Size         : 127 M
Source       : containerd.io-1.4.3-3.1.el8.src.rpm
Repository   : @System
From repo    : docker-ce-stable
# systemd-detect-virt
kvm

Issue

Hi!
I'm trying to use Calico via calico-cni with Nomad for a Docker container. On my test setup, when I run the job file below, Nomad panics with the nil pointer error shown in the logs.
Because I'm not sure whether this is caused by my setup or by the Nomad/Calico configuration, I did some debugging yesterday, but I can't quite make out where it goes wrong.

I recompiled Nomad at tag v1.0.0 without optimizations and stepped through the program a couple of times. I noticed that the CNI interfaces configuration is indeed empty, which causes this exception when Nomad tries to parse its IP from it:

# ~/go/bin/dlv exec /usr/bin/nomad-git -- agent -config /etc/nomad.d
(dlv) b setup github.com/hashicorp/nomad/client/allocrunner/networking_cni.go:122
Breakpoint setup set at 0x22033dc for github.com/hashicorp/nomad/client/allocrunner.(*cniNetworkConfigurator).Setup() github.com/hashicorp/nomad/client/allocrunner/networking_cni.go:122
(dlv) on setup locals
...
> [setup] github.com/hashicorp/nomad/client/allocrunner.(*cniNetworkConfigurator).Setup() github.com/hashicorp/nomad/client/allocrunner/networking_cni.go:122 (hits goroutine(178):1 total:1) (PC: 0x22033dc)
	firstError: error nil
	res: ("*github.com/containerd/go-cni.CNIResult")(0xc00004fc40)
	netStatus: ("*github.com/hashicorp/nomad/nomad/structs.AllocNetworkStatus")(0xc000c8cb70)
	iface: *github.com/containerd/go-cni.Config nil
	name: ""
   117:					}
   118:				}
   119:				return nil, ""
   120:			}(res)
   121:	
=> 122:			netStatus.InterfaceName = name

But I'm having a bit of a hard time figuring out why that's the case: the res that's created earlier does seem to contain my interfaces, but the Sandbox value is unset, and that matters for the selection logic starting at line 113. Sadly I can't find a way to debug this properly, since setting breakpoints there results in multiple RPC timeout errors, so that doesn't seem reliable.
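
For context, the selection logic around that breakpoint boils down to roughly the following; this is paraphrased from the debugger output and the go-cni types, not the verbatim Nomad source:

// Paraphrase of the logic near networking_cni.go:113-123; names are
// approximate. `cni` refers to "github.com/containerd/go-cni".
iface, name := func(r *cni.CNIResult) (*cni.Config, string) {
	for name, config := range r.Interfaces {
		// Only interfaces that carry a Sandbox (i.e. live inside the netns)
		// are considered; in the Calico result above, Sandbox is always "".
		if config.Sandbox != "" {
			return config, name
		}
	}
	return nil, "" // what the closure falls through to here
}(res)

netStatus.InterfaceName = name                     // "" is harmless
netStatus.Address = iface.IPConfigs[0].IP.String() // iface is nil -> panic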

Something else that's interesting: after Nomad crashes, systemd restarts it and Nomad continues executing the job. It eventually crashes again, but it leaves behind the init container (which it also doesn't clean up when gc runs later):

cat init_container.json | jq '.[].NetworkSettings'
{
  "Bridge": "",
  "SandboxID": "e803ace98470ea32d32bfb7372a5a93a13710a4350982aa2e63a113d7d5e1869",
  "HairpinMode": false,
  "LinkLocalIPv6Address": "",
  "LinkLocalIPv6PrefixLen": 0,
  "Ports": {},
  "SandboxKey": "/var/run/docker/netns/e803ace98470",
  "SecondaryIPAddresses": null,
  "SecondaryIPv6Addresses": null,
  "EndpointID": "",
  "Gateway": "",
  "GlobalIPv6Address": "",
  "GlobalIPv6PrefixLen": 0,
  "IPAddress": "",
  "IPPrefixLen": 0,
  "IPv6Gateway": "",
  "MacAddress": "",
  "Networks": {
    "none": {
      "IPAMConfig": null,
      "Links": null,
      "Aliases": null,
      "NetworkID": "fa228c98804fc0656a2b4538ededfd10d83a0f3c21f2125c8916164c7c569caa",
      "EndpointID": "7074fb6e342335a321a073da73db086611c3d8e58b72d2a7c37ce42548cc9a55",
      "Gateway": "",
      "IPAddress": "",
      "IPPrefixLen": 0,
      "IPv6Gateway": "",
      "GlobalIPv6Address": "",
      "GlobalIPv6PrefixLen": 0,
      "MacAddress": "",
      "DriverOpts": null
    }
  }
}

This also means the network seems to have been set up regardless of the crash:

41: caliafeeca5e-83@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever

And I'm able to ping the host from within the netns, so Calico seems to be functioning fine (IP is a placeholder):

[root@host ~]# ln -sfT /proc/846536/ns/net /var/run/netns/4bb3cec1e204
[root@host ~]# ip netns exec 4bb3cec1e204 ping -c 1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.113 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.113/0.113/0.113/0.000 ms

Any ideas on what might be causing this?
I tried multiple settings, but nothing really seems to make a difference; if there's anything else I can test, let me know!
And if it turns out to be something with my setup, sorry for the noise, but it might be good either way to make sure Nomad errors more gracefully should this happen :).

Job file (if appropriate)

job "jobwithcni01" {
    datacenters = ["dc01-dev"]

    group "group01" {

        network {
            mode = "cni/network01"
        }

        task "task01" {
            driver = "docker"

            config {
                image = "archlinux"
            }
        }
    }
}

Nomad Client logs

nomad-git[845888]: panic: runtime error: invalid memory address or nil pointer dereference
nomad-git[845888]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2203415]
nomad-git[845888]: goroutine 127 [running]:
nomad-git[845888]: github.com/hashicorp/nomad/client/allocrunner.(*cniNetworkConfigurator).Setup(0xc000a32410, 0x456a6c0, 0xc0001b4018, 0xc00000cf00, 0xc00084e450, 0x0, 0x0, 0x0)
nomad-git[845888]:         github.com/hashicorp/nomad/client/allocrunner/networking_cni.go:123 +0x835
nomad-git[845888]: github.com/hashicorp/nomad/client/allocrunner.(*networkHook).Prerun(0xc00047b020, 0x0, 0x0)
nomad-git[845888]:         github.com/hashicorp/nomad/client/allocrunner/network_hook.go:105 +0x4c7
nomad-git[845888]: github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).prerun(0xc0003baa00, 0x0, 0x0)
nomad-git[845888]:         github.com/hashicorp/nomad/client/allocrunner/alloc_runner_hooks.go:185 +0x5f2
nomad-git[845888]: github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run(0xc0003baa00)
nomad-git[845888]:         github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:304 +0x156
nomad-git[845888]: created by github.com/hashicorp/nomad/client.(*Client).addAlloc
nomad-git[845888]:         github.com/hashicorp/nomad/client/client.go:2435 +0xf2a

Gist for full logs: https://gist.github.com/radriaanse/d2f46f6bd92126906430aab078735474
Coredump on: nomad-oss-debug@hashicorp.com

@tgross self-assigned this Dec 16, 2020
@tgross
Member

tgross commented Dec 16, 2020

Hi @radriaanse! It looks from the stack trace like you're right: the NPE is getting hit in networking_cni.go#L113-L123. If none of the interfaces we get back have the Sandbox field set, we end up with a nil iface and hit the NPE. So we should patch that for sure... I'll need to dig a little to see if I can work out why that field isn't being set.
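
A minimal guard would look roughly like this; it's a sketch of the idea, not necessarily the exact shape of the patch:

// Sketch only: bail out instead of dereferencing a nil *cni.Config when the
// plugin returns no interface with a Sandbox set.
if iface == nil || name == "" {
	return nil, fmt.Errorf("failed to configure network: CNI plugin returned no interface with a sandbox")
}
netStatus.InterfaceName = name
if len(iface.IPConfigs) > 0 {
	netStatus.Address = iface.IPConfigs[0].IP.String()
}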

@tgross
Member

tgross commented Dec 16, 2020

@radriaanse I've got a patch up in #9648 which protects against the NPE. We'll land that in 1.0.1, which should ship very soon.

Something we'll need to follow up on: what's the set of circumstances that causes the plugin to give us no interfaces with a sandbox? The CNI spec suggests this is a "MUST" field for this use case.

sandbox (string): container/namespace-based environments should return the full filesystem path to the network namespace of that sandbox. Hypervisor/VM-based plugins should return an ID unique to the virtualized sandbox the interface was created in. This item must be provided for interfaces created or moved into a sandbox like a network namespace or a hypervisor/VM.

@radriaanse
Author

@tgross Nice, thanks!
I've tested it and Nomad isn't crashing anymore.

The CNI spec is indeed clear about that field. Something I noticed while checking just now is that this entry from calico-cni.log only includes the host's endpoint interface:

2020-12-16 11:40:24.610 [DEBUG][846019] plugin.go 161: Extracted identifiers EndpointIDs=&utils.WEPIdentifiers{Namespace:"default", WEPName:"", WorkloadEndpointIdentifiers:names.WorkloadEndpointIdentifiers{Node:"host.name.tld", Orchestrator:"cni", Endpoint:"eth0", Workload:"", Pod:"", ContainerID:"061cad7b-a10c-6fc6-0445-407edf453893"}}

If I'm understanding correctly, after Calico creates the veth interface, it should also append that interface to the response, this time with a sandbox field set (roughly like the sketch below).

I'll try to figure out why it's doing that, but that might not have anything to do with Nomad after all.
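
For illustration, a result matching the spec would be expected to list both the host-side veth and the container-side interface, with the sandbox path set on the latter. The values below are placeholders lifted from the output earlier in this issue, not actual plugin output:

{
  "cniVersion": "0.3.1",
  "interfaces": [
    { "name": "caliafeeca5e-83", "mac": "ee:ee:ee:ee:ee:ee" },
    { "name": "eth0", "sandbox": "/var/run/docker/netns/e803ace98470" }
  ],
  "ips": [
    { "version": "4", "interface": 1, "address": "192.0.2.10/32" }
  ]
}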

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022