Nomad crashes "runtime error: index out of range [0] with length 0" #16863

Closed
wusikijeronii opened this issue Apr 12, 2023 · 5 comments · Fixed by #16921
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/crash, theme/networking, type/bug

Comments

wusikijeronii commented Apr 12, 2023

Our company is evaluating Nomad; however, Nomad frequently crashes when a user submits a job.
The main error is:

panic: runtime error: index out of range [0] with length 0

goroutine 262 [running]:
github.com/hashicorp/nomad/client/allocrunner.newNetworkManager(0xc000b28100?, {0x34c1258, 0xc0002e69a0})
        github.com/hashicorp/nomad/client/allocrunner/network_manager_linux.go:93 +0x9e5
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).initRunnerHooks(0xc000b16000, 0xc000b17180)
        github.com/hashicorp/nomad/client/allocrunner/alloc_runner_hooks.go:125 +0x165
github.com/hashicorp/nomad/client/allocrunner.NewAllocRunner(0xc000965b90)
        github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:259 +0x99f
github.com/hashicorp/nomad/client.(*Client).addAlloc(0xc000c29500, 0xc000d2e800, {0x0, 0x0})
        github.com/hashicorp/nomad/client/client.go:2643 +0x6da
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc000c29500, 0xc0009c8c30)
        github.com/hashicorp/nomad/client/client.go:2451 +0x62c
github.com/hashicorp/nomad/client.(*Client).run(0xc000c29500)
        github.com/hashicorp/nomad/client/client.go:1863 +0x15f
created by github.com/hashicorp/nomad/client.NewClient
        github.com/hashicorp/nomad/client/client.go:597 +0x23ed

Job:

job "cored-job" {
  datacenters = ["dc1"]
  type = "service"
  group "cored-group" {
    count = 1
    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }
    
    volume "cored" {
      type      = "host"
      read_only = true
      source    = "cored"
    }
    
    restart {
      attempts = 2
      interval = "5m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }
    task "cored-job" {
      user = "cored"
      driver = "exec"
      
      volume_mount {
        volume      = "cored"
        destination = "/usr/local/cored-d/"
        read_only   = true
      }
      
      config {
        command = "/usr/local/cored-d/bin/linux/cored"
        args = []
      }

      resources {
        cpu = 200
        network {
          port "http" {
            static = 8090
          }
        }
      }
      service {
        name = "cored"
        tags = ["http"]
        port = "http"
        check {
          name     = "alive"
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad version

Nomad v1.5.3
BuildDate 2023-04-04T20:09:50Z
Revision 434f7a1745c6304d607562daa9a4a635def7153f

I also tested on Nomad v1.5.0

Operating system and Environment details

[root@srv1-prod ~]# cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.7"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.7"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:7:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.7
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.7

Full log:

[root@srv1-prod ~]# /usr/bin/nomad agent -config /etc/nomad.d
==> Loaded configuration from /etc/nomad.d/nomad.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.0.1.4:4646; RPC: 10.0.1.4:4647; Serf: 10.0.1.4:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true
               Version: 1.5.3

==> Nomad agent started! Log data will stream in below:

    2023-04-12T18:21:27.389+0300 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2023-04-12T18:21:27.402+0300 [INFO]  nomad.raft: initial configuration: index=18 servers="[{Suffrage:Voter ID:d38fd5ec-024d-b74d-b967-b7c3b552c287 Address:10.0.1.5:4647} {Suffrage:Voter ID:d96500bc-b791-ad11-cc8b-03304ba38a7b Address:10.0.1.4:4647} {Suffrage:Voter ID:8dcab666-8272-fac9-f168-5322d1a74df4 Address:10.0.1.8:4647}]"
    2023-04-12T18:21:27.402+0300 [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.4:4647 [Follower]" leader-address= leader-id=
    2023-04-12T18:21:27.403+0300 [INFO]  nomad: serf: EventMemberJoin: srv1-prod.global 10.0.1.4
    2023-04-12T18:21:27.403+0300 [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=["batch", "system", "sysbatch", "service", "_core"]
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad: started scheduling worker: id=b49607d4-c753-98cb-0908-9d8268f4dfc8 index=1 of=4
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad: started scheduling worker: id=a6f2b5f7-8a29-c892-5a3d-dff2df85bece index=2 of=4
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad: started scheduling worker: id=1b7f1423-8e30-75c1-c8d7-7c906f86fef1 index=3 of=4
    2023-04-12T18:21:27.403+0300 [INFO]  nomad: serf: Attempting re-join to previously known node: srv2-prod.global: 10.0.1.5:4648
    2023-04-12T18:21:27.403+0300 [DEBUG] worker: running: worker_id=b49607d4-c753-98cb-0908-9d8268f4dfc8
    2023-04-12T18:21:27.403+0300 [DEBUG] worker: running: worker_id=a6f2b5f7-8a29-c892-5a3d-dff2df85bece
    2023-04-12T18:21:27.403+0300 [DEBUG] worker: running: worker_id=f9508ba3-4bf2-b281-1c7a-ce13e5d5c839
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad: started scheduling worker: id=f9508ba3-4bf2-b281-1c7a-ce13e5d5c839 index=4 of=4
    2023-04-12T18:21:27.403+0300 [INFO]  nomad: started scheduling worker(s): num_workers=4 schedulers=["batch", "system", "sysbatch", "service", "_core"]
    2023-04-12T18:21:27.403+0300 [INFO]  nomad: adding server: server="srv1-prod.global (Addr: 10.0.1.4:4647) (DC: dc1)"
    2023-04-12T18:21:27.403+0300 [DEBUG] worker: running: worker_id=1b7f1423-8e30-75c1-c8d7-7c906f86fef1
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2023-04-12T18:21:27.403+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with: srv2-prod.global 10.0.1.5:4648
    2023-04-12T18:21:27.404+0300 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
    2023-04-12T18:21:27.404+0300 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
    2023-04-12T18:21:27.404+0300 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
    2023-04-12T18:21:27.404+0300 [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
    2023-04-12T18:21:27.405+0300 [INFO]  nomad: serf: EventMemberJoin: srv3-prod.global 10.0.1.8
    2023-04-12T18:21:27.405+0300 [INFO]  nomad: serf: EventMemberJoin: srv2-prod.global 10.0.1.5
    2023-04-12T18:21:27.405+0300 [DEBUG] nomad: serf: Refuting an older leave intent
    2023-04-12T18:21:27.405+0300 [INFO]  nomad: adding server: server="srv3-prod.global (Addr: 10.0.1.8:4647) (DC: dc1)"
    2023-04-12T18:21:27.405+0300 [INFO]  nomad: adding server: server="srv2-prod.global (Addr: 10.0.1.5:4647) (DC: dc1)"
    2023-04-12T18:21:27.405+0300 [INFO]  nomad: serf: Re-joined to previously known node: srv2-prod.global: 10.0.1.5:4648
    2023-04-12T18:21:27.407+0300 [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
    2023-04-12T18:21:27.407+0300 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2023-04-12T18:21:27.407+0300 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2023-04-12T18:21:27.407+0300 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2023-04-12T18:21:27.407+0300 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2023-04-12T18:21:27.407+0300 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2023-04-12T18:21:27.416+0300 [INFO]  client: using state directory: state_dir=/opt/nomad/client
    2023-04-12T18:21:27.417+0300 [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/alloc
    2023-04-12T18:21:27.417+0300 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2023-04-12T18:21:27.417+0300 [INFO]  client.cpuset.v1: initialized cpuset cgroup manager: parent=/nomad cpuset=0-3
    2023-04-12T18:21:27.460+0300 [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"]
    2023-04-12T18:21:27.462+0300 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2023-04-12T18:21:27.462+0300 [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2023-04-12T18:21:27.463+0300 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
    2023-04-12T18:21:27.469+0300 [INFO]  client.fingerprint_mgr.consul: consul agent is available
    2023-04-12T18:21:27.471+0300 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2023-04-12T18:21:27.471+0300 [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=1900
    2023-04-12T18:21:27.471+0300 [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=4
    2023-04-12T18:21:27.472+0300 [DEBUG] client.fingerprint_mgr.cpu: detected reservable cores: cpuset=[0, 1, 2, 3]
    2023-04-12T18:21:27.472+0300 [WARN]  nomad.raft: failed to get previous log: previous-index=77 last-index=65 error="log not found"
    2023-04-12T18:21:27.473+0300 [WARN]  client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
    2023-04-12T18:21:27.482+0300 [DEBUG] client.fingerprint_mgr.network: link speed detected: interface=enp1s0 mbits=100
    2023-04-12T18:21:27.483+0300 [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=enp1s0 IP=10.0.1.4
    2023-04-12T18:21:27.488+0300 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2023-04-12T18:21:27.488+0300 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-04-12T18:21:27.488+0300 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2023-04-12T18:21:27.513+0300 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=wlp2s0
    2023-04-12T18:21:27.513+0300 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/wlp2s0/speed device=wlp2s0
    2023-04-12T18:21:27.513+0300 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=wlp2s0 mbits=1000
    2023-04-12T18:21:27.513+0300 [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2023-04-12T18:21:27.516+0300 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2023-04-12T18:21:28.205+0300 [DEBUG] nomad: serf: messageJoinType: srv1-prod.global
    2023-04-12T18:21:28.213+0300 [DEBUG] nomad: serf: messageJoinType: srv1-prod.global
    2023-04-12T18:21:28.705+0300 [DEBUG] nomad: serf: messageJoinType: srv1-prod.global
    2023-04-12T18:21:28.712+0300 [DEBUG] nomad: serf: messageJoinType: srv1-prod.global
    2023-04-12T18:21:30.119+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:57938
    2023-04-12T18:21:33.518+0300 [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-04-12T18:21:33.518+0300 [DEBUG] client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2023-04-12T18:21:35.519+0300 [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-04-12T18:21:37.520+0300 [DEBUG] client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-04-12T18:21:37.520+0300 [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "consul", "cpu", "host", "network", "nomad", "signal", "storage"]
    2023-04-12T18:21:37.520+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2023-04-12T18:21:37.520+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2023-04-12T18:21:37.520+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2023-04-12T18:21:37.520+0300 [DEBUG] client.device_mgr: exiting since there are no device plugins
    2023-04-12T18:21:37.521+0300 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2023-04-12T18:21:37.521+0300 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2023-04-12T18:21:37.521+0300 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2023-04-12T18:21:37.521+0300 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=undetected description=disabled
    2023-04-12T18:21:37.521+0300 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
    2023-04-12T18:21:37.521+0300 [ERROR] client.driver_mgr.docker: failed to list pause containers: driver=docker error=<nil>
    2023-04-12T18:21:37.522+0300 [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2023-04-12T18:21:37.522+0300 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
    2023-04-12T18:21:37.523+0300 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=""
    2023-04-12T18:21:37.523+0300 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2023-04-12T18:21:37.523+0300 [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[exec] undetected:[raw_exec docker java qemu]]"
    2023-04-12T18:21:37.523+0300 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2023-04-12T18:21:37.525+0300 [DEBUG] client.server_mgr: new server list: new_servers=[0.0.0.0:4647, 10.0.1.4:4647, 10.0.1.4:4647, 10.0.1.5:4647, 10.0.1.8:4647] old_servers=[]
    2023-04-12T18:21:37.526+0300 [WARN]  client: found an alloc without any local state, skipping restore: alloc_id=b5410467-c64f-f58f-6319-0091dc5c9786
    2023-04-12T18:21:37.526+0300 [INFO]  client: started client: node_id=c71ac8be-2b05-3af7-ab2d-cb472da2d1e5
    2023-04-12T18:21:37.526+0300 [DEBUG] http: UI is enabled
    2023-04-12T18:21:37.526+0300 [DEBUG] http: UI is enabled
    2023-04-12T18:21:37.528+0300 [INFO]  agent.joiner: starting retry join: servers="10.0.1.4 10.0.1.5 10.0.1.8"
    2023-04-12T18:21:37.529+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  10.0.1.4:4648
    2023-04-12T18:21:37.530+0300 [DEBUG] nomad: memberlist: Stream connection from=10.0.1.4:49312
    2023-04-12T18:21:37.532+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  10.0.1.5:4648
    2023-04-12T18:21:37.532+0300 [DEBUG] client: updated allocations: index=72 total=1 pulled=1 filtered=0
    2023-04-12T18:21:37.533+0300 [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=0
    2023-04-12T18:21:37.534+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with:  10.0.1.8:4648
    2023-04-12T18:21:37.536+0300 [INFO]  agent.joiner: retry join completed: initial_servers=3 agent_mode=server
panic: runtime error: index out of range [0] with length 0

goroutine 262 [running]:
github.com/hashicorp/nomad/client/allocrunner.newNetworkManager(0xc000b28100?, {0x34c1258, 0xc0002e69a0})
        github.com/hashicorp/nomad/client/allocrunner/network_manager_linux.go:93 +0x9e5
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).initRunnerHooks(0xc000b16000, 0xc000b17180)
        github.com/hashicorp/nomad/client/allocrunner/alloc_runner_hooks.go:125 +0x165
github.com/hashicorp/nomad/client/allocrunner.NewAllocRunner(0xc000965b90)
        github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:259 +0x99f
github.com/hashicorp/nomad/client.(*Client).addAlloc(0xc000c29500, 0xc000d2e800, {0x0, 0x0})
        github.com/hashicorp/nomad/client/client.go:2643 +0x6da
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc000c29500, 0xc0009c8c30)
        github.com/hashicorp/nomad/client/client.go:2451 +0x62c
github.com/hashicorp/nomad/client.(*Client).run(0xc000c29500)
        github.com/hashicorp/nomad/client/client.go:1863 +0x15f
created by github.com/hashicorp/nomad/client.NewClient
        github.com/hashicorp/nomad/client/client.go:597 +0x23ed

I also tried to run this task on three different machines, using the same OL8 as well as Ubuntu. Same result.
Nomad config:

datacenter = "dc1"
data_dir = "/opt/nomad"

server {
  enabled = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["10.0.1.4", "10.0.1.5", "10.0.1.8"]
  }
}

consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
}

bind_addr = "0.0.0.0"
log_level = "DEBUG"

advertise {
  http = "10.0.1.4"
  rpc = "10.0.1.4"
  serf = "10.0.1.4"
}

client {
  enabled = true
  servers = ["10.0.1.4", "10.0.1.5", "10.0.1.8"]

  host_volume "cored" {
    path = "/usr/local/cored-d/"
    read_only = true
  }
}

leave_on_terminate = true
leave_on_interrupt = true

According to this issue, I also tried to specify bridge mode in the network configuration.
UPD: I was wrong. I purged all jobs, and the same issue occurs even when the cluster is idle.

wusikijeronii changed the title from 'Nomad crashes when submitting job "runtime error: index out of range [0] with length 0"' to 'Nomad crashes "runtime error: index out of range [0] with length 0"' on Apr 12, 2023
@iluminae

client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"

Bridge mode is not going to work for you since you don't have CNI plugins installed on your host (ref).

The network config you posted is deprecated; however, I cannot reproduce this even when I use that deprecated syntax (in resources). Peeking at the code, it seems like you would need a network defined at the group level to pass the check that is panicking. Is that the config you tried after reading #8875? Maybe share a minimally reproducing config, without e.g. volumes and services?

Obviously nothing from the user should be able to panic the server, so this is definitely still a bug.

wusikijeronii commented Apr 12, 2023

@iluminae thank you. I removed the data_dir (/opt/nomad/*) from all servers. It is strange, because on each Nomad UI page I saw no active jobs; maybe there was some temporary data. Then I restarted all servers, and now Nomad works without crashing. I should mention that my first attempt used dynamic ports instead of static ones. Now, after deploying, I see an error about the `localhost` CNI plugin not being available, but before I didn't get any errors: the deployment was successful, and the network ports simply behaved as static. Now that I've installed the CNI plugins, it works even with dynamic ports without any errors.
I think that when I ran the job with dynamic ports, purging via the UI didn't fully clear the job, and this "bad task" made Nomad crash.
My problem is now solved, but the bug is still there.
So, to reproduce the issue (a minimal sketch of such a job follows the steps):

  1. Remove CNI plugins.
  2. Run any task with dynamic ports in exec mode.
  3. Purge the created job in the Nomad UI.
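
For illustration only, a job along the lines of step 2 might look like the following minimal sketch (the job name, port label, and sleep command are placeholders, not the reporter's actual job):

job "repro" {
  datacenters = ["dc1"]

  group "repro" {
    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        # Deprecated task-level network block with a dynamic port:
        # no "static" value, so Nomad assigns one from its dynamic range.
        network {
          port "http" {}
        }
      }
    }
  }
}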

lgfa29 commented Apr 18, 2023

Hi @wusikijeronii 👋

Thanks for the report, and I'm sorry that you had such a poor first experience with Nomad; this is definitely not the experience we want to provide our users.

I tried to reproduce this problem using the steps you described but was not able to, so I suspect it may have been caused by a bad alloc in state that ended up preventing the client from starting, because restoring the alloc triggered the crash.

Even without a reproduction, the fix seemed clear enough from the details you provided, so I opened #16921 to fix this.

Until the fix is released, I would suggest making sure all network configuration is defined at the group level since, as @iluminae mentioned, task-level network configuration is deprecated. Having a group-level network prevents the zero-index panic, at least.
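
For reference, a minimal sketch of the reporter's job with the network moved from the task's resources block up to the group level (the volume, restart, and service stanzas are omitted for brevity; this is an illustration, not a tested configuration):

job "cored-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "cored-group" {
    count = 1

    # Group-level network replaces the deprecated resources.network block.
    network {
      port "http" {
        static = 8090
      }
    }

    task "cored-job" {
      driver = "exec"

      config {
        command = "/usr/local/cored-d/bin/linux/cored"
      }

      resources {
        cpu = 200
      }
    }
  }
}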

lgfa29 added the theme/networking, theme/crash, and stage/accepted labels on Apr 18, 2023
@wusikijeronii
Author

Hello.
Thank you for your response. I thought it was just a recommendation. Yes, I try to use the new syntax, but I may have forgotten about it in some quick tests, and those bad jobs were perhaps still stored in the data_dir folder. Why do old jobs and allocations, even after purging the job, still exist in this folder?

lgfa29 commented Apr 18, 2023

but I may have forgotten about it in some quick tests

I was able to find a way to reproduce this. You need to have a task-level network with bridge mode:

job "example" {
  group "sleep" {
    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/bash"
        args    = ["-c", "while true; do sleep 1; done"]
      }

      resources {
        network {
          mode = "bridge"
        }
      }
    }
  }
}

So as long as you avoid this, it should be fine. After #16921 is released, this will no longer cause a panic.
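
For comparison, a minimal sketch of the same job with the network block moved up to the group level, which avoids this panic (note that bridge mode still requires the CNI plugins to be installed on the client):

job "example" {
  group "sleep" {
    # Group-level network in bridge mode instead of the task-level block above.
    network {
      mode = "bridge"
    }

    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/bash"
        args    = ["-c", "while true; do sleep 1; done"]
      }
    }
  }
}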

Why do old jobs and allocations, even after purging the job, still exist in this folder?

Purging the job only deletes the job itself; related objects, like allocations, evals, and deployments, are still kept in state until the garbage collector runs.

You can manually trigger the GC with the nomad system gc command.
