Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad crashes when submitting job "runtime error: index out of range [0] with length 0" #8875

Closed
spuder opened this issue Sep 11, 2020 · 5 comments · Fixed by #8882
Closed

Comments

@spuder
Copy link
Contributor

spuder commented Sep 11, 2020

Our company is evaluating using nomad, however nomad frequently crashes when a user submits a job.
The main error is:

Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.418Z [ERROR] http: http: panic serving 127.0.0.1:35224: runtime error: index out of range [0] with length 0
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: goroutine 157 [running]:
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.(*conn).serve.func1(0xc000e92640)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:1800 +0x139
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: panic(0x2f79900, 0xc001172c00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         runtime/panic.go:975 +0x3e3
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: 

Version

0.12.4

Nomad Job

job "pulse-predictions-service-mm-nomad-and-roam" {
  region = "sand"
  datacenters = ["sb"]
  type = "service"
  spread {
    attribute = "${node.datacenter}"
    weight = 100
  }
  constraint {
    attribute = "${meta.mx_app}"
    value = "common"
  }
  group "group" {
    count = 1    
    service { 
      name = "foobar"
      tags = [ "foobar" ]
      port = "http"
      check {
        name = "foobar-health"
        type = "http"
        path = "/"
        interval = "10s"
        timeout = "2s"
        expose   = true
      }
      connect {
        sidecar_service {}
      }
    }
    
    task "app" {
      driver = "docker"
      config {
        image = "gitlab.example.com/redacted"
        volumes = []
      }
      env {
        APP_NAME = "foobar"
        APP_HOST = "${attr.unique.network.ip-address}"
      }
      resources {
        cpu    = 250
        memory = 512
        network {
            mbits = 10
        }
      }
    } # task
  } # group
  meta {
    CI_PROJECT_URL = "https://gitlab.example.com/redacted"
    CI_COMMIT_REF_NAME = "redacted"
    EPHEMERAL = "true"
    CI_PIPELINE_IID = "14"
    CI_COMMIT_SHA   = "redacted"
  }
} # job

Server Logs

Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.594Z [INFO]  nomad: serf: EventMemberJoin: sb-sand-nomad1.sand 10.47.69.194
Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.594Z [INFO]  nomad: serf: EventMemberJoin: sb-sand-nomad3.sand 10.47.69.196
Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.594Z [INFO]  nomad: serf: Re-joined to previously known node: sb-sand-nomad1.sand: 10.47.69.194:4648
Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.594Z [INFO]  nomad: adding server: server="sb-sand-nomad1.sand (Addr: 10.47.69.194:4647) (DC: sb)"
Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.594Z [INFO]  nomad: adding server: server="sb-sand-nomad3.sand (Addr: 10.47.69.196:4647) (DC: sb)"
Sep 11 17:28:33 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:33.694Z [INFO]  nomad.vault: successfully renewed token: next_renewal=35h59m59.999993572s
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.418Z [ERROR] http: http: panic serving 127.0.0.1:35224: runtime error: index out of range [0] with length 0
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: goroutine 157 [running]:
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.(*conn).serve.func1(0xc000e92640)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:1800 +0x139
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: panic(0x2f79900, 0xc001172c00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         runtime/panic.go:975 +0x3e3
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/nomad.exposePathForCheck(0xc001127000, 0xc0009b24d0, 0xc001165b00, 0xc00112d440, 0x0, 0x0)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/nomad/job_endpoint_hook_expose_check.go:213 +0x59e
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/nomad.jobExposeCheckHook.Mutate(0xc00112d440, 0xc00155b060, 0x411858, 0x60, 0x2965820, 0x0, 0x0)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/nomad/job_endpoint_hook_expose_check.go:27 +0x8e
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/nomad.(*Job).admissionMutators(0xc0004bab40, 0xc00112d440, 0x7, 0xc0009def01, 0xc00155b368, 0xc001325140, 0xc00062eb90, 0x2d62640)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/nomad/job_endpoint_hooks.go:61 +0xd3
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/nomad.(*Job).Validate(0xc0004bab40, 0xc000af2bc0, 0xc000af2c00, 0x0, 0x0)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/nomad/job_endpoint.go:537 +0x382
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: reflect.Value.call(0xc000234240, 0xc000010ab0, 0x13, 0x31f31a9, 0x4, 0xc00155b698, 0x3, 0x3, 0xc000af2c00, 0xc00155b6f0, ...)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         reflect/value.go:460 +0x8ab
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: reflect.Value.Call(0xc000234240, 0xc000010ab0, 0x13, 0xc00155b698, 0x3, 0x3, 0x8, 0x16, 0x16)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         reflect/value.go:321 +0xb4
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/rpc.(*service).call(0xc00040fe40, 0xc0004ba960, 0xc0003c5300, 0x0, 0xc00017f400, 0xc000c38040, 0x2f85b60, 0xc000af2bc0, 0x16, 0x2ccd1c0, ...)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/rpc/server.go:377 +0x17f
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/rpc.(*Server).ServeRequest(0xc0004ba960, 0x38c6e00, 0xc000af2b80, 0xc0012242c8, 0xc00006ca80)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/rpc/server.go:498 +0x23e
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc000503b00, 0x3203c7c, 0xc, 0x2f85b60, 0xc000af2b00, 0x2ccd1c0, 0xc000af2b40, 0xc001126f01, 0xc000af2b40)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/nomad/server.go:1558 +0xb9
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/command/agent.(*Agent).RPC(0xc0002fc7e0, 0x3203c7c, 0xc, 0x2f85b60, 0xc000af2b00, 0x2ccd1c0, 0xc000af2b40, 0xc000bae958, 0x65cb1d)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/command/agent/agent.go:977 +0x7d
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).ValidateJobRequest(0xc0007904b0, 0x38ab540, 0xc001326d20, 0xc001126f00, 0x4d8096, 0x5f5bb3d5, 0x18d83e15, 0xbf041b82a426f)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/command/agent/job_endpoint.go:190 +0x1de
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1(0x38ab540, 0xc001326d20, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/hashicorp/nomad/command/agent/http.go:448 +0x176
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.HandlerFunc.ServeHTTP(0xc0007f5720, 0x38ab540, 0xc001326d20, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:2041 +0x44
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.(*ServeMux).ServeHTTP(0xc0007cfdc0, 0x38ab540, 0xc001326d20, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:2416 +0x1a5
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x38b3bc0, 0xc0004cd180, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         github.com/NYTimes/gziphandler@v1.0.1/gzip.go:277 +0x1e6
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.HandlerFunc.ServeHTTP(0xc000c880c0, 0x38b3bc0, 0xc0004cd180, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:2041 +0x44
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.serverHandler.ServeHTTP(0xc0007a7180, 0x38b3bc0, 0xc0004cd180, 0xc001126f00)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:2836 +0xa3
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: net/http.(*conn).serve(0xc000e92640, 0x38c6000, 0xc000af2100)
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:1924 +0x86c
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]: created by net/http.(*Server).Serve
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:         net/http/server.go:2962 +0x35c
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.908Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: EOF"
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.908Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: EOF"
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.930Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: dial tcp 10.47.69.194:4647: connect: connection refused"
Sep 11 17:28:53 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:53.930Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.011Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: dial tcp 10.47.69.194:4647: connect: connection refused"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.011Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.011Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.333Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: dial tcp 10.47.69.194:4647: connect: connection refused"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.333Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Sep 11 17:28:54 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:54.333Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Sep 11 17:28:55 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:55.275Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=10.47.69.194:4647
Sep 11 17:28:55 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:55.275Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.47.69.195:4647 [Candidate]" term=34763
Sep 11 17:28:55 sb-sand-nomad2 nomad[6188]:     2020-09-11T17:28:55.294Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 10.47.69.194:4647 10.47.69.194:4647}" error="dial tcp 10.47.69.194:4647: connect: connection refused"
</p>
</details>

@tgross
Copy link
Member

tgross commented Sep 11, 2020

Hi @spuder. Sorry to hear about that. It looks like there's a null pointer bubbling up from the new expose hook. We'll get that fixed ASAP, but in the meantime if you omit the expose = true field that should avoid the crash.

cc @nickethier

@shoenig
Copy link
Member

shoenig commented Sep 11, 2020

We should fix the null pointer (adding validation), but the job really just needs a bridge mode network configuration in the group stanza when using sidecar proxies.

@spuder
Copy link
Contributor Author

spuder commented Sep 11, 2020

Great thank you. Adding the missing networking in bridged mode and removing expose = true did fix the crash.

Is there an existing issue for the expose = true bug? If so I'll link and close this out.

@shoenig
Copy link
Member

shoenig commented Sep 11, 2020

There isn't an issue yet, so let's leave this open to track the fix. Turns out there has been validation for the missing network block since forever, it's just that it exists in a validation hook that runs after this expose mutator hook 🤕

@notnoop notnoop added this to the 0.12.5 milestone Sep 14, 2020
@shoenig shoenig self-assigned this Sep 14, 2020
shoenig added a commit that referenced this issue Sep 14, 2020
In #7800, Nomad would automatically generate a port label for service
checks making use of the expose feature, if the port was not already
set. This change assumed the group network would be correctly defined
(as is checked in a validation hook later). If the group network was
not definied, a panic would occur on job submisssion. This change
re-uses the group network validation helper to make sure the network
is correctly definied before adding ports to it.

Fixes #8875
shoenig added a commit that referenced this issue Sep 14, 2020
In #7800, Nomad would automatically generate a port label for service
checks making use of the expose feature, if the port was not already
set. This change assumed the group network would be correctly defined
(as is checked in a validation hook later). If the group network was
not definied, a panic would occur on job submisssion. This change
re-uses the group network validation helper to make sure the network
is correctly definied before adding ports to it.

Fixes #8875
fredrikhgrelland pushed a commit to fredrikhgrelland/nomad that referenced this issue Sep 28, 2020
In hashicorp#7800, Nomad would automatically generate a port label for service
checks making use of the expose feature, if the port was not already
set. This change assumed the group network would be correctly defined
(as is checked in a validation hook later). If the group network was
not definied, a panic would occur on job submisssion. This change
re-uses the group network validation helper to make sure the network
is correctly definied before adding ports to it.

Fixes hashicorp#8875
@github-actions
Copy link

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants