No stats in alloc-status in nomad 0.8.1 (no path to node) #4203

Closed
commarla opened this issue Apr 24, 2018 · 22 comments · Fixed by #4222 or #4317


@commarla

commarla commented Apr 24, 2018

Nomad version

Nomad v0.8.1 (46aa11b) on both server and client

Operating system and Environment details

Debian Jessie.
The cluster was upgraded from 0.7.1.

Issue

Got a "Couldn't retrieve stats: Unexpected response code: 404 (No path to node)" error when running the nomad alloc-status command.

Got an error with nomad node-status: "error fetching node stats: actual resource usage not present".

Reproduction steps

Start a job, then try to get its alloc-status.
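A minimal sketch of the commands involved, assuming a hypothetical job file example.nomad; <alloc-id> stands for one of its allocation IDs:

$ nomad run example.nomad
$ nomad alloc-status <alloc-id>
Couldn't retrieve stats: Unexpected response code: 404 (No path to node)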

Nomad Server logs (if appropriate)

Apr 24 16:45:05 admin-10-32-152-38 nomad[812]: 2018/04/24 16:45:05.380360 [ERR] http: Request /v1/client/allocation/89248058-b626-b376-c36b-595836816243/stats, error: No path to node
@dadgar
Contributor

dadgar commented Apr 24, 2018

@commarla Is the node that was running that allocation still alive? Have all nodes and servers been upgraded? Do you see this for all node-status calls?

@commarla
Author

@dadgar Yes, the node is still alive. I can reproduce this on any 0.8.1 node.
We also have a dev cluster on 0.8.0 and it does not have this issue.

We have some nodes on 0.7.x with a different node class (20 nodes on 0.8.1, 20 on 0.7.1, and 10 on 0.7.0).

On a 0.7 node I get a 500:

error fetching node stats: Unexpected response code: 500 (Node does not support RPC; requires 0.8 or later)

error fetching node stats: actual resource usage not present

@dadgar
Contributor

dadgar commented Apr 25, 2018

@commarla Do you have any reproduction steps? I cannot reproduce it.

@commarla
Author

@dadgar I get this error all the time, so I am not sure what else I can do to help you reproduce it. The cluster was upgraded from 0.7.1, and before that from 0.6.3, 0.7.0-beta1, and 0.7.1-rc1.

@commarla
Author

@nanoz do you have anything to add?

@nanoz
Contributor

nanoz commented Apr 26, 2018

Even though I can GET alloc and node information, the stats endpoints are the only ones not working properly, probably because of the RPC proxying feature.

Allocation API calls

GET /v1/client/allocation/2488cf92-19ef-2954-78a7-2217f9372cc1/stats HTTP/1.1
Host: nomad.service.consul:4646
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connection: close

HTTP/1.1 404 Not Found
Access-Control-Allow-Origin: *
Content-Encoding: gzip
Content-Type: text/plain; charset=utf-8
Date: Thu, 26 Apr 2018 09:07:59 GMT
Vary: Accept-Encoding
Vary: Origin
Content-Length: 39
Connection: Close

GET /v1/allocation/2488cf92-19ef-2954-78a7-2217f9372cc1 HTTP/1.1
Host: nomad.service.consul:4646
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connection: close

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Encoding: gzip
Content-Type: application/json
Date: Thu, 26 Apr 2018 09:07:57 GMT
Vary: Accept-Encoding
X-Nomad-Index: 10765450
X-Nomad-Knownleader: true
X-Nomad-Lastcontact: 0
Content-Length: 3062
Connection: Close

Node API calls

GET /v1/client/stats?node_id=4c7b3e14-f073-5a4b-3f7f-5533551628a7 HTTP/1.1
Host: nomad.service.consul:4646
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connection: close

HTTP/1.1 404 Not Found
Access-Control-Allow-Origin: *
Content-Encoding: gzip
Content-Type: text/plain; charset=utf-8
Date: Thu, 26 Apr 2018 09:08:18 GMT
Vary: Accept-Encoding
Vary: Origin
Content-Length: 39
Connection: Close

GET /v1/node/4c7b3e14-f073-5a4b-3f7f-5533551628a7 HTTP/1.1
Host: nomad.service.consul:4646
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connection: close

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Encoding: gzip
Content-Type: application/json
Date: Thu, 26 Apr 2018 09:08:18 GMT
Vary: Accept-Encoding
X-Nomad-Index: 11048032
X-Nomad-Knownleader: true
X-Nomad-Lastcontact: 0
Content-Length: 1609
Connection: Close

The HTTP call works fine if we curl it from the nomad server to the client node.
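The curl invocation behind the trace below was roughly the following (the client address and allocation ID are taken from the trace itself):

$ curl -v http://10.32.24.149:4646/v1/client/allocation/2488cf92-19ef-2954-78a7-2217f9372cc1/stats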

GET /v1/client/allocation/2488cf92-19ef-2954-78a7-2217f9372cc1/stats HTTP/1.1
User-Agent: curl/7.38.0
Host: 10.32.24.149:4646
Accept: */*

HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
Vary: Origin
Date: Thu, 26 Apr 2018 09:30:28 GMT
Content-Length: 883

My understanding is that Nomad 0.8 now proxies those calls over the RPC port. I think the problem is that the client node is not listening on anything other than 4646.

$ netstat -lnpt | grep nomad
tcp6       0      0 :::4646                 :::*                    LISTEN      1905/nomad

Here is our client node configuration

client {
  enabled          = true

  node_class       = "app"

  options {
    "driver.raw_exec.enable" = "1"
    "docker.auth.config" = "/usr/local/etc/docker/auth.json"
  }

  reserved {
    reserved_ports = "1-10,22"
  }
}

bind_addr          = "0.0.0.0"

advertise {
  http = "1.2.3.4:4646"
  rpc = "1.2.3.4:4647"
  serf = "1.2.3.4:4648"
}

data_dir           = "/data/nomad"

enable_debug       = "false"
enable_syslog      = "false"
log_level          = "INFO"

leave_on_interrupt = true
leave_on_terminate = true

telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

Isn't bind_addr enough to enable RPC? I also wonder why we are forced to go through the server's RPC proxying; isn't the "old way" of retrieving stats, by requesting the client's HTTP API directly, still the default?
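As a rough diagnostic sketch (not an official procedure; 1.2.3.4 is the placeholder advertise address from the config above), one way to check from a server node whether the client's advertised ports are reachable at all:

# can the server open a TCP connection to the client's advertised RPC port?
$ nc -zv 1.2.3.4 4647
# and does the client's HTTP API answer directly, bypassing the server-side proxying?
$ curl http://1.2.3.4:4646/v1/client/stats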

@dadgar
Contributor

dadgar commented Apr 26, 2018

A few questions:

  1. Is there any material difference in the networking setup between the 0.8.0 cluster and the 0.8.1 cluster?
  2. What happens if you restart one of the client nodes and then try requesting stats from it?
  3. How did you upgrade the cluster from 0.7.1 to 0.8: in place or on new machines?
  4. If you stand up a new cluster, do you see this issue, or does it only happen on this cluster?
  5. Would you be willing to run a custom build to gather more info?

@dadgar
Contributor

dadgar commented Apr 26, 2018

@commarla @nanoz Pretty certain I have fixed the issue! The change will be going into 0.8.2!

@dadgar
Contributor

dadgar commented Apr 27, 2018

@commarla @nanoz 0.8.2 is out! Please report back if it doesn't fix your issue!

@nanoz
Contributor

nanoz commented Apr 27, 2018

Thanks @dadgar, we will test this within the next two weeks.

@grin0c

grin0c commented May 7, 2018

Nomad v0.8.3 (c85483d)

New installation.

nomad alloc status 23b5d55d

Error: Couldn't retrieve stats: Unexpected response code: 404 (No path to node)

Logs from the node on which the command is run:

2018/05/07 19:09:11.743547 [WARN] nomad.client_rpc: node "cca70d66-8872-60b4-d6d4-086daf191838" exists in node connection map without any connection
2018/05/07 19:09:11.743624 [WARN] nomad.client_rpc: node "cca70d66-8872-60b4-d6d4-086daf191838" exists in node connection map without any connection
2018/05/07 19:09:11.744922 [ERR] http: Request /v1/client/allocation/23b5d55d-f878-b872-ab6d-6c86ed6b85ba/stats, error: No path to node

Logs from the remote node on which the container is running:

2018/05/07 19:09:11.740687 [WARN] nomad.client_rpc: node "cca70d66-8872-60b4-d6d4-086daf191838" exists in node connection map without any connection

If you run nomad alloc status 23b5d55d on the same node where the container is running, there is no error.

curl http://<remote_ip>:4646/v1/client/allocation/23b5d55d-f878-b872-ab6d-6c86ed6b85ba/stats works.
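For contrast, a sketch of the same request sent to the agent on one of the other nodes (<other_node_ip> is hypothetical); this path goes through the server-side RPC proxying and, in our case, is the one that fails:

$ curl -i http://<other_node_ip>:4646/v1/client/allocation/23b5d55d-f878-b872-ab6d-6c86ed6b85ba/stats
HTTP/1.1 404 Not Found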

@yellowmegaman

yellowmegaman commented May 10, 2018

@grin0c I confirm, same here.

@grin0c

grin0c commented May 10, 2018

@yellowmegaman I confirm

@dadgar dadgar reopened this May 10, 2018
@dadgar
Contributor

dadgar commented May 10, 2018

@grin0c @yellowmegaman Any steps to follow to reproduce? I see how to reproduce the log message (it is a noisy log that I will fix), but I am asking more about the 404. Can you all share your client/server configs, how long it takes for this to happen, and what you do to the nodes/servers/etc. for this to occur?

@grin0c

grin0c commented May 11, 2018

@dadgar this is my config:

data_dir = "/var/lib/nomad"
region = "russia"
datacenter = "dc1"

advertise {
  http = "172.20.1.1"
  rpc = "172.20.1.1"
  serf = "172.20.1.1"
}

server {
  enabled = true
  bootstrap_expect = 3
  retry_join = ["172.20.2.1","172.20.6.1"]
  encrypt = "<key>"
  heartbeat_grace = "60s"
}

client {
  enabled = true
  node_class = "prod"
  meta {
    "virtual.node.name" = "node1"
  }
}

vault {
  enabled = true
  address = "http://127.0.0.1:8200"
  create_from_role = "nomad-cluster"
}

On the other nodes, only the IP is changed.

What I do to trigger this:
Assume the task group is running on node 2 and the alloc ID is 47140375. On another node (1 or 3), run nomad alloc status 47140375 and you get the message Couldn't retrieve stats: Unexpected response code: 404 (no path to node). If you run nomad alloc status 47140375 on node 2, the message does not appear.

[screenshot: nomad alloc status output]

@dadgar
Contributor

dadgar commented May 11, 2018

@grin0c Are you using Consul as well? If you are, can you share the output of: curl http://127.0.0.1:8500/v1/agent/services

@grin0c

grin0c commented May 11, 2018

@dadgar

{
  "_nomad-client-e665t472fei3l2dlymnwjlxm7nlxdmzz": {
    "ID": "_nomad-client-e665t472fei3l2dlymnwjlxm7nlxdmzz",
    "Service": "nomad-client",
    "Tags": [
      "http"
    ],
    "Address": "172.20.1.1",
    "Meta": null,
    "Port": 4646,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-server-4qobc5b4jjrx4pmrw3jcdgnplk3azji7": {
    "ID": "_nomad-server-4qobc5b4jjrx4pmrw3jcdgnplk3azji7",
    "Service": "nomad",
    "Tags": [
      "serf"
    ],
    "Address": "172.20.1.1",
    "Meta": null,
    "Port": 4648,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-server-jvrfxal23gfncqjkft5otnmu7v6fsiqn": {
    "ID": "_nomad-server-jvrfxal23gfncqjkft5otnmu7v6fsiqn",
    "Service": "nomad",
    "Tags": [
      "http"
    ],
    "Address": "172.20.1.1",
    "Meta": null,
    "Port": 4646,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-server-qbhkseswqmujxeeistllxgen5r2wtuxz": {
    "ID": "_nomad-server-qbhkseswqmujxeeistllxgen5r2wtuxz",
    "Service": "nomad",
    "Tags": [
      "rpc"
    ],
    "Address": "172.20.1.1",
    "Meta": null,
    "Port": 4647,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-task-opz4lzlmabl6qq3dwacovsndgamrlwwm": {
    "ID": "_nomad-task-opz4lzlmabl6qq3dwacovsndgamrlwwm",
    "Service": "node-app1",
    "Tags": [
      "node-app"
    ],
    "Address": "172.20.1.3",
    "Meta": null,
    "Port": 11000,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-task-qrchy7e4z5k454giivl6x2epzsq5ckwz": {
    "ID": "_nomad-task-qrchy7e4z5k454giivl6x2epzsq5ckwz",
    "Service": "node-app2",
    "Tags": [
      "node-app"
    ],
    "Address": "172.20.1.12",
    "Meta": null,
    "Port": 11000,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-task-zlqzzqka63feupnvmajldzyjm2iig4tu": {
    "ID": "_nomad-task-zlqzzqka63feupnvmajldzyjm2iig4tu",
    "Service": "node-app3",
    "Tags": [
      "node-app"
    ],
    "Address": "172.20.1.5",
    "Meta": null,
    "Port": 11000,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "vault:172.20.1.1:8200": {
    "ID": "vault:172.20.1.1:8200",
    "Service": "vault",
    "Tags": [
      "standby"
    ],
    "Address": "172.20.1.1",
    "Meta": null,
    "Port": 8200,
    "EnableTagOverride": false,
    "CreateIndex": 0,
    "ModifyIndex": 0
  }
}

@dadgar
Contributor

dadgar commented May 11, 2018

@grin0c Could you also show the output of nomad agent-info from the node that you can't connect to?

@grin0c

grin0c commented May 11, 2018

@dadgar

client
  heartbeat_ttl = 14.858366292s
  known_servers = 172.20.1.1:4647,172.20.2.1:4647,172.20.6.1:4647
  last_heartbeat = 813.115665ms
  node_id = cae5bc37-6646-c7e7-4441-ed1a06066f5f
  num_allocations = 11
nomad
  bootstrap = false
  known_regions = 1
  leader = false
  leader_addr = 172.20.6.1:4647
  server = true
raft
  applied_index = 22056
  commit_index = 22056
  fsm_pending = 0
  last_contact = 36.544257ms
  last_log_index = 22056
  last_log_term = 8
  last_snapshot_index = 16403
  last_snapshot_term = 8
  latest_configuration = [{Suffrage:Voter ID:172.20.2.1:4647 Address:172.20.2.1:4647} {Suffrage:Voter ID:172.20.1.1:4647 Address:172.20.1.1:4647} {Suffrage:Voter ID:172.20.6.1:4647 Address:172.20.6.1:4647}]
  latest_configuration_index = 1
  num_peers = 2
  protocol_version = 2
  protocol_version_max = 3
  protocol_version_min = 0
  snapshot_version_max = 1
  snapshot_version_min = 0
  state = Follower
  term = 8
runtime
  arch = amd64
  cpu_count = 8
  goroutines = 462
  kernel.name = linux
  max_procs = 8
  version = go1.9.2
serf
  coordinate_resets = 0
  encrypted = true
  event_queue = 0
  event_time = 1
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 144
  members = 3
  query_queue = 0
  query_time = 1

@dadgar
Contributor

dadgar commented May 11, 2018

@yellowmegaman Are you also running servers and clients together?

@yellowmegaman

yellowmegaman commented May 11, 2018

@dadgar Yes, I can provide additional info tomorrow.

We're using small clusters (5-7 nodes), sometimes on bare metal, so all nodes are clients too, since running server-only mode on 3 nodes would be a serious waste of resources.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022