Nomad client tries to connect to serf IP address, not to RPC #16211

Closed
Kamilcuk opened this issue Feb 17, 2023 · 6 comments · Fixed by #16217


Kamilcuk (Contributor) commented Feb 17, 2023

Nomad version

1.4.3

Operating system and Environment details

Fedora 29

Issue

The Nomad client does not use the RPC IP to connect to the Nomad servers. It uses the serf IP instead, when it should be using the RPC IP.

Reproduction steps

I configured the server with the following advertise block:

advertise {
  http = "172.29.248.59:4646"
  rpc  = "172.29.248.59:4647"
  serf = "10.120.18.153:4648"
}

The Nomad servers have registered their services in Consul:


$ curl 'http://localhost:8500/v1/catalog/service/nomad' | ..... | jq -c '.[]'
{"ID":"6172a751-c027-acc1-8c32-d1bb749760f9","Node":"--","Address":"172.29.248.59","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.248.59","lan_ipv4":"172.29.248.59","wan":"172.29.248.59","wan_ipv4":"172.29.248.59"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-3vwcahrnlg4szdzuqmlmufy5gqq6xvov","ServiceName":"nomad","ServiceTags":["rpc"],"ServiceAddress":"172.29.248.59","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"172.29.248.59","Port":4647},"wan_ipv4":{"Address":"172.29.248.59","Port":4647}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4647,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5707990,"ModifyIndex":5707990}
{"ID":"6172a751-c027-acc1-8c32-d1bb749760f9","Node":"--","Address":"172.29.248.59","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.248.59","lan_ipv4":"172.29.248.59","wan":"172.29.248.59","wan_ipv4":"172.29.248.59"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-g4w26cd2apukkybdzp3ihd4ipmwh34mb","ServiceName":"nomad","ServiceTags":["http"],"ServiceAddress":"172.29.248.59","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"172.29.248.59","Port":4646},"wan_ipv4":{"Address":"172.29.248.59","Port":4646}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4646,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5707992,"ModifyIndex":5707992}
{"ID":"6172a751-c027-acc1-8c32-d1bb749760f9","Node":"--","Address":"172.29.248.59","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.248.59","lan_ipv4":"172.29.248.59","wan":"172.29.248.59","wan_ipv4":"172.29.248.59"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-gsrn26tollr6ayiyuq3yjtylhnzz7d3t","ServiceName":"nomad","ServiceTags":["serf"],"ServiceAddress":"10.120.18.153","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"10.120.18.153","Port":4648},"wan_ipv4":{"Address":"10.120.18.153","Port":4648}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4648,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5707991,"ModifyIndex":5707991}
{"ID":"eeea99dd-8e71-5dc6-f0df-c56e9a25bf49","Node":"--","Address":"172.29.192.150","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.192.150","lan_ipv4":"172.29.192.150","wan":"172.29.192.150","wan_ipv4":"172.29.192.150"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-4omkog7wzh355vphr6ivre26f2l2fqtd","ServiceName":"nomad","ServiceTags":["serf"],"ServiceAddress":"10.120.18.150","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"10.120.18.150","Port":4648},"wan_ipv4":{"Address":"10.120.18.150","Port":4648}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4648,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5552076,"ModifyIndex":5552076}
{"ID":"eeea99dd-8e71-5dc6-f0df-c56e9a25bf49","Node":"--","Address":"172.29.192.150","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.192.150","lan_ipv4":"172.29.192.150","wan":"172.29.192.150","wan_ipv4":"172.29.192.150"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-amo36cavowl2cavng4ehhu3ymlneoygo","ServiceName":"nomad","ServiceTags":["rpc"],"ServiceAddress":"172.29.192.150","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"172.29.192.150","Port":4647},"wan_ipv4":{"Address":"172.29.192.150","Port":4647}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4647,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5552075,"ModifyIndex":5552075}
{"ID":"eeea99dd-8e71-5dc6-f0df-c56e9a25bf49","Node":"--","Address":"172.29.192.150","Datacenter":"dev","TaggedAddresses":{"lan":"172.29.192.150","lan_ipv4":"172.29.192.150","wan":"172.29.192.150","wan_ipv4":"172.29.192.150"},"NodeMeta":{"consul-network-segment":""},"ServiceKind":"","ServiceID":"_nomad-server-lpfilfijfob3k75p7qahkwbvn6lzcd2u","ServiceName":"nomad","ServiceTags":["http"],"ServiceAddress":"172.29.192.150","ServiceTaggedAddresses":{"lan_ipv4":{"Address":"172.29.192.150","Port":4646},"wan_ipv4":{"Address":"172.29.192.150","Port":4646}},"ServiceWeights":{"Passing":1,"Warning":1},"ServiceMeta":{"external-source":"nomad"},"ServicePort":4646,"ServiceSocketPath":"","ServiceEnableTagOverride":false,"ServiceProxy":{"Mode":"","MeshGateway":{},"Expose":{}},"ServiceConnect":{},"CreateIndex":5552077,"ModifyIndex":5552077}

However, the Nomad clients connect to the 10.120.x.x (serf) IP addresses on port 4647:

Feb 17 06:31:28 taskset[2798]:     2023-02-17T06:31:28.513-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateAlloc server=10.120.18.153:4647
Feb 17 06:31:28 taskset[2798]:     2023-02-17T06:31:28.513-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateAlloc server=10.120.18.153:4647
Feb 17 06:31:28 taskset[2798]:     2023-02-17T06:31:28.513-0500 [ERROR] client: error updating allocations: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Feb 17 06:31:41 taskset[2798]:     2023-02-17T06:31:41.744-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.120.18.150:4647: i/o timeout" rpc=Node.GetClientAllocs server=10.120.18.150:4647
Feb 17 06:31:41 taskset[2798]:     2023-02-17T06:31:41.744-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 10.120.18.150:4647: i/o timeout" rpc=Node.GetClientAllocs server=10.120.18.150:4647
Feb 17 06:31:41 taskset[2798]:     2023-02-17T06:31:41.744-0500 [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: dial tcp 10.120.18.150:4647: i/o timeout"

Expected Result

Nomad clients should connect to the servers' RPC IP and port.

Actual Result

Nomad clients try to connect to the serf IP and fail.

Client config

log_level = "INFO"
region = "us"
datacenter = "..."
data_dir = "..."
disable_update_check = true

bind_addr = "0.0.0.0"
advertise {
  http = "172.29.192.52:4646"
  rpc  = "172.29.192.52:4647"
  serf = "172.29.192.52:4648"
}

ui {
  enabled = false
}


plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin "docker" {
  config {
    extra_labels = ["job_name", "job_id", "task_group_name", "task_name", "namespace", "node_name", "node_id"]
    allow_privileged = true
    allow_caps = ["audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot", "sys_ptrace", "net_admin", "ipc_lock", "sys_nice"]
    volumes {
      enabled = true
    }
  }
}

client {
  enabled = true
  node_class = "..."
  network_interface = "..."
  max_kill_timeout = "5m"
}


consul {
  address = "127.0.0.1:8500"
}

acl {
  enabled = true
}

telemetry {
  collection_interval = "10s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}
tgross added this to Needs Triage in Nomad - Community Issues Triage via automation on Feb 17, 2023
tgross (Member) commented Feb 17, 2023

@Kamilcuk can you share your server_join block from the client configuration?

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage on Feb 17, 2023
tgross self-assigned this on Feb 17, 2023
tgross changed the title from "Nomad agent tries to connect to SERF IP address, not to RPC" to "Nomad client tries to connect to serf IP address, not to RPC" on Feb 17, 2023
Kamilcuk (Contributor, Author) commented Feb 17, 2023

Hi! It's missing; the default is the empty array [].

tgross (Member) commented Feb 17, 2023

Hi @Kamilcuk! This sounds an awful lot like the situation that was described in #11895 but that I was never able to replicate.

The function on the client that handles this is consulDiscoveryImpl. The steps it takes are:

  • Uses the Consul service API to get the "nomad" service entries with the rpc tag (ref client.go#L2863).
  • Queries each of the addresses it finds, sending a Nomad RPC request to the Status.Peers endpoint to get the list of Nomad servers (ref client.go#L2881).
  • If there are servers in the response, uses that list as the list of Nomad servers to connect to. We can tell that your clients got a list of servers here, because otherwise the logs wouldn't show the Node.UpdateAlloc RPCs at all.

So Consul is used to initially discover an advertised list of servers, but then the client uses that list to get the peers (which I think is intended to exclude the non-voting servers).
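
To make that flow concrete, here is a minimal Go sketch of those three steps (a hypothetical illustration, not the actual consulDiscoveryImpl code). It assumes the Consul Go API client and a local Consul agent, and the Status.Peers call is only a placeholder, since the real client sends a Nomad msgpack RPC to each discovered address:

// Hypothetical sketch of the three discovery steps above; not the real
// consulDiscoveryImpl. Assumes a local Consul agent and the Consul Go API
// client (github.com/hashicorp/consul/api).
package main

import (
	"fmt"
	"log"
	"net"

	consulapi "github.com/hashicorp/consul/api"
)

// statusPeers stands in for the Nomad Status.Peers RPC sent to each candidate
// server; the real call returns the raft peer set as "ip:port" strings.
func statusPeers(addr string) ([]string, error) {
	return nil, fmt.Errorf("placeholder: would send Status.Peers RPC to %s", addr)
}

func main() {
	consul, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: look up the "nomad" service entries tagged "rpc" in Consul.
	services, _, err := consul.Catalog().Service("nomad", "rpc", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Steps 2 and 3: ask each discovered address for the raft peer set, and
	// use whatever comes back as the client's server list.
	for _, s := range services {
		addr := net.JoinHostPort(s.ServiceAddress, fmt.Sprint(s.ServicePort))
		peers, err := statusPeers(addr)
		if err != nil {
			log.Printf("skipping %s: %v", addr, err)
			continue
		}
		fmt.Println("nomad servers:", peers)
	}
}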

There are two serf tags of interest to us here. One is port, which is used for server-to-server communication, and the other is rpc_addr, which is supposed to be the address advertised to clients (ref server.go#L1528-L1529). If we go to the source for those configuration values in agent.go#L313-L333, we can see the following:

  • The client advertise address is the RPC advertise address + the RPC port.
  • The server advertise address is the serf advertise address + the RPC port!

It looks like that behavior was introduced in 5976511, in v0.8.0.
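
Put as a standalone illustration (a sketch only, using the advertise addresses from this issue; not the actual agent.go code), the two derivations look like this:

// Illustration of the two bullets above, using the advertise addresses from
// this issue's server config; not the actual agent.go code.
package main

import (
	"fmt"
	"net"
)

func main() {
	rpcAdvertise := &net.TCPAddr{IP: net.ParseIP("172.29.248.59"), Port: 4647}
	serfAdvertise := &net.TCPAddr{IP: net.ParseIP("10.120.18.153"), Port: 4648}

	// Client advertise address: RPC advertise IP + RPC port.
	clientAdvertise := net.JoinHostPort(rpcAdvertise.IP.String(), fmt.Sprint(rpcAdvertise.Port))

	// Server advertise address: serf advertise IP + RPC port.
	serverAdvertise := net.JoinHostPort(serfAdvertise.IP.String(), fmt.Sprint(rpcAdvertise.Port))

	fmt.Println("client advertise:", clientAdvertise) // 172.29.248.59:4647
	fmt.Println("server advertise:", serverAdvertise) // 10.120.18.153:4647
}

The second value is exactly the 10.120.18.153:4647 address showing up in the client error logs above.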

So the last thing to do is figure out where Status.Peers is getting the address from. The Status.Peers RPC queries the list of servers from the raft configuration. This should be the RPC address, because that's the address the Raft protocol itself uses for communication, but I suspect what's happening here is that it's actually the "server advertise address", which, as we saw, is for some reason the serf advertise address.

There are two diagnostics you can run to help us debug further:

  • The Status.Peers RPC used by the client is the same RPC that serves the List Peers API, so if you do nomad operator api /v1/status/peers you should see the RPC addresses and not the serf addresses.
  • The List Members API should have all the serf tags described above. So if you do nomad operator api /v1/agent/members to one of the servers, that should let us verify which tags are getting set with which values.

Kamilcuk (Contributor, Author) commented Feb 17, 2023

Thank you. Your posts are, as always, way too comprehensive for me. I fixed the issue by using server_join: [ 'rpc.nomad.service.consul' ], since the rpc-tagged service only carries the RPC IP. On Monday I can put the node config back and test if needed. I have only one region, "us".

I see status/peers lists serf addresses, not RPC ones. As I understand it, this is unexpected?

outputs
$ nomad operator api /v1/status/peers
["10.120.18.150:4647","10.120.18.153:4647","10.120.18.52:4647"]
$ nomad operator api /v1/agent/members | jq '.'
{
  "ServerName": "...",
  "ServerRegion": "us",
  "ServerDC": "server",
  "Members": [
    {
      "Name": "...",
      "Addr": "10.120.18.150",
      "Port": 4648,
      "Tags": {
        "dc": "server",
        "port": "4647",
        "revision": "f464aca721d222ae9c1f3df643b3c3aaa20e2da7",
        "raft_vsn": "3",
        "role": "nomad",
        "vsn": "1",
        "id": "e0b5de2c-8946-3b12-f0be-63bf6667ea19",
        "build": "1.4.3",
        "rpc_addr": "172.29.192.150",
        "region": "us"
      },
      "Status": "alive",
      "ProtocolMin": 1,
      "ProtocolMax": 5,
      "ProtocolCur": 2,
      "DelegateMin": 2,
      "DelegateMax": 5,
      "DelegateCur": 4
    },
    {
      "Name": "...",
      "Addr": "10.120.18.153",
      "Port": 4648,
      "Tags": {
        "raft_vsn": "3",
        "id": "73234d8d-4371-7745-d2bd-9d74480e5229",
        "port": "4647",
        "dc": "server",
        "role": "nomad",
        "revision": "f464aca721d222ae9c1f3df643b3c3aaa20e2da7",
        "region": "us",
        "vsn": "1",
        "rpc_addr": "172.29.248.59",
        "build": "1.4.3"
      },
      "Status": "alive",
      "ProtocolMin": 1,
      "ProtocolMax": 5,
      "ProtocolCur": 2,
      "DelegateMin": 2,
      "DelegateMax": 5,
      "DelegateCur": 4
    },
    {
      "Name": "...",
      "Addr": "10.120.18.52",
      "Port": 4648,
      "Tags": {
        "role": "nomad",
        "dc": "server",
        "raft_vsn": "3",
        "region": "us",
        "port": "4647",
        "revision": "f464aca721d222ae9c1f3df643b3c3aaa20e2da7",
        "vsn": "1",
        "build": "1.4.3",
        "id": "a6f08bcd-839a-efd7-d227-b1dc82d92878",
        "rpc_addr": "172.29.192.51"
      },
      "Status": "alive",
      "ProtocolMin": 1,
      "ProtocolMax": 5,
      "ProtocolCur": 2,
      "DelegateMin": 2,
      "DelegateMax": 5,
      "DelegateCur": 4
    }
  ]
}

(Relevant) server config:

log_level = "INFO"
region = "us"
datacenter = "server"
data_dir = "..."
disable_update_check = true

bind_addr = "0.0.0.0"
advertise {
  http = "172.29.192.51:4646"
  rpc  = "172.29.192.51:4647"
  serf = "10.120.18.52:4648"
}

client {
  enabled = true
  node_class = "prod"
  network_interface = "<interface with 172.29. ip address>"
  servers = [
    # added this today
    "http.nomad.service.consul....",
  ]
  max_kill_timeout = "5m"
}

leave_on_interrupt = true
leave_on_terminate = true
server {
  enabled = true
  server_join {
    retry_join = [
      # Also added this today. I think I should add serf. prefix here.
      "nomad.service.consul...",
    ]
  }
  rejoin_after_leave = true
  node_gc_threshold = "24h"
  job_gc_threshold = "48h"
  eval_gc_threshold = "1h"
  deployment_gc_threshold = "1h"
}

consul {
  address = "127.0.0.1:8500"
}

acl {
  enabled = true
}

telemetry {
  collection_interval = "10s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

tgross (Member) commented Feb 17, 2023

I see status/peers lists serf addresses, not RPC ones. As I understand it, this is unexpected?

Yeah, /v1/status/peers is returning the serf address. But /v1/agent/members shows Members.Addr as the serf address (apparently expected as of 0.8), while the rpc_addr tag shows the expected RPC address. So either we need a new RPC for the client addresses (which is what was proposed in #11895), or we need to query the members endpoint instead and extract that rpc_addr tag.
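
For illustration, a rough Go sketch of that second option using the Nomad Go API client (hypothetical; not what the eventual fix implements): list the members and rebuild a client-facing address from each member's rpc_addr and port tags. Against the members output above, this would yield the 172.29.x.x:4647 addresses rather than the 10.120.x.x ones.

// Hypothetical sketch of the "query the members endpoint and extract rpc_addr"
// option; not the actual fix. Assumes the Nomad Go API client
// (github.com/hashicorp/nomad/api) and a reachable server agent.
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	members, err := client.Agent().Members() // same data as /v1/agent/members
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members.Members {
		// rpc_addr is the address advertised to clients; port is the RPC port.
		rpcAddr, rpcPort := m.Tags["rpc_addr"], m.Tags["port"]
		if rpcAddr == "" || rpcPort == "" {
			continue
		}
		fmt.Printf("%s: dial %s (serf address is %s:%d)\n",
			m.Name, net.JoinHostPort(rpcAddr, rpcPort), m.Addr, m.Port)
	}
}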

There are going to be some backwards-compatibility concerns here that we'll need to discuss. I'm going to try to reproduce this on my own in the meantime.

tgross (Member) commented Feb 17, 2023

Hi @Kamilcuk! I've just opened #16217 with what looks like a working fix. This isn't going to make it into Nomad 1.5.0 GA, because we need to do some testing around the other auto-discovery mechanisms. But if you want to check out that patch and run make dev, you can give it a test in your own environment as well. In the meantime, the workaround you discovered with the service names should do the job until this goes out.
