Nomad 0.7.x gives error parsing some ports #4008

Closed
rberlind opened this issue Mar 20, 2018 · 9 comments

@rberlind
Contributor

rberlind commented Mar 20, 2018

Nomad version

Output from nomad version
Nomad v0.7.1 (0b295d3)

Operating system and Environment details

Ubuntu 16.04.2 LTS (Xenial Xerus)

Issue

A Nomad configuration that works on Nomad 0.6.3 does not work on 0.7.0 or 0.7.1.
After running nomad run sockshop.nomad, two of the database services fail to start with this error:
"unable to get address for service "catalogue-db": invalid port "http": strconv.Atoi: parsing "http": invalid syntax"

Note that other services in the Nomad configuration also have http ports which are parsed fine.

Reproduction steps

See README.md for https://github.com/rberlind/nomad-microservices-demo-with-weave/tree/nomad-0.7.1

Nomad Server logs (if appropriate)

nomad status sockshop
ID            = sockshop
Name          = sockshop
Submit Date   = 03/20/18 04:42:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
carts         0       0         1        0       0         0
catalogue     0       0         0        1       0         0
frontend      0       0         1        0       0         0
orders        0       0         1        0       0         0
payment       0       0         1        0       0         0
queue-master  0       0         1        0       0         0
rabbitmq      0       0         1        0       0         0
shipping      0       0         1        0       0         0
user          0       0         0        1       0         0

Latest Deployment
ID          = aeaeadef
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group    Desired  Placed  Healthy  Unhealthy
carts         1        1       1        0
catalogue     1        1       0        1
frontend      1        1       1        0
orders        1        1       1        0
payment       1        1       1        0
queue-master  1        1       1        0
rabbitmq      1        1       1        0
shipping      1        1       1        0
user          1        1       0        1

Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created   Modified
16c9dac7  aa449611  carts         0        run      running  1m4s ago  29s ago
20dd1ab9  5d632dd3  rabbitmq      0        run      running  1m4s ago  33s ago
7456f312  aa449611  catalogue     0        run      failed   1m4s ago  23s ago
79a99b94  5d632dd3  orders        0        run      running  1m4s ago  29s ago
b96d516d  aa449611  user          0        run      failed   1m4s ago  23s ago
cfd9f35b  aa449611  shipping      0        run      running  1m4s ago  36s ago
d0fb4130  5d632dd3  payment       0        run      running  1m4s ago  39s ago
db44b894  aa449611  queue-master  0        run      running  1m4s ago  36s ago
ec96d641  5d632dd3  frontend      0        run      running  1m4s ago  32s ago

ubuntu@ip-172-23-3-48:~$ nomad alloc-status 7456f312
ID                  = 7456f312
Eval ID             = 3247aa2b
Name                = sockshop.catalogue[0]
Node ID             = aa449611
Job ID              = sockshop
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1m25s ago
Modified            = 44s ago
Deployment ID       = aeaeadef
Deployment Health   = unhealthy

Task "catalogue" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  128 MiB  300 MiB  0     http: 172.23.0.61:22227

Task Events:
Started At     = 03/20/18 04:43:12 UTC
Finished At    = 03/20/18 04:43:37 UTC
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type                 Description
03/20/18 04:43:37 UTC  Killed               Task successfully killed
03/20/18 04:43:36 UTC  Killing              Sent interrupt. Waiting 5s before force killing
03/20/18 04:43:36 UTC  Sibling Task Failed  Task's sibling "cataloguedb" failed
03/20/18 04:43:12 UTC  Started              Task started by client
03/20/18 04:42:56 UTC  Driver               Downloading image weaveworksdemos/catalogue:0.3.5
03/20/18 04:42:56 UTC  Task Setup           Building Task Directory
03/20/18 04:42:56 UTC  Received             Task received by client

Task "cataloguedb" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  256 MiB  300 MiB  0     http: 172.23.0.61:23682

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type             Description
03/20/18 04:43:36 UTC  Alloc Unhealthy  Unhealthy because of failed task
03/20/18 04:43:36 UTC  Not Restarting   Error was unrecoverable
03/20/18 04:43:36 UTC  Driver Failure   unable to get address for service "catalogue-db": invalid port "http": strconv.Atoi: parsing "http": invalid syntax
03/20/18 04:42:57 UTC  Driver           Downloading image weaveworksdemos/catalogue-db:0.3.5
03/20/18 04:42:56 UTC  Task Setup       Building Task Directory
03/20/18 04:42:56 UTC  Received         Task received by client

ubuntu@ip-172-23-3-48:~$ nomad alloc-status b96d516d
ID                  = b96d516d
Eval ID             = 3247aa2b
Name                = sockshop.user[0]
Node ID             = aa449611
Job ID              = sockshop
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 5m27s ago
Modified            = 4m46s ago
Deployment ID       = aeaeadef
Deployment Health   = unhealthy

Task "user" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  256 MiB  300 MiB  0     http: 172.23.0.61:27997

Task Events:
Started At     = 03/20/18 04:43:12 UTC
Finished At    = 03/20/18 04:43:36 UTC
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type                 Description
03/20/18 04:43:36 UTC  Killed               Task successfully killed
03/20/18 04:43:36 UTC  Killing              Sent interrupt. Waiting 5s before force killing
03/20/18 04:43:36 UTC  Sibling Task Failed  Task's sibling "user-db" failed
03/20/18 04:43:12 UTC  Started              Task started by client
03/20/18 04:42:56 UTC  Driver               Downloading image weaveworksdemos/user:master-5e88df65
03/20/18 04:42:56 UTC  Task Setup           Building Task Directory
03/20/18 04:42:56 UTC  Received             Task received by client

Task "user-db" is "dead"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
100 MHz  96 MiB  300 MiB  0     http: 172.23.0.61:25649

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type             Description
03/20/18 04:43:36 UTC  Alloc Unhealthy  Unhealthy because of failed task
03/20/18 04:43:36 UTC  Not Restarting   Error was unrecoverable
03/20/18 04:43:36 UTC  Driver Failure   unable to get address for service "user-db": invalid port "http": strconv.Atoi: parsing "http": invalid syntax
03/20/18 04:42:56 UTC  Driver           Downloading image weaveworksdemos/user-db:master-5e88df65
03/20/18 04:42:56 UTC  Task Setup       Building Task Directory
03/20/18 04:42:56 UTC  Received         Task received by client

Nomad Client logs (if appropriate)

Job file (if appropriate)

See https://github.com/rberlind/nomad-microservices-demo-with-weave/blob/nomad-0.7.1/shared/jobs/sockshop.nomad

@schmichael
Member

Summary

This is a duplicate of #3681, but that's an extremely long issue so let me summarize here!

The service named in the error is trying to register with an invalid port. The label http is specified, but because a custom network is used, Nomad tries to resolve that label through the driver's port_map, which this task does not define.

Details

Nomad 0.6 allowed empty or invalid service.port values and would default them to 0.

Nomad 0.7.1 began requiring valid service.port values. For services using address_mode=driver, this means non-numeric service.port values are looked up in the driver's port_map.

The error logging has been improved on master for the upcoming 0.8 release:

[ERR] client: failed to register services and checks for task "cataloguedb" alloc "ab70...":
unable to get address for service "catalogue-db": invalid port label "http": port labels in
driver address_mode must be numeric or in the driver's port map

Since catalogue-db is defined as:

service {
  name = "catalogue-db"
  tags = ["db", "catalogue", "catalogue-db"]
  port = "http"
}

The port "http" is looked up in the driver's port_map. Since cataloguedb does not specify a port_map, you get this error.

Resolution

There are a few fixes:

  • Remove the port="http" parameter, as it has never had an effect, even in 0.6. Nomad 0.7.1 and 0.8 will register with port=0 just like 0.6.
  • Add a port_map entry for http.
  • Add a numeric port value in the service stanza, such as port = "80".
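
For illustration, the second and third fixes might look like this on the simplified cataloguedb task. This is only a sketch: the container port 3306 is an assumption (MySQL's default port) and does not come from the thread, so substitute whatever port the database actually listens on.

```hcl
task "cataloguedb" {
  driver = "docker"

  config {
    image        = "weaveworksdemos/catalogue-db:0.3.5"
    network_mode = "weave"

    # Fix 2: give the "http" label an entry in the driver's port_map so it
    # can be resolved when the service is advertised in driver address mode.
    # 3306 is an assumed container port, not a value taken from the thread.
    port_map {
      http = 3306
    }
  }

  service {
    name = "catalogue-db"
    tags = ["db", "catalogue", "catalogue-db"]

    # With the port_map entry above, the label resolves.
    port = "http"

    # Fix 3 (alternative to the port_map entry): use a numeric port instead.
    # port = "3306"
  }

  resources {
    cpu    = 100
    memory = 256

    network {
      mbits = 10
      port "http" {}
    }
  }
}
```

Only one of the two fixes is needed. The first fix is simply the same task with the port line removed from the service stanza.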

@schmichael
Member

For reference I used this simplified version of the job file to ease reproduction:

job "sockshop" {
  datacenters = ["dc1"]

  update {
    stagger = "10s"
    max_parallel = 1
  }

  # - catalogue - #
  group "catalogue" {
    count = 1

    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }

    # - db - #
    task "cataloguedb" {
      driver = "docker"

      config {
        image = "weaveworksdemos/catalogue-db:0.3.5"
        hostname = "catalogue-db.service.consul"
        network_mode = "weave"
        dns_servers = ["172.17.0.1"]
        dns_search_domains = ["service.consul"]
      }

      template {
        data = <<EOH
        MYSQL_ROOT_PASSWORD="foo"
        EOH
        destination = "secrets/mysql_root_pwd.env"
        env = true
      }

      env {
        MYSQL_DATABASE = "socksdb"
        MYSQL_ALLOW_EMPTY_PASSWORD = "false"
      }

      service {
        name = "catalogue-db"
        tags = ["db", "catalogue", "catalogue-db"]
      }

      resources {
        cpu = 100 # 100 MHz
        memory = 256 # 256MB
        network {
          mbits = 10
          port "http" {}
        }
      }

    } # - end db - #
  } # - end catalogue - #
}

@rberlind
Contributor Author

Thanks @schmichael. Just saw these updates. Thanks for looking at this and suggesting a workaround. I will test tomorrow.

Can you explain why I got errors for catalogue-db and user-db, but not for other services that did the same thing with regard to ports?

Can you also clarify what the doc at https://www.nomadproject.io/docs/job-specification/service.html#port is trying to say about mapping a port in the service to a port in the network stanza?

"port: Specifies the label of the port on which this service is running. Note this is the label of the port and not the port number unless address_mode = driver. The port label must match one defined in the network stanza unless you're also using address_mode="driver". Numeric ports may be used when in driver addressing mode."

It seemed to me that by specifying the service port as "http" and matching the same under network, I was complying with Nomad job specification rules.

Thanks.

@rberlind
Contributor Author

Getting rid of the port definition in each service did fix the problem for me. I can now run the demo on Nomad 0.7.1 and see the Nomad UI.

Thanks.

@schmichael
Member

It seemed to me that by specifying the service port as "http" and matching the same under network, I was complying with Nomad job specification rules.

That would have been true if you had used host or bridge networking. Since you're using a custom network and driver (sockshop+weave), Nomad assumes you want to advertise the driver's port. This autodetection is unfortunately subtle and creates a confusing number of configuration options I'm afraid.

The easiest way to think about it is:

  • service.address_mode = "host" uses the host IP/port (so the port from the resources block)
  • service.address_mode = "driver" uses the IP/port specified by Docker (or whatever network_mode you're using in Docker).
  • service.address_mode = "auto" is the default and what you're implicitly using: it uses host for host/bridge networking modes and driver for any unrecognized/custom networks.
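
If relying on the auto detection is undesirable, the mode can also be pinned explicitly in the service stanza. A minimal sketch, assuming the goal is to advertise the host IP and the port labelled http from the resources block; whether that address is actually reachable from other services on a custom Docker network is not covered in this thread:

```hcl
service {
  name = "catalogue-db"
  tags = ["db", "catalogue", "catalogue-db"]

  # "host" advertises the host IP and the port labelled "http" in the task's
  # resources/network block, instead of letting "auto" pick driver mode
  # because of the custom network_mode.
  address_mode = "host"
  port         = "http"
}
```

Setting address_mode = "driver" explicitly would instead make the dependency on a port_map entry or a numeric port visible in the job file.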

Hope that helps and please let me know if there's still confusion or if you have any documentation suggestions! This is a classic example of defaults hopefully making things Just Work for most people but failing in very confusing ways for others.

@rberlind
Contributor Author

That does help quite a bit, @schmichael.
Do you think the doc I mentioned should be updated to explain things in terms of the Docker networks being used for the task rather than the setting of "address_mode"? Looking back at that paragraph, I had not connected the reference to the address_mode setting with my use of a custom Docker network, since I had not explicitly set address_mode.

Also, any ideas on why only 2 of my 13 services were affected by the problem?

schmichael added a commit that referenced this issue Mar 27, 2018
Hopefully helps prevent more issues like #3681 and #4008. The
port/address_mode logic is really subtle, and it took me a long time to
diagnose #4008 despite being the one to have addressed the duplicate
issue before! Not to mention I wrote the code! Definitely need to do
something to make it more understandable...
@schmichael
Member

Do you think the doc I had mentioned should be updated ...

Absolutely. I made an attempt in #4055, but I would love advice. service.port is totally dependent upon the address_mode which itself is quite subtle thanks to the default auto behavior...

Also, any ideas on why only 2 of my 13 services were affected by the problem?

All of your other services had a port_map stanza mapping ports in Docker, so when Nomad autodetected address_mode=driver it advertised that port from the port_map in Consul.

@rberlind
Contributor Author

Thanks so much for clarifying. I do see the port map settings on the other 11. I have no idea why they were not on the other 2.

Will review the other question tomorrow.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2022