
nomad alloc exec fails in TLS enabled clusters. #7233

Closed
henrikjohansen opened this issue Feb 27, 2020 · 14 comments · Fixed by #7274

Comments


henrikjohansen commented Feb 27, 2020

Nomad version

Nomad v0.10.4+ent (284fc3a)

Issue

We are running a TLS-enabled cluster - all nomad CLI commands work with the exception of nomad alloc exec, which fails with a TLS error:

failed to exec into task: x509: certificate is valid for 127.0.0.1, not 1.2.3.4

Yes, you could set NOMAD_SKIP_VERIFY, but this is not something I can recommend to our internal users.
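
For context, the CLI on these clusters is configured per the TLS guide, roughly like the following (the hostname and file paths are illustrative, not our exact values):

$ # Standard client-side TLS settings for the nomad CLI (values illustrative)
$ export NOMAD_ADDR=https://nomad.corp.example.com:4646
$ export NOMAD_CACERT=/etc/nomad.d/tls/nomad-ca.pem
$ export NOMAD_CLIENT_CERT=/etc/nomad.d/tls/cli.pem
$ export NOMAD_CLIENT_KEY=/etc/nomad.d/tls/cli-key.pem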

Running the example job from nomad init:

$ nomad alloc fs 74a25298

drwxrwxrwx  4.0 KiB  2020-02-27T10:58:08Z  alloc/
drwxrwxrwx  4.0 KiB  2020-02-27T10:58:08Z  redis/ 

$ nomad alloc logs 74a25298

                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.12 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 1
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

....
....
....

$ nomad alloc exec 74a25298 /bin/sh

failed to exec into task: x509: certificate is valid for 127.0.0.1, not 1.2.3.4

Reproduction steps

See above


notnoop commented Feb 28, 2020

Thanks @henrikjohansen for reporting this. We'll take a closer look here.

I'm a bit puzzled by the error message. It somewhat implies that the cert is only valid for 127.0.0.1 and fails to validate against the IP address the CLI uses to connect to the host. If that's true, then I would expect other connections to fail as well. Would you mind providing more info about the certs being used, their SAN values, and whether there is a difference between the server and client cert setup? I would appreciate any feedback or input you have.

Unlike other endpoints, nomad alloc exec uses a WebSocket connection, so we may have introduced an unintentional validation difference.

@henrikjohansen (Author)

Hey @notnoop - we followed the instructions from the official docs to the letter.

nomad alloc exec does fail with the error message mentioned above - it seems like it does not respect certificate roles (client.global.nomad) like the other nomad alloc commands do?


notnoop commented Feb 28, 2020

That would explain it. I will follow up to ensure that role handling is consistent. Thanks for the pointers and the quick follow-up.

I'll consult with the team about the docs. I suspect adding the host IP/domain would be beneficial for browsers (when using the UI), as they don't special-case certificate roles either.

@henrikjohansen (Author)

FYI - I have tried adding the relevant hostnames as subject alternative names to the client cert ... that did not make a difference. Besides, cert roles are what makes Nomad TLS without Vault tolerable to manage :)
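
For reference, the SANs that actually made it into a cert can be double-checked with something like the following (cli.pem is just the CLI cert filename from the guide, and the output shown is an illustration of the role-based values):

$ # Inspect the Subject Alternative Name extension of the CLI/client cert
$ openssl x509 -in cli.pem -noout -text | grep -A1 'Subject Alternative Name'
            X509v3 Subject Alternative Name:
                DNS:client.global.nomad, DNS:localhost, IP Address:127.0.0.1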


notnoop commented Mar 5, 2020

@henrikjohansen Thanks again for reporting the issue. I have fixed the bug where the nomad alloc exec handling differed from nomad alloc logs. PR #7274 has some context on what the issue is.

That said, I have confirmed that nomad alloc logs validates the host against the cert CN/SAN values the way a browser would, and doesn't special-case the client role (as roles are meant specifically for internal RPC communication).

In a test cluster that follows the docs above, I can verify the failure as follows; note that the nomad alloc logs error message is misleading:

# Normal Server operation
root@baa15f77ae18:/etc/nomad.d/tls# NOMAD_ADDR=https://127.0.0.1:4646/ nomad server members
Name                 Address     Port  Status  Leader  Protocol  Build       Datacenter  Region
00174680e9af.global  172.19.0.5  4648  alive   false   2         0.11.0-dev  dc1         global
72e1cd066e21.global  172.19.0.3  4648  alive   true    2         0.11.0-dev  dc1         global
baa15f77ae18.global  172.19.0.2  4648  alive   false   2         0.11.0-dev  dc1         global
root@baa15f77ae18:/etc/nomad.d/tls# NOMAD_ADDR=https://172.19.0.5:4646/ nomad server members
Error querying servers: Get https://172.19.0.5:4646/v1/agent/members: x509: certificate is valid for 127.0.0.1, not 172.19.0.5

# Nomad allocs, note that the error message here is misleading
root@baa15f77ae18:/etc/nomad.d/tls# NOMAD_ADDR=https://127.0.0.1:4646/ nomad alloc logs --job example
hi
root@baa15f77ae18:/etc/nomad.d/tls# NOMAD_ADDR=https://172.19.0.5:4646/ nomad alloc logs --job example
Error fetching allocations: job "example" doesn't exist or it has no allocations

These correspond to error log messages like:

2020/03/05 19:17:46.771681 http: TLS handshake error from 172.19.0.2:52244: remote error: tls: bad certificate

So next steps for us would be:

  • clarify in the docs that the certs should include the IPs/domains of hosts
  • correct the error message for nomad alloc logs (and potentially other commands) when the error is a connection error rather than an alloc-not-found error (a quick way to surface the underlying TLS error is sketched below)
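
As a quick way to surface the underlying TLS error directly (instead of the misleading "no allocations" message), you can hit the HTTP API with curl using the same CA and client cert; the file names and target address below match the test cluster above:

$ # Query the allocations endpoint directly; a cert/host mismatch then shows up
$ # as a TLS verification error from curl instead of an empty allocation list
$ curl --cacert nomad-ca.pem --cert cli.pem --key cli-key.pem \
    https://172.19.0.5:4646/v1/job/example/allocations

(The exact curl error text varies by version, but it names the certificate mismatch rather than the job.)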


henrikjohansen commented Mar 5, 2020

@notnoop Well, the docs specifically state that adding SANs is considered an anti-pattern for most Nomad deployments:

However, hosts (and therefore hostnames and IPs) are often ephemeral in Nomad clusters. Not only would signing a new certificate per Nomad node be difficult, but using a hostname provides no security or functional benefits to Nomad. To fulfill the desired security properties (above) Nomad certificates are signed with their region and role.

All nomad alloc subcommands should respect this.

It might be worth noting that nomad alloc exec is the only command that we have seen these types of TLS errors with. nomad alloc logs and nomad alloc fs have always worked fine.


notnoop commented Mar 5, 2020

Thanks for pointing this out - I'll discuss it with the team and follow up.

It might be worth noting that nomad alloc exec is the only command that we have seen these types of TLS errors with. nomad alloc logs and nomad alloc fs have always worked fine.

Can you try running the commands I posted above in my sample? Other commands were failing for me too.


henrikjohansen commented Mar 5, 2020

$ nomad alloc logs -job example

1:C 03 Mar 08:41:33.368 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.12 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 1
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

1:M 03 Mar 08:41:33.369 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 03 Mar 08:41:33.369 # Server started, Redis version 3.2.12
1:M 03 Mar 08:41:33.370 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 03 Mar 08:41:33.370 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 03 Mar 08:41:33.370 * The server is now ready to accept connections on port 6379

$ nomad alloc fs -job example

drwxrwxrwx  4.0 KiB  2020-03-03T08:41:32Z  alloc/
drwxrwxrwx  4.0 KiB  2020-03-03T08:41:32Z  redis/

$ nomad alloc exec e50a9d58 /bin/sh

failed to exec into task: x509: certificate is valid for 127.0.0.1, not 1.2.3.4

All our clusters have the same behavior ... only nomad alloc exec fails with an invalid certificate ... and all 3 clusters were built using these docs.

Literally every single nomad subcommand except nomad alloc exec works correctly.


notnoop commented Mar 5, 2020

Sorry, I meant the steps in #7233 (comment) where you switch NOMAD_ADDR to query other servers rather than localhost.


henrikjohansen commented Mar 5, 2020

We never query localhost in production.

NOMAD_ADDR is always set to https://FQDN:4646


notnoop commented Mar 5, 2020

I see - I'm afraid I'm still seeing other commands doing hostname validation. Thank you for your patience as you walk me through this.

Just to confirm: in your setup, the nomad server (or load balancer, if any) servicing the FQDN is configured with a cert that doesn't have the FQDN in its SAN values? Can you run openssl s_client -connect <<FQDN>>:4646 2>/dev/null | openssl x509 -noout -text | grep -i -e DNS -e IP -e CN and let me know the (redacted) output?

Here are my steps to check hostname validation, using a custom FQDN and the guide above with the Nomad 0.10.4 binary. In all cases, the CLI command fails when I use an IP/host that doesn't match the cert SAN/CN values:

$ NOMAD_ADDR=https://127.0.0.1:4646/ nomad server members
Name                 Address     Port  Status  Leader  Protocol  Build   Datacenter  Region
3b055a5535f0.global  172.18.0.5  4648  alive   true    2         0.10.4  dc1         global
8011493b785c.global  172.18.0.3  4648  alive   false   2         0.10.4  dc1         global
dfcb76ec3d8b.global  172.18.0.2  4648  alive   false   2         0.10.4  dc1         global
$ NOMAD_ADDR=https://172.18.0.3:4646/ nomad server members
Error querying servers: Get https://172.18.0.3:4646/v1/agent/members: x509: certificate is valid for 127.0.0.1, not 172.18.0.3
$ echo '172.18.0.3    nomad.corp.example.com' | sudo tee -a /etc/hosts
172.18.0.3    nomad.corp.example.com
$ NOMAD_ADDR=https://nomad.corp.example.com:4646/ nomad server members
Error querying servers: Get https://nomad.corp.example.com:4646/v1/agent/members: x509: certificate is valid for server.global.nomad, localhost, not nomad.corp.example.com

In my test cluster, the certificate information from the command above is:

$ openssl s_client -connect 172.18.0.3:4646 2>/dev/null | openssl x509 -noout -text  |grep -i -e DNS -e IP -e CN
        Issuer: C=US, ST=CA, L=San Francisco, CN=example.net
                Digital Signature, Key Encipherment
                DNS:server.global.nomad, DNS:localhost, IP Address:127.0.0.1

@henrikjohansen (Author)

Ah, I see the confusion. The server certificates have their respective hostnames added as SANs since they are static. The client certificates, however, do not, as they change rather frequently.


notnoop commented Mar 5, 2020

Perfect - that clarifies everything. PR #7274 fixes your case then: alloc exec will handle the case where the client cert isn't configured with the IP address, in line with the other alloc commands.

Indeed, the documentation you linked to is ambiguous now. It implies that, for Nomad's purposes, you don't need host SANs/IPs for any node, not even the servers, and it only mentions them for integration with other tools, e.g. curl. This is not correct. We should update it to call out the benefit of having host SAN values on Nomad servers, and to note that the CLI acts just like other tools and doesn't special-case cert roles.
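
As a concrete sketch of what that doc update could suggest - assuming the cfssl workflow from the guide - the server certificate would simply be issued with the host's FQDN/IP alongside the role name (the FQDN and IP below are the placeholders used earlier in this thread):

$ # Issue a server cert whose SANs cover both the Nomad role name and the host itself
$ echo '{}' | cfssl gencert -ca=nomad-ca.pem -ca-key=nomad-ca-key.pem -config=cfssl.json \
    -hostname="server.global.nomad,nomad.corp.example.com,localhost,127.0.0.1,172.19.0.5" - \
    | cfssljson -bare server

The CLI and browsers then validate the FQDN/IP the usual way, while the role name stays available for server-to-server RPC verification.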
