Failed to connect Vault API in 0.9.2 log showing wrong address #5816

Closed
Dirrk opened this issue Jun 11, 2019 · 5 comments

Comments


Dirrk commented Jun 11, 2019

If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.

If filing a bug please include the following:

Nomad version

Output from nomad version
Nomad v0.9.2 (0283266)

Operating system and Environment details

Linux ip-10-REMOVED 4.4.0-1069-aws #79-Ubuntu SMP Mon Sep 24 15:01:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Issue

Nomad logs that it cannot connect to vault.service.consul, but I have not configured it to use that address anywhere.

Reproduction steps

Install Nomad 0.9.2 following the upgrade guide. I upgraded each instance by installing 0.9.2 and then restarting them one at a time.

It is run with:
/bin/nomad agent -config /etc/nomad/config.d

There are 3 config files, but only 1 has a vault stanza in it (the first in load order):

vault {
  enabled = true
  address = "https://vault.someaddress.loadbalancer.com"
  token = "mytoken"
}

Edit: the problem occurs when the vault stanza is in any file other than the last one loaded. If it is in the last file, it works.

I have tried changing it to "https://vault.someaddress.loadbalancer.com:443" as I thought maybe that could be the problem.

Job file (if appropriate)

Nomad Client logs (if appropriate)

I have not updated the clients yet as I was afraid this might cause an issue.


Nomad Server logs (if appropriate)

Jun 11 18:10:34 ip-10-REMOVED nomad[22466]: 2019-06-11T18:10:34.881Z [WARN ] nomad.vault: failed to contact Vault API: retry=30s error="Get https://vault.service.consul:8200/v1/sys/init: dial tcp: lookup vault.service.consul on 127.0.0.1:53: no such host"


Dirrk commented Jun 11, 2019

This error is blocking new jobs from starting: vault: server error deriving vault token: Connection to Vault has not been established


Dirrk commented Jun 11, 2019

I rolled back the 2 other servers and let them lead. I did not see this failure when I tested on a single server/client instance with a single .hcl config, so I started experimenting with the 0.9.2 server. First I tried simply reloading the config, but the issue still appeared in the logs. Then I added the vault stanza to the last file in the config directory, exactly as it appears in the first file, and reloaded. The logs show it is now working:

Jun 11 19:11:19 ip-10-REMOVED nomad[26559]: ==> Caught signal: hangup
Jun 11 19:11:19 ip-10-REMOVED nomad[26559]: ==> Reloading configuration...
Jun 11 19:11:19 ip-10-REMOVED nomad[26559]:     2019-06-11T19:11:19.953Z [DEBUG] agent: starting reload of server config
Jun 11 19:11:20 ip-10-REMOVED nomad[26559]:     2019-06-11T19:11:20.059Z [DEBUG] nomad.vault: not renewing token as it is root

Checking nomad agent-info now shows the vault stanza with a token:

  token_expire_time = 2019-06-11T19:11:20Z
  token_ttl = -5m39s
  tracked_for_revoked = 0

So I went back to my test environment, which is a single Nomad instance running in client/server mode. I created a new config2.hcl in the config folder containing only metadata and reloaded Nomad. I immediately got the error I was expecting:

Jun 11 15:18:50 webtest nomad[20436]: ==> Caught signal: hangup
Jun 11 15:18:50 webtest nomad[20436]: ==> Reloading configuration...
Jun 11 15:18:50 webtest nomad[20436]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Jun 11 15:18:50 webtest nomad[20436]:     2019-06-11T15:18:50.567-0400 [DEBUG] agent: starting reload of server config
Jun 11 15:18:50 webtest nomad[20436]:     2019-06-11T15:18:50.568-0400 [DEBUG] agent: starting reload of client config
Jun 11 15:18:53 webtest nomad[20436]:     2019-06-11T15:18:53.398-0400 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=229.494µs
Jun 11 15:18:54 webtest nomad[20436]:     2019-06-11T15:18:54.469-0400 [WARN ] nomad.vault: failed to contact Vault API: retry=30s error="Get https://vault.service.consul:8200/v1/sys/init: dial tcp: lookup vault.service.consul on 127.0.0.1:53: no such host"

Config directory listing:

-rw-r--r--  1 root  root   25 Jun 11 15:18 config2.hcl
-rwxr-xr-x  1 nomad root  690 Jun 11 15:18 config.hcl

Just to verify my results, I deleted config2.hcl and ran kill -SIGHUP 20436:

Jun 11 15:22:28 webtest nomad[20436]: ==> Caught signal: hangup
Jun 11 15:22:28 webtest nomad[20436]: ==> Reloading configuration...
Jun 11 15:22:28 webtest nomad[20436]:     2019-06-11T15:22:28.312-0400 [DEBUG] nomad: memberlist: Stream connection from=10.0.100.130:46240
Jun 11 15:22:28 webtest nomad[20436]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Jun 11 15:22:28 webtest nomad[20436]:     2019-06-11T15:22:28.338-0400 [DEBUG] agent: starting reload of server config
Jun 11 15:22:28 webtest nomad[20436]:     2019-06-11T15:22:28.338-0400 [DEBUG] agent: starting reload of client config
Jun 11 15:22:28 webtest nomad[20436]:     2019-06-11T15:22:28.468-0400 [DEBUG] nomad.vault: not renewing token as it is root


Dirrk commented Jun 11, 2019

Looking through the code: when we merge, we check whether the address is an empty string, as seen here. But when we initialize the config, we always initialize it with DefaultVaultConfig(), which sets the address to "https://vault.service.consul:8200". This most likely applies to ConnectionRetryIntv as well. The config_parse code seen here calls merge on the vault stanza in a way that always lets the default overwrite the user-defined value, unless the user-defined file is loaded last.

For example:

| file | file vault stanza | config vault value after parsing |
| --- | --- | --- |
| config_1.hcl | vault { addr = "https://userdefined.host.com" } | {addr: "https://userdefined.host.com"} |
| config_2.hcl | nil | {addr: "https://vault.service.consul:8200"} |

config_1.vault.merge(config_2.vault) => {addr: "https://vault.service.consul:8200"}
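
To make the ordering effect concrete, here is a minimal Go sketch of the behavior described above. It is not Nomad's actual code: the `VaultConfig` type, `merge`, `parseFile`, and `loadDir` are hypothetical simplifications, assuming only that each file is parsed on top of the defaults and that merge copies every non-empty field from the later file.

```go
package main

import "fmt"

// VaultConfig is a simplified, hypothetical stand-in for the agent's Vault settings.
type VaultConfig struct {
	Addr string
}

// defaultVaultConfig returns the default address that shows up in the logs.
func defaultVaultConfig() VaultConfig {
	return VaultConfig{Addr: "https://vault.service.consul:8200"}
}

// merge returns a copy of a with every non-empty field of b applied on top.
func (a VaultConfig) merge(b VaultConfig) VaultConfig {
	out := a
	if b.Addr != "" {
		out.Addr = b.Addr
	}
	return out
}

// parseFile simulates parsing one file: start from the defaults, then apply
// what the file sets. An empty userAddr means the file has no vault stanza.
func parseFile(userAddr string) VaultConfig {
	cfg := defaultVaultConfig()
	if userAddr != "" {
		cfg.Addr = userAddr
	}
	return cfg
}

// loadDir merges the parsed files in load order.
func loadDir(userAddrs ...string) VaultConfig {
	var result VaultConfig
	for _, addr := range userAddrs {
		// A file without a vault stanza still carries the default address,
		// which is non-empty and therefore wins the merge.
		result = result.merge(parseFile(addr))
	}
	return result
}

func main() {
	const user = "https://userdefined.host.com"
	fmt.Println(loadDir(user, "").Addr) // stanza in the first file: default wins
	fmt.Println(loadDir("", user).Addr) // stanza in the last file: user value wins
}
```

In this model the user-defined address only survives when its file is merged last, which matches what I see.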

I think this would also affect any other setting that has a default value when the configuration is spread across multiple files. I partially verified this with my own config.

In the primary config I have:

consul {
  checks_use_advertise = true
}

To check:

curl -s localhost:4646/v1/agent/self | jq '.config.Consul.ChecksUseAdvertise'
false
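
One possible way to avoid this, sketched in Go below, is to parse each file into an empty config, merge the files together, and apply the defaults exactly once underneath the result. This is only an illustration and not necessarily how the project fixes it: `ConsulConfig`, `boolPtr`, `defaults`, and `loadDir` are hypothetical names, and both the use of a nil pointer for "not set" and the default of false are assumptions.

```go
package main

import "fmt"

// ConsulConfig is a simplified, hypothetical stand-in.
// A nil pointer means the file did not set the field.
type ConsulConfig struct {
	ChecksUseAdvertise *bool
}

// boolPtr returns a pointer to a copy of b.
func boolPtr(b bool) *bool { return &b }

// merge returns a copy of a with every field that b explicitly sets applied on top.
func (a ConsulConfig) merge(b ConsulConfig) ConsulConfig {
	out := a
	if b.ChecksUseAdvertise != nil {
		out.ChecksUseAdvertise = boolPtr(*b.ChecksUseAdvertise)
	}
	return out
}

// defaults returns the assumed default value (false).
func defaults() ConsulConfig {
	return ConsulConfig{ChecksUseAdvertise: boolPtr(false)}
}

// loadDir merges the files in load order. Each file is represented by an
// empty struct plus whatever it sets, so a file without a consul stanza
// contributes nothing. The defaults are applied only once, at the bottom.
func loadDir(files ...ConsulConfig) ConsulConfig {
	var fromFiles ConsulConfig
	for _, f := range files {
		fromFiles = fromFiles.merge(f)
	}
	return defaults().merge(fromFiles)
}

func main() {
	primary := ConsulConfig{ChecksUseAdvertise: boolPtr(true)}
	other := ConsulConfig{} // a later file with no consul stanza

	fmt.Println(*loadDir(primary, other).ChecksUseAdvertise) // true
	fmt.Println(*loadDir(other).ChecksUseAdvertise)          // false
}
```

With the defaults kept out of the per-file parse, a later file that says nothing about consul can no longer reset checks_use_advertise.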

Contributor

notnoop commented Jun 11, 2019

Thank you so much for reporting this issue, and for the detailed investigation. This was fixed by #5817.

notnoop closed this as completed Jun 11, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022