
Vault API Client Tweaks #4817

Merged: 14 commits merged into master from b-vault-client-tweaks on Nov 20, 2018

Conversation

@notnoop (Contributor) commented Oct 30, 2018

Some tweaks for Vault client token renewal:

  • Pass a UserAgent header when calling Vault
  • Increment a counter on Vault token renewal errors
  • As the Vault token's expiry approaches, enforce a 1-second floor on the renewal delay (see the sketch below)
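
A minimal sketch of the floor behavior described in the last bullet; the renewalDelay helper and the halve-the-remaining-lease strategy are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// renewalDelay schedules the next renewal attempt at roughly half the
// remaining lease, but never sooner than a 1-second floor, even as the
// token's expiry approaches. Hypothetical helper for illustration only.
func renewalDelay(leaseRemaining time.Duration) time.Duration {
	delay := leaseRemaining / 2
	if delay < time.Second {
		return time.Second
	}
	return delay
}

func main() {
	fmt.Println(renewalDelay(30 * time.Second))       // 15s
	fmt.Println(renewalDelay(500 * time.Millisecond)) // clamped to 1s
}
```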

nomad/vault.go Outdated
@@ -1228,8 +1239,18 @@ func (v *vaultClient) EmitStats(period time.Duration, stopCh chan struct{}) {
		case <-time.After(period):
			stats := v.Stats()
			metrics.SetGauge([]string{"nomad", "vault", "distributed_tokens_revoking"}, float32(stats.TrackedForRevoke))
			metrics.SetGauge([]string{"nomad", "vault", "token_ttl"}, float32(tokenTTL(v.tokenData)/time.Millisecond))
Contributor Author:

The metric here would be better with a server label to identify the problematic Nomad server, but it looks like none of the other metrics carry a server ID. I went with consistency in this case so we can address all metrics together later. I suspect it would be better to add a global label at the metrics-agent level than to pollute the metric key.

Users can currently set additional tags at the metrics-agent level (e.g. the Datadog agent), or by setting server-specific keys in the Nomad config (e.g. statsd_tags).

nomad/vault.go Outdated

// Copy all the stats
stats.TrackedForRevoke = v.stats.TrackedForRevoke
if v.tokenData != nil {
Contributor:

This is only set when establishing the connection, so after we pass the original expire time, the TTL we display will be negative. We need to update the tokenData's expire time on each renew. That is what lastRenewed was for: https://github.com/hashicorp/nomad/pull/4817/files#diff-87a4c76eca3de54716c55ba6b8c06475L558

nomad/vault.go Outdated
@@ -474,13 +477,10 @@ func (v *vaultClient) renewalLoop() {
		case <-authRenewTimer.C:
			// Renew the token and determine the new expiration
			err := v.renew()
-			currentExpiration := v.lastRenewed.Add(time.Duration(v.tokenData.CreationTTL) * time.Second)
+			currentExpiration := v.tokenData.ExpireTime
Contributor:

This is never getting updated; renewal will break

Contributor Author:

Good catch, thanks. I think we'll need to update the field to protect against lease TTLs differing from the creation TTL and/or when we get closer to the token TTL. I'll rework this.

@notnoop (Contributor, author) left a comment:

Updated the PR to avoid relying on tokenData being the latest, and to simply track the expiration time as a field instead. This is because renewSelf does not return the full token data and does not contain an expire_time field.
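
A rough sketch of that approach, assuming the expiration field is pushed forward by the lease duration Vault returns from the renewal response; the extendExpiration name and trimmed-down struct are illustrative, not the PR's exact code:

```go
package vaultsketch

import "time"

// vaultClient is trimmed down to the single field this sketch needs.
type vaultClient struct {
	currentExpiration time.Time
}

// extendExpiration records a new expiry after a successful renew-self call,
// using the lease duration (in seconds) from the renewal response, since that
// response does not carry an expire_time field. Hypothetical helper.
func (v *vaultClient) extendExpiration(leaseDurationSeconds int) {
	v.currentExpiration = time.Now().Add(time.Duration(leaseDurationSeconds) * time.Second)
}
```

In renew(), something along these lines would be called with the lease duration from the Vault renewal response after every successful renewal, so the tracked expiry keeps moving forward.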

@@ -528,6 +530,49 @@ func TestVaultClient_RenewalLoop(t *testing.T) {
	if ttl == 0 {
		t.Fatalf("token renewal failed; ttl %v", ttl)
	}

	if client.currentExpiration.Before(time.Now()) {
		t.Fatalf("found current expiration to be in past %s", time.Until(client.currentExpiration))
Contributor Author:

This check was added to ensure that after the loop we track the right expiration. In an earlier attempt, currentExpiration was a past value that caused the renewal loop to renew on every iteration without any pause.

nomad/vault.go Outdated
@@ -203,16 +210,12 @@ type vaultClient struct {
	// childTTL is the TTL for child tokens.
	childTTL string

-	// lastRenewed is the time the token was last renewed
-	lastRenewed time.Time
+	// currentExpiration is the time the current tokean lease expires
Member:

Suggested change
// currentExpiration is the time the current tokean lease expires
// currentExpiration is the time the current token lease expires

func (v *vaultClient) renew() error {
// returned. The boolean indicates whether it's safe to attempt to renew again.
// This method updates the currentExpiration time
func (v *vaultClient) renew() (bool, error) {
Member:

Just FYI we have a RecoverableError type we sometimes use in circumstances like this. Since the recoverable aspect is handled within this package and doesn't cross public API boundaries, I don't think there's a compelling reason to switch to it.
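
For illustration only, a caller might consume the (bool, error) pair along these lines; this retry wrapper is an assumption sketched from the discussion, not the PR's actual renewal loop:

```go
package vaultsketch

import "time"

// renewer abstracts the renew method under discussion: the bool reports
// whether it is safe to attempt another renewal.
type renewer interface {
	renew() (bool, error)
}

// retryRenew keeps retrying while renew reports the failure as recoverable
// and gives up once it does not. Hypothetical wrapper for illustration.
func retryRenew(r renewer, backoff time.Duration) error {
	for {
		recoverable, err := r.renew()
		if err == nil {
			return nil
		}
		if !recoverable {
			return err // e.g. Vault reports the token as expired or revoked
		}
		time.Sleep(backoff)
	}
}
```

The same shape could be expressed with a recoverable-error type instead of a separate boolean; as noted above, since the recoverable aspect stays inside this package, either works.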


_, err = client.renew()
require.NoError(t, err)
exp1 := client.currentExpiration
Member:

Does reading this variable without locks trigger the race detector?

Contributor Author:

Looking at the race job in [1], I don't see any reported race issues, but I suspect technically there should be. However, given that in this test we renew on the same goroutine where we check the value, it's not an issue, even if another goroutine could update it later on.

[1] https://travis-ci.org/hashicorp/nomad/jobs/457677344

Mahmood Ali added 14 commits on November 20, 2018 at 17:10, including:

  • Add a gauge to track the remaining time-to-live and the duration of the renewal request API call.
  • Seems like the stats field is a micro-optimization that doesn't justify the complexity it introduces; remove it and compute the stats from the revoking field directly.
  • Return Vault TTL info in the /agent/self API and the `nomad agent-info` command.
  • Keep attempting to renew the Vault token past the locally recorded expiry, just in case the token was renewed out of band (e.g. on another Nomad server), until Vault returns an unrecoverable error.
@notnoop merged commit d7f02ed into master on Nov 20, 2018
@notnoop deleted the b-vault-client-tweaks branch on November 21, 2018 02:50
// TokenTTL is the time-to-live duration for the current token
TokenTTL time.Duration

// TokenExpiry Time is the recoreded expiry time of the current token
Contributor:

Suggested change
// TokenExpiry Time is the recoreded expiry time of the current token
// TokenExpiry Time is the recorded expiry time of the current token

data.Root = root
v.tokenData = &data
v.currentExpiration = time.Now().Add(time.Duration(data.TTL) * time.Second)
Contributor:

There is a read/write race between renew and stats(). Use a lock
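
A minimal sketch of the kind of guard being asked for here, assuming a simplified struct and method names rather than the PR's exact code: the renewal path (writer) and the stats path (reader) share one mutex.

```go
package vaultsketch

import (
	"sync"
	"time"
)

// lockedVaultClient is reduced to what this sketch needs: a mutex shared by
// the renewal path (writer) and the stats path (reader).
type lockedVaultClient struct {
	lock              sync.Mutex
	currentExpiration time.Time
}

// setExpiration is the write side, called after a successful renewal or
// after establishing the connection.
func (v *lockedVaultClient) setExpiration(ttlSeconds int) {
	v.lock.Lock()
	defer v.lock.Unlock()
	v.currentExpiration = time.Now().Add(time.Duration(ttlSeconds) * time.Second)
}

// tokenTTL is the read side, used when emitting stats; it takes the same
// lock to avoid a data race with setExpiration.
func (v *lockedVaultClient) tokenTTL() time.Duration {
	v.lock.Lock()
	defer v.lock.Unlock()
	return time.Until(v.currentExpiration)
}
```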

@github-actions (bot):

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Feb 23, 2023