
vault: expired tokens count toward batch limit #8553

Merged
schmichael merged 1 commit into master from b-vault-revoke-batch on Jul 29, 2020

Conversation

schmichael
Member

As of 0.11.3 Vault token revocation and purging was done in batches.
However the batch size was only limited by the number of non-expired
tokens being revoked.

Due to bugs prior to 0.11.3, expired tokens were not properly purged.
Long-lived clusters could have thousands to millions of very old
expired tokens that never got purged from the state store.

Since these expired tokens did not count against the batch limit, very
large batches could be created and overwhelm servers.

This commit ensures expired tokens count toward the batch limit with
this one line change:

```
- if len(revoking) >= toRevoke {
+ if len(revoking)+len(ttlExpired) >= toRevoke {
```

However, this code was difficult to test due to being in a periodically
executing loop. Most of the changes are to make this one line change
testable and test it.
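For readers skimming the diff, here is a minimal, self-contained sketch of the batching behavior this change enforces. The tokenAccessor type, the nextBatch helper, and the field names are illustrative assumptions, not the actual Nomad code:

```go
package main

import (
	"fmt"
	"time"
)

// tokenAccessor is a hypothetical stand-in for Nomad's Vault accessor record.
type tokenAccessor struct {
	Accessor   string
	CreateTime time.Time
	TTL        time.Duration
}

// nextBatch splits pending accessors into tokens that still need revocation
// against Vault and tokens whose TTL already expired (purge-only), capping the
// combined size at batchLimit, which is what the one-line fix enforces.
func nextBatch(pending []*tokenAccessor, batchLimit int, now time.Time) (revoking, ttlExpired []*tokenAccessor) {
	for _, va := range pending {
		if now.After(va.CreateTime.Add(va.TTL)) {
			ttlExpired = append(ttlExpired, va)
		} else {
			revoking = append(revoking, va)
		}

		// Expired tokens count toward the limit too, so a backlog of very old
		// expired tokens can no longer produce an unbounded batch.
		if len(revoking)+len(ttlExpired) >= batchLimit {
			break
		}
	}
	return revoking, ttlExpired
}

func main() {
	now := time.Now()
	pending := []*tokenAccessor{
		{Accessor: "a1", CreateTime: now.Add(-2 * time.Hour), TTL: time.Hour}, // expired
		{Accessor: "a2", CreateTime: now, TTL: time.Hour},                     // live
		{Accessor: "a3", CreateTime: now, TTL: time.Hour},                     // live
	}
	revoking, expired := nextBatch(pending, 2, now)
	fmt.Println(len(revoking), len(expired)) // prints "1 1": the batch is capped at 2
}
```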

Contributor

@notnoop notnoop left a comment


LGTM. Testing nitpicks below that you can ignore.

Comment on lines 1730 to 1732
if err != nil {
t.Fatalf("failed to build vault client: %v", err)
}
Contributor

We can use require for consistency with other assertions:

Suggested change
if err != nil {
t.Fatalf("failed to build vault client: %v", err)
}
require.NoError(t, err)

Comment on lines +1705 to +1710
v.Config.Token = defaultTestVaultWhitelistRoleAndToken(v, t, 5)

// Disable client until we can change settings for testing
conf := v.Config.Copy()
conf.Enabled = helper.BoolToPtr(false)
Contributor

I suspect the copy isn't necessary here since we are still initializing the config. If the copy is needed, it's unclear why the token can be modified without copying first.

Member Author

I just noticed that vaultClient keeps a reference to the Config, so I copied it out of an abundance of caution.

resultCh <- nil
}

return nil
Contributor

This assertion is good as-is, but I would like to add a check that we actually received all accessors, not just that we received 3 batches no bigger than the expected size.

Maybe add a counter for the number of purge calls and for how many accessors have been received so far. Also, you could send nil on resultCh (or close it) only once all accessors have been received across the 3 batches, which would let you avoid the loop below. See the sketch after this comment.
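A rough sketch of the shape I have in mind, assuming resultCh is the error channel the test already waits on and maxVaultRevokeBatchSize is the batch constant; the string accessors and the purge hook signature are simplified placeholders, not the actual test code:

```go
// Hypothetical fragment for the test's purge hook.
var purgeCalls, accessorsSeen int
totalAccessors := 3 * maxVaultRevokeBatchSize // we expect exactly 3 batches

purge := func(accessors []string) error {
	purgeCalls++
	accessorsSeen += len(accessors)

	// Every batch must respect the limit.
	if len(accessors) > maxVaultRevokeBatchSize {
		resultCh <- fmt.Errorf("batch too large: %d", len(accessors))
		return nil
	}

	// Signal completion only after every accessor has been purged, so the
	// test no longer needs a separate draining loop.
	if accessorsSeen == totalAccessors {
		close(resultCh)
	}
	return nil
}
```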

Member Author

Much better idea, thanks.

```
@@ -1305,7 +1311,7 @@ func (v *vaultClient) revokeDaemon() {
 		revoking = append(revoking, va)
 	}
 
-	if len(revoking) >= toRevoke {
+	if len(revoking)+len(ttlExpired) >= toRevoke {
```
Contributor

Maybe add a comment indicating that maxVaultRevokeBatchSize is meant to constrain the batch size both for submitting requests to Vault and for limiting the size of Raft messages, hence the need to account for ttlExpired too.
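For example, the comment above the check might read something like this (wording is just a suggestion):

```go
// maxVaultRevokeBatchSize caps how many accessors are handled per pass. It
// bounds both the revocation requests sent to Vault and the size of the Raft
// message that purges accessors from the state store, so TTL-expired tokens
// must count toward the limit as well.
if len(revoking)+len(ttlExpired) >= toRevoke {
```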

@chuckyz
Contributor

chuckyz commented Jul 28, 2020

Is this PR an appropriate place to request some metrics around this? I think it'd be useful to have the number of tokens that are active/to-be-revoked/being-revoked as a way to see things like high job churn, or a rare case I've seen where tokens just seem to be lost (e.g. I've seen allocations with {{ with secret "pki" "ttl=7d" }} that never refresh).

@schmichael
Member Author

@chuckyz That's a great idea. At a glance I couldn't decide exactly what metric to add, so I'm going to go ahead and merge this PR. Please open an issue if you can think of an exact metric to track here; otherwise I'll probably aim for tracking the pending revocations.
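If pending revocations were tracked, a minimal sketch using the go-metrics library Nomad already depends on could look like the following; the metric name, function, and package placement are hypothetical, not a committed design:

```go
package vault // hypothetical placement

import metrics "github.com/armon/go-metrics"

// emitPendingRevocations reports how many accessors are currently queued for
// async revocation. "token_revocations_pending" is an illustrative name only.
func emitPendingRevocations(pending int) {
	metrics.SetGauge([]string{"nomad", "vault", "token_revocations_pending"}, float32(pending))
}
```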

@schmichael schmichael merged commit 7eee26a into master Jul 29, 2020
@schmichael schmichael deleted the b-vault-revoke-batch branch July 29, 2020 01:16
schmichael added a commit that referenced this pull request Jul 29, 2020
schmichael added a commit that referenced this pull request Jul 29, 2020
*Cherry-pick of #8553 to branch off of v0.11.3 tag.*

schmichael added a commit that referenced this pull request Jul 29, 2020
Fix some capitalization too.
schmichael added a commit that referenced this pull request Jul 29, 2020
schmichael added a commit that referenced this pull request Aug 5, 2020
This log line should be rare since:

1. Most tokens should be logged synchronously, not via this async
   batched method. Async revocation only takes place when Vault
   connectivity is lost and after leader election so no revocations are
   missed.
2. There should rarely be >1 batch (1,000) tokens to revoke since the
   above conditions should be brief and infrequent.
3. Interval is 5 minutes, so this log line will be emitted at *most*
   once every 5 minutes.

What makes this log line rare is also what makes it interesting: due to
a bug prior to Nomad 0.11.2 some tokens may never get revoked. Therefore
Nomad tries to re-revoke them on every leader election. This caused a
massive buildup of old tokens that would never be properly revoked and
purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging
revoked tokens via Raft (fixed in #8553).

The nomad.vault.distributed_tokens_revoked metric is only ticked upon
successful revocation and purging, making any bugs or slowness in the
process difficult to detect.

Logging before a potentially slow revocation+purge operation is
performed will give users much better indications of what activity is
going on should the process fail to make it to the metric.
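
A rough sketch of the ordering the commit message describes, inside the assumed periodic loop: log first, then revoke and purge, and only then tick the metric. The parallelRevoke and purgeFn names are assumptions about the surrounding code, not verbatim Nomad internals:

```go
// Log before the potentially slow work so operators see activity even if
// revocation or purging fails before the metric is incremented.
v.logger.Info("revoking vault accessors", "revoking", len(revoking), "expired", len(ttlExpired))

if err := v.parallelRevoke(ctx, revoking); err != nil {
	v.logger.Warn("failed to revoke tokens, will retry", "error", err)
	continue
}

if err := v.purgeFn(append(revoking, ttlExpired...)); err != nil {
	v.logger.Warn("failed to purge accessors, will retry", "error", err)
	continue
}

// Only incremented once both revocation and purging succeed, which is why the
// log line above is the earliest visible signal when something goes wrong.
metrics.IncrCounter([]string{"nomad", "vault", "distributed_tokens_revoked"}, float32(len(revoking)))
```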
schmichael added a commit that referenced this pull request Aug 5, 2020
schmichael added a commit that referenced this pull request Aug 6, 2020
schmichael added a commit that referenced this pull request Aug 6, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2022