
vault: expired tokens count toward batch limit #8553

Merged
schmichael merged 1 commit into master from b-vault-revoke-batch on Jul 29, 2020

Conversation

schmichael
Member

As of 0.11.3 Vault token revocation and purging was done in batches.
However the batch size was only limited by the number of non-expired
tokens being revoked.

Due to bugs prior to 0.11.3, expired tokens were not properly purged.
Long-lived clusters could have thousands to millions of very old
expired tokens that never got purged from the state store.

Since these expired tokens did not count against the batch limit, very
large batches could be created and overwhelm servers.

This commit ensures expired tokens count toward the batch limit with
this one line change:

```
- if len(revoking) >= toRevoke {
+ if len(revoking)+len(ttlExpired) >= toRevoke {
```

However, this code was difficult to test due to being in a periodically
executing loop. Most of the changes are to make this one line change
testable and test it.
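For readers skimming the diff, here is a minimal, self-contained sketch of the batching behavior this change enforces. The tokenAccessor type, the nextBatch helper, and the field names are illustrative assumptions, not the actual Nomad code:

```go
package main

import (
	"fmt"
	"time"
)

// tokenAccessor is a hypothetical stand-in for Nomad's Vault accessor record.
type tokenAccessor struct {
	Accessor   string
	CreateTime time.Time
	TTL        time.Duration
}

// nextBatch splits pending accessors into tokens that still need revocation
// against Vault and tokens whose TTL already expired (purge-only), capping the
// combined size at batchLimit, which is what the one-line fix enforces.
func nextBatch(pending []*tokenAccessor, batchLimit int, now time.Time) (revoking, ttlExpired []*tokenAccessor) {
	for _, va := range pending {
		if now.After(va.CreateTime.Add(va.TTL)) {
			ttlExpired = append(ttlExpired, va)
		} else {
			revoking = append(revoking, va)
		}

		// Expired tokens count toward the limit too, so a backlog of very old
		// expired tokens can no longer produce an unbounded batch.
		if len(revoking)+len(ttlExpired) >= batchLimit {
			break
		}
	}
	return revoking, ttlExpired
}

func main() {
	now := time.Now()
	pending := []*tokenAccessor{
		{Accessor: "a1", CreateTime: now.Add(-2 * time.Hour), TTL: time.Hour}, // expired
		{Accessor: "a2", CreateTime: now, TTL: time.Hour},                     // live
		{Accessor: "a3", CreateTime: now, TTL: time.Hour},                     // live
	}
	revoking, expired := nextBatch(pending, 2, now)
	fmt.Println(len(revoking), len(expired)) // prints "1 1": the batch is capped at 2
}
```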

Contributor

@notnoop notnoop left a comment


LGTM. Testing nitpicks below that you can ignore.

Comment on lines 1730 to 1732
if err != nil {
t.Fatalf("failed to build vault client: %v", err)
}
Contributor

We can use require for consistency with other assertions:

Suggested change
if err != nil {
t.Fatalf("failed to build vault client: %v", err)
}
require.NoError(t, err)

Comment on lines +1705 to +1710
v.Config.Token = defaultTestVaultWhitelistRoleAndToken(v, t, 5)

// Disable client until we can change settings for testing
conf := v.Config.Copy()
conf.Enabled = helper.BoolToPtr(false)
Contributor

I suspect the copy isn't necessary here since we are still initializing the config. If the copy is needed, it's unclear why the token can be modified without copying first.

Member Author

I just noticed that vaultClient keeps a reference to the Config, so I copied it out of an abundance of caution.

resultCh <- nil
}

return nil
Contributor

This assertion is good as-is, but I would like to add a check that we actually received all accessors, not just that we received 3 batches no bigger than the expected size.

Maybe add a counter for the number of purge calls and for how many accessors have been received so far. Also, you could send nil on resultCh (or close it) only once all accessors have been received across the 3 batches, which would let you avoid the loop below. See the sketch after this comment.
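A rough sketch of the shape I have in mind, assuming resultCh is the error channel the test already waits on and maxVaultRevokeBatchSize is the batch constant; the string accessors and the purge hook signature are simplified placeholders, not the actual test code:

```go
// Hypothetical fragment for the test's purge hook.
var purgeCalls, accessorsSeen int
totalAccessors := 3 * maxVaultRevokeBatchSize // we expect exactly 3 batches

purge := func(accessors []string) error {
	purgeCalls++
	accessorsSeen += len(accessors)

	// Every batch must respect the limit.
	if len(accessors) > maxVaultRevokeBatchSize {
		resultCh <- fmt.Errorf("batch too large: %d", len(accessors))
		return nil
	}

	// Signal completion only after every accessor has been purged, so the
	// test no longer needs a separate draining loop.
	if accessorsSeen == totalAccessors {
		close(resultCh)
	}
	return nil
}
```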

Member Author

Much better idea, thanks.

```
@@ -1305,7 +1311,7 @@ func (v *vaultClient) revokeDaemon() {
 		revoking = append(revoking, va)
 	}
 
-	if len(revoking) >= toRevoke {
+	if len(revoking)+len(ttlExpired) >= toRevoke {
```
Contributor

Maybe add a comment indicating that maxVaultRevokeBatchSize is meant to constrain the batch size both for submitting requests to Vault and for limiting the size of Raft messages, hence the need to account for ttlExpired too.
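For example, the comment above the check might read something like this (wording is just a suggestion):

```go
// maxVaultRevokeBatchSize caps how many accessors are handled per pass. It
// bounds both the revocation requests sent to Vault and the size of the Raft
// message that purges accessors from the state store, so TTL-expired tokens
// must count toward the limit as well.
if len(revoking)+len(ttlExpired) >= toRevoke {
```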

@chuckyz
Contributor

chuckyz commented Jul 28, 2020

Is this PR an appropriate place to request some metrics around this? I think it'd be useful to have the number of tokens that are active/to-be-revoked/being-revoked as a way to see things like high job churn, or a rare case I've seen where tokens just seem to be lost (e.g. I've seen allocations with {{ with secret "pki" "ttl=7d" }} that never refresh).

@schmichael
Member Author

@chuckyz That's a great idea. At a glance I couldn't decide exactly what metric to add, so I'm going to go ahead and merge this PR. Please open an issue if you can think of an exact metric to track here; otherwise I'll probably aim for tracking the pending revocations.
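If pending revocations were tracked, a minimal sketch using the go-metrics library Nomad already depends on could look like the following; the metric name, function, and package placement are hypothetical, not a committed design:

```go
package vault // hypothetical placement

import metrics "github.com/armon/go-metrics"

// emitPendingRevocations reports how many accessors are currently queued for
// async revocation. "token_revocations_pending" is an illustrative name only.
func emitPendingRevocations(pending int) {
	metrics.SetGauge([]string{"nomad", "vault", "token_revocations_pending"}, float32(pending))
}
```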

@schmichael schmichael merged commit 7eee26a into master Jul 29, 2020
@schmichael schmichael deleted the b-vault-revoke-batch branch July 29, 2020 01:16
schmichael added a commit that referenced this pull request Jul 29, 2020
schmichael added a commit that referenced this pull request Jul 29, 2020
*Cherry-pick of #8553 to branch off of v0.11.3 tag.*

schmichael added a commit that referenced this pull request Jul 29, 2020
Fix some capitalization too.
schmichael added a commit that referenced this pull request Jul 29, 2020
schmichael added a commit that referenced this pull request Aug 5, 2020
This log line should be rare since:

1. Most tokens should be logged synchronously, not via this async
   batched method. Async revocation only takes place when Vault
   connectivity is lost and after leader election so no revocations are
   missed.
2. There should rarely be >1 batch (1,000) tokens to revoke since the
   above conditions should be brief and infrequent.
3. Interval is 5 minutes, so this log line will be emitted at *most*
   once every 5 minutes.

What makes this log line rare is also what makes it interesting: due to
a bug prior to Nomad 0.11.2 some tokens may never get revoked. Therefore
Nomad tries to re-revoke them on every leader election. This caused a
massive buildup of old tokens that would never be properly revoked and
purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging
revoked tokens via Raft (fixed in #8553).

The nomad.vault.distributed_tokens_revoked metric is only ticked upon
successful revocation and purging, making any bugs or slowness in the
process difficult to detect.

Logging before a potentially slow revocation+purge operation is
performed will give users much better indications of what activity is
going on should the process fail to make it to the metric.
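
A rough sketch of the ordering the commit message describes, inside the assumed periodic loop: log first, then revoke and purge, and only then tick the metric. The parallelRevoke and purgeFn names are assumptions about the surrounding code, not verbatim Nomad internals:

```go
// Log before the potentially slow work so operators see activity even if
// revocation or purging fails before the metric is incremented.
v.logger.Info("revoking vault accessors", "revoking", len(revoking), "expired", len(ttlExpired))

if err := v.parallelRevoke(ctx, revoking); err != nil {
	v.logger.Warn("failed to revoke tokens, will retry", "error", err)
	continue
}

if err := v.purgeFn(append(revoking, ttlExpired...)); err != nil {
	v.logger.Warn("failed to purge accessors, will retry", "error", err)
	continue
}

// Only incremented once both revocation and purging succeed, which is why the
// log line above is the earliest visible signal when something goes wrong.
metrics.IncrCounter([]string{"nomad", "vault", "distributed_tokens_revoked"}, float32(len(revoking)))
```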
schmichael added a commit that referenced this pull request Aug 5, 2020
schmichael added a commit that referenced this pull request Aug 6, 2020
schmichael added a commit that referenced this pull request Aug 6, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2022