keyring: update handle to state inside replication loop #15227

Merged (3 commits) Nov 17, 2022

Conversation

@tgross tgross commented Nov 11, 2022

Fixes #14981

When keyring replication starts, we take a handle to the state store. But whenever a snapshot is restored, this handle is invalidated and no longer points to a state store that is receiving new keys. This leaks a bunch of memory too!

In addition to operator-initiated restores, when fresh servers are added to existing clusters with large-enough state, keyring replication can start quickly enough that it's running before the snapshot from the existing cluster has been restored.

Fix this by updating the handle to the state store when we query.
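For illustration, here's a minimal sketch of the failure mode using hypothetical stand-in types (StateStore, FSM, and RootKeyMetas below are simplified placeholders, not the actual Nomad code):

```go
package main

import "fmt"

// StateStore and FSM are stand-ins for the real Nomad types.
type StateStore struct{ keys []string }

func (s *StateStore) RootKeyMetas() []string { return s.keys }

type FSM struct{ state *StateStore }

// State returns the current state store handle.
func (f *FSM) State() *StateStore { return f.state }

// Restore models a snapshot restore: the FSM swaps in a brand-new store.
func (f *FSM) Restore(restored *StateStore) { f.state = restored }

func main() {
	fsm := &FSM{state: &StateStore{keys: []string{"key-a"}}}

	stale := fsm.State() // handle taken once, before the replication loop starts

	fsm.Restore(&StateStore{keys: []string{"key-a", "key-b"}})

	fmt.Println(stale.RootKeyMetas())       // [key-a]        -- never sees key-b
	fmt.Println(fsm.State().RootKeyMetas()) // [key-a key-b]  -- a refreshed handle does
}
```

Re-fetching the handle inside the loop, as this PR does, means each replication pass sees whatever store the FSM currently holds.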

@tgross tgross added the theme/keyring, type/bug, and backport/1.4.x (backport to 1.4.x release line) labels Nov 11, 2022
@tgross tgross added this to the 1.4.3 milestone Nov 11, 2022
@@ -457,6 +456,7 @@ START:
goto ERR_WAIT // rate limit exceeded
}

store := krr.srv.fsm.State()
ws := store.NewWatchSet()
@tgross tgross Nov 11, 2022


I'm going to admit I feel like I'm supposed to be able to have this store.NewWatchSet at the top of the loop and then I can select on it with the rest of the context to replace the state handle... but I can't figure out which is the right thing to be polling on there and this diff fixes the test case I've got.

(And we should probably not create a watchset here at all if we're not using it to poll for changes?)

@schmichael (Member)

So I think this can be restructured a bit, but I could be wrong as I'm new to this area of code:

(*state.StateStore).AbandonCh() will get closed when a restore finishes, so I think just toss it in the select above and then you only need to do store = krr.srv.fsm.State() in that case block.

The WatchSet can be dropped and nil passed instead. WatchSets are internal to iradix indexes, while Abandoning is a StateStore concept. I think this is an accurate mental model:

Raft
 |
 v
FSM   <readers>
 |      ^
 v      |
StateStore <- Abandon
  |
  v
MemDB
  |
  v
iradix indexes <- Watches

If replicators were waiting for their local root key meta to be updated via the FSM then I think the watchset could be added to the select {} to be notified when it was updated. Since the root key meta is replicated by RPC outside of the FSM, I don't think a watchset is useful here.

But like I said: new code to me, so let's maybe zoom if this all sounds way off.
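A rough sketch of the shape being suggested (AbandonCh is a real StateStore method, but the types and loop below are simplified stand-ins, not the actual replicator code):

```go
package main

import (
	"context"
	"time"
)

// Stand-in for the real state store type.
type StateStore struct{ abandonCh chan struct{} }

// AbandonCh is closed when this store is replaced, e.g. by a snapshot restore.
func (s *StateStore) AbandonCh() <-chan struct{} { return s.abandonCh }

type FSM struct{ state *StateStore }

func (f *FSM) State() *StateStore { return f.state }

func replicate(ctx context.Context, fsm *FSM) {
	store := fsm.State()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-store.AbandonCh():
			// A restore finished: the old handle is dead, so grab the new store.
			store = fsm.State()
		case <-ticker.C:
			// Periodic replication pass using the current handle.
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	replicate(ctx, &FSM{state: &StateStore{abandonCh: make(chan struct{})}})
}
```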


tgross commented Nov 15, 2022

(*state.StateStore).AbandonCh() will get closed when a restore finishes, so I think just toss it in the select above and then you only need to do store = krr.srv.fsm.State() in that case block.

That's what I was looking for! Thank you!

If replicators were waiting for their local root key meta to be updated via the FSM then I think the watchset could be added to the select {} to be notified when it was updated. Since the root key meta is replicated by RPC outside of the FSM, I don't think a watchset is useful here.

The RootKeyMeta is replicated in the FSM; it's the root key material (the random bytes that make up the key) that's not. So a watchset would normally be what we want here. But I couldn't figure out how to make this work when bringing up a fresh node: while the snapshot restores, any updates fire watches on the new iradix tree, which we're not yet watching, so we'd never detect an update.

But I think we could do this similarly to how state.BlockingQuery works:

  • select on the state store abandon channel. when that channel is closed:
    • replace the store variable
    • create a new watchset that includes the abandon channel, and pass it to a new query for all the keys
  • select on the watchset. when that fires:
    • create a new watchset that includes the abandon channel, and pass it to a new query for all the keys
  • select on a timer every few seconds. That allows for periodic repair attempts when the on-disk keystore needs to be repaired by the cluster administrator after a botched recovery from backup.

That would let us lean on the efficiency of the watchset in the common case, while still handling new nodes and cases where the cluster needs to self-repair.
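A hand-wavy sketch of that shape (the types and method signatures below are illustrative stand-ins, not the real state package; the returned "watch" channel plays the role of a watchset):

```go
package main

import (
	"context"
	"time"
)

// Stand-in state store: each query returns the data plus a channel that
// fires when the queried index changes, standing in for a watchset.
type StateStore struct {
	abandonCh chan struct{}
	updateCh  chan struct{}
	keys      []string
}

func (s *StateStore) AbandonCh() <-chan struct{} { return s.abandonCh }

func (s *StateStore) RootKeyMetas() ([]string, <-chan struct{}) {
	return s.keys, s.updateCh
}

type FSM struct{ state *StateStore }

func (f *FSM) State() *StateStore { return f.state }

// replicateKeys would fetch any missing key material via RPC.
func replicateKeys(keys []string) {}

func replicate(ctx context.Context, fsm *FSM) {
	store := fsm.State()
	keys, watchCh := store.RootKeyMetas()
	replicateKeys(keys)

	timer := time.NewTimer(5 * time.Second)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-store.AbandonCh():
			// Restore finished: replace the handle before re-querying.
			store = fsm.State()
		case <-watchCh:
			// Key metadata changed in the FSM: fall through to re-query.
		case <-timer.C:
			// Periodic pass so a damaged on-disk keystore can self-heal.
			timer.Reset(5 * time.Second)
		}
		keys, watchCh = store.RootKeyMetas()
		replicateKeys(keys)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	replicate(ctx, &FSM{state: &StateStore{
		abandonCh: make(chan struct{}),
		updateCh:  make(chan struct{}),
	}})
}
```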

@tgross tgross changed the title from "keyring: take handle to state inside replication loop" to "keyring: update handle to state inside replication loop" Nov 15, 2022
When keyring replication starts, we take a handle to the state store. But
whenever a snapshot is restored, this handle is invalidated and no longer points
to a state store that is receiving new keys. This leaks a bunch of memory too!

In addition to operator-initiated restores, when fresh servers are added to
existing clusters with large-enough state, keyring replication can start
quickly enough that it's running before the snapshot from the existing cluster
has been restored.

Fix this by updating the handle to the state store whenever the store's abandon
channel is closed. Refactor the query for key metadata to use blocking queries
for efficiency.
tgross commented Nov 15, 2022

@schmichael I've updated this to select over the abandon channel and to use a blocking query (to take advantage of the wait channel). This ends up being a little more verbose but more idiomatic with the rest of the code base.

tgross commented Nov 16, 2022

Bah, I need to rework this, as it caused a panic during shutdown somewhere in https://github.com/hashicorp/nomad/actions/runs/3474236511/jobs/5807491485. Edit: that panic was because of #15267, but it seems I'm still getting test failures because leadership is failing now? It's stable on main, so it must be a problem in this branch, but heck if I know why yet.

nomad/encrypter.go: review thread resolved (outdated)
tgross commented Nov 16, 2022

@schmichael I've backed out the blocking query change and just left refreshing the State() call in the loop (and cleaned up some of the rest of the code in the process). There's something I'm missing with blocking queries: everything would work as I'd expect -- the blocking query would return on either an update or the context timeout -- but then it'd randomly get into a state where it entered a tight spin loop and ignored the context timeout entirely. This didn't seem to be correlated with the snapshot restore either.

This tight loop was destabilizing the whole test suite, as it was causing enough load that leadership was failing. In the interest of shipping a critical fix I'm using the rate-limited query, and I'll see about revisiting blocking queries in later work -- I have a sneaking suspicion there may be a bug in our blocking query implementation that's hard to hit in RPCs. You've got a request-changes on this so I'll need your 👍 here to ship.
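For reference, the rate-limited shape lands roughly here, again sketched with simplified stand-in types around golang.org/x/time/rate (the rate and burst values are made up for illustration):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Stand-ins for the real FSM and state store types.
type StateStore struct{ keys []string }

func (s *StateStore) RootKeyMetas() []string { return s.keys }

type FSM struct{ state *StateStore }

func (f *FSM) State() *StateStore { return f.state }

func replicate(ctx context.Context, fsm *FSM) {
	// Pace the polling loop rather than blocking on a watchset.
	limiter := rate.NewLimiter(rate.Limit(10), 10) // made-up rate/burst values
	for {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled or deadline exceeded
		}
		store := fsm.State() // refreshed every pass, so restores are picked up
		fmt.Println("known keys:", store.RootKeyMetas())
		// ...fetch any key material missing from the local keystore via RPC...
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	replicate(ctx, &FSM{state: &StateStore{keys: []string{"key-a"}}})
}
```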

@shoenig shoenig left a comment


LGTM! goto -= 3 🎉

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2023