-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node pool: node pool upsert on multiregion node register #17503
Conversation
When registering a node with a new node pool in a non-authoritative region we can't create the node pool because this new pool will not be replicated to other regions. This commit modifies the node registration logic to only allow automatic node pool creation in the authoritative region. In non-authoritative regions, the client is registered, but the node pool is not created. The client is kept in the `initialing` status until its node pool is created in the authoritative region and replicated to the client's region.
@@ -244,143 +244,58 @@ func TestFSM_UpsertNode_Canonicalize_Ineligible(t *testing.T) { | |||
func TestFSM_UpsertNode_NodePool(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that I basically had the same test cases as in TestStateStore_UpsertNode_NodePool
here, and that seem quite excessive and unnecessary, so I updated this test to only exercise the logic specific to the FSM (node canonicalization and setting the upsert options).
nomad/node_endpoint.go
Outdated
// ┌────────────────────────────────────── No ───┐ | ||
// │ │ | ||
// ┌──▼───┐ ┌─────────────┐ ┌────────┴────────┐ | ||
// ── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │ | ||
// └──▲───┘ └─────┬───────┘ └────────┬────────┘ | ||
// │ │ │ | ||
// ready └─ No ─┐ ┌─────── Yes ──┘ | ||
// │ └─ No ─┐ ┌─────── Yes ──┘ | ||
// │ │ │ | ||
// ┌──────┴───────┐ ┌──▼──▼─┐ ┌──────┐ | ||
// │ ┌────────▼──▼───────┐ | ||
// │ │ Node pool exists? │ | ||
// │ └─────────┬─────────┘ | ||
// │ │ | ||
// ready Yes | ||
// │ │ | ||
// ┌──────┴───────┐ ┌───▼───┐ ┌──────┐ | ||
// │ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │ | ||
// └──────────────┘ └───▲───┘ └──┬───┘ | ||
// │ │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how it's supposed to look like 😅
┌────────────────────────────────────── No ───┐
│ │
┌──▼───┐ ┌─────────────┐ ┌────────┴────────┐
── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │
└──▲───┘ └─────┬───────┘ └────────┬────────┘
│ │ │
│ └─ No ─┐ ┌─────── Yes ──┘
│ │ │
│ ┌────────▼──▼───────┐
│ │ Node pool exists? │
│ └─────────┬─────────┘
│ │
ready Yes
│ │
┌──────┴───────┐ ┌───▼───┐ ┌──────┐
│ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │
└──────────────┘ └───▲───┘ └──┬───┘
│ │
└──── ready ─────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops...just realized that I forgot an arrow 😬
┌────────────────────────────────────── No ───┐
│ │
┌──▼───┐ ┌─────────────┐ ┌────────┴────────┐
── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │
└──▲──▲┘ └─────┬───────┘ └────────┬────────┘
│ │ │ │
│ │ └─ No ─┐ ┌─────── Yes ──┘
│ │ │ │
│ │ ┌────────▼──▼───────┐
│ └──────────No───┤ Node pool exists? │
│ └─────────┬─────────┘
│ │
ready Yes
│ │
┌──────┴───────┐ ┌───▼───┐ ┌──────┐
│ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │
└──────────────┘ └───▲───┘ └──┬───┘
│ │
└──── ready ─────┘
@@ -896,15 +905,17 @@ func (s *StateStore) ScalingEventsByJob(ws memdb.WatchSet, namespace, jobID stri | |||
// UpsertNode is used to register a node or update a node definition | |||
// This is assumed to be triggered by the client, so we retain the value | |||
// of drain/eligibility which is set by the scheduler. | |||
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node) error { | |||
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node, opts ...NodeUpsertOption) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if a variadic here is the best option. I picked this for a few reasons:
StateStore.UpsertNode()
is called in several places (mostly tests, but still), so we will need to update it everywhere (kind of a lazy argument 😅)- Prevent random boolean arguments that may be hard to know what they represent from callers side.
Reasons I don't like it:
- It's kind of weird since no other function here uses this pattern.
- It only represents a boolean, so it may be too restrictive to expand this to other use cases.
One alternative implementation I had at some point was to have two methods: UpserNode()
, which doesn't create a node pool, and UpsertNodeWithNodePool()
which always creates a node pool if necessary. Then it's up to the caller (the FSM in this case) to call the appropriate function for their use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's kind of weird since no other function here uses this pattern.
True, but we should do this kind of thing more often. So I'm 👍 on that.
- It only represents a boolean, so it may be too restrictive to expand this to other use cases.
True, but the type isn't serialized anywhere, so it's easy enough to change later. I think this will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, thanks!
// CreateNodePool is used to indicate that the node's node pool should be | ||
// create along with the node registration if it doesn't exist. | ||
CreateNodePool bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think defaulting to false
is the safe option here because it will keep the node in the initializing
status with the node event as a hint to users.
If this were default to true
(like, SkipNodePoolCreation
) we would risk placing a node in a node pool that is not replicated to other regions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
// Keep node in the initialing status if it's in a node pool that | ||
// doesn't exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this design keeps the initializing state transition on the Node.UpdateStatus
, because we know that's going to regularly heartbeat so it'll switch to ready fairly quickly once the new pool has been created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it makes things more clear to allow the existing heartbeat mechanism to recheck the node pool status, since the node pool creation will also trigger a node status transition.
@@ -896,15 +905,17 @@ func (s *StateStore) ScalingEventsByJob(ws memdb.WatchSet, namespace, jobID stri | |||
// UpsertNode is used to register a node or update a node definition | |||
// This is assumed to be triggered by the client, so we retain the value | |||
// of drain/eligibility which is set by the scheduler. | |||
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node) error { | |||
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node, opts ...NodeUpsertOption) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's kind of weird since no other function here uses this pattern.
True, but we should do this kind of thing more often. So I'm 👍 on that.
- It only represents a boolean, so it may be too restrictive to expand this to other use cases.
True, but the type isn't serialized anywhere, so it's easy enough to change later. I think this will work.
When registering a node with a new node pool in a non-authoritative region we can't create the node pool because this new pool will not be replicated to other regions.
This commit modifies the node registration logic to only allow automatic node pool creation in the authoritative region.
In non-authoritative regions, the client is registered, but the node pool is not created. The client is kept in the
initialing
status until its node pool is created in the authoritative region and replicated to the client's region.