Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node pool: node pool upsert on multiregion node register #17503

Merged
merged 2 commits into from
Jun 13, 2023

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Jun 12, 2023

When registering a node with a new node pool in a non-authoritative region we can't create the node pool because this new pool will not be replicated to other regions.

This commit modifies the node registration logic to only allow automatic node pool creation in the authoritative region.

In non-authoritative regions, the client is registered, but the node pool is not created. The client is kept in the initialing status until its node pool is created in the authoritative region and replicated to the client's region.

When registering a node with a new node pool in a non-authoritative
region we can't create the node pool because this new pool will not be
replicated to other regions.

This commit modifies the node registration logic to only allow automatic
node pool creation in the authoritative region.

In non-authoritative regions, the client is registered, but the node
pool is not created. The client is kept in the `initialing` status until
its node pool is created in the authoritative region and replicated to
the client's region.
@@ -244,143 +244,58 @@ func TestFSM_UpsertNode_Canonicalize_Ineligible(t *testing.T) {
func TestFSM_UpsertNode_NodePool(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that I basically had the same test cases as in TestStateStore_UpsertNode_NodePool here, and that seem quite excessive and unnecessary, so I updated this test to only exercise the logic specific to the FSM (node canonicalization and setting the upsert options).

Comment on lines 515 to 532
// ┌────────────────────────────────────── No ───┐
// │ │
// ┌──▼───┐ ┌─────────────┐ ┌────────┴────────┐
// ── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │
// └──▲───┘ └─────┬───────┘ └────────┬────────┘
// │ │ │
// ready └─ No ─┐ ┌─────── Yes ──┘
// └─ No ─┐ ┌─────── Yes ──┘
// │ │ │
// ┌──────┴───────┐ ┌──▼──▼─┐ ┌──────┐
// │ ┌────────▼──▼───────┐
// │ │ Node pool exists? │
// │ └─────────┬─────────┘
// │ │
// ready Yes
// │ │
// ┌──────┴───────┐ ┌───▼───┐ ┌──────┐
// │ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │
// └──────────────┘ └───▲───┘ └──┬───┘
// │ │
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how it's supposed to look like 😅

                ┌────────────────────────────────────── No ───┐
                │                                             │
             ┌──▼───┐          ┌─────────────┐       ┌────────┴────────┐
── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │
             └──▲───┘          └─────┬───────┘       └────────┬────────┘
                │                    │                        │
                │                    └─ No ─┐  ┌─────── Yes ──┘
                │                           │  │
                │                  ┌────────▼──▼───────┐
                │                  │ Node pool exists? │
                │                  └─────────┬─────────┘
                │                            │
              ready                         Yes
                │                            │
         ┌──────┴───────┐                ┌───▼───┐         ┌──────┐
         │ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │
         └──────────────┘                └───▲───┘         └──┬───┘
                                             │                │
                                             └──── ready ─────┘

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops...just realized that I forgot an arrow 😬

                ┌────────────────────────────────────── No ───┐
                │                                             │
             ┌──▼───┐          ┌─────────────┐       ┌────────┴────────┐
── Register ─► init ├─ ready ──► Has allocs? ├─ Yes ─► Allocs updated? │
             └──▲──▲┘          └─────┬───────┘       └────────┬────────┘
                │  │                 │                        │
                │  │                 └─ No ─┐  ┌─────── Yes ──┘
                │  │                        │  │
                │  │               ┌────────▼──▼───────┐
                │  └──────────No───┤ Node pool exists? │
                │                  └─────────┬─────────┘
                │                            │
              ready                         Yes
                │                            │
         ┌──────┴───────┐                ┌───▼───┐         ┌──────┐
         │ disconnected ◄─ disconnected ─┤ ready ├─ down ──► down │
         └──────────────┘                └───▲───┘         └──┬───┘
                                             │                │
                                             └──── ready ─────┘

@@ -896,15 +905,17 @@ func (s *StateStore) ScalingEventsByJob(ws memdb.WatchSet, namespace, jobID stri
// UpsertNode is used to register a node or update a node definition
// This is assumed to be triggered by the client, so we retain the value
// of drain/eligibility which is set by the scheduler.
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node) error {
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node, opts ...NodeUpsertOption) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if a variadic here is the best option. I picked this for a few reasons:

  1. StateStore.UpsertNode() is called in several places (mostly tests, but still), so we will need to update it everywhere (kind of a lazy argument 😅)
  2. Prevent random boolean arguments that may be hard to know what they represent from callers side.

Reasons I don't like it:

  1. It's kind of weird since no other function here uses this pattern.
  2. It only represents a boolean, so it may be too restrictive to expand this to other use cases.

One alternative implementation I had at some point was to have two methods: UpserNode(), which doesn't create a node pool, and UpsertNodeWithNodePool() which always creates a node pool if necessary. Then it's up to the caller (the FSM in this case) to call the appropriate function for their use case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's kind of weird since no other function here uses this pattern.

True, but we should do this kind of thing more often. So I'm 👍 on that.

  1. It only represents a boolean, so it may be too restrictive to expand this to other use cases.

True, but the type isn't serialized anywhere, so it's easy enough to change later. I think this will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, thanks!

Comment on lines +585 to +587
// CreateNodePool is used to indicate that the node's node pool should be
// create along with the node registration if it doesn't exist.
CreateNodePool bool
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think defaulting to false is the safe option here because it will keep the node in the initializing status with the node event as a hint to users.

If this were default to true (like, SkipNodePoolCreation) we would risk placing a node in a node pool that is not replicated to other regions.

@lgfa29 lgfa29 requested review from tgross and gulducat June 12, 2023 21:49
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Comment on lines +610 to +611
// Keep node in the initialing status if it's in a node pool that
// doesn't exist.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this design keeps the initializing state transition on the Node.UpdateStatus, because we know that's going to regularly heartbeat so it'll switch to ready fairly quickly once the new pool has been created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it makes things more clear to allow the existing heartbeat mechanism to recheck the node pool status, since the node pool creation will also trigger a node status transition.

@@ -896,15 +905,17 @@ func (s *StateStore) ScalingEventsByJob(ws memdb.WatchSet, namespace, jobID stri
// UpsertNode is used to register a node or update a node definition
// This is assumed to be triggered by the client, so we retain the value
// of drain/eligibility which is set by the scheduler.
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node) error {
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node, opts ...NodeUpsertOption) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's kind of weird since no other function here uses this pattern.

True, but we should do this kind of thing more often. So I'm 👍 on that.

  1. It only represents a boolean, so it may be too restrictive to expand this to other use cases.

True, but the type isn't serialized anywhere, so it's easy enough to change later. I think this will work.

@lgfa29 lgfa29 merged commit 5db9e64 into main Jun 13, 2023
14 of 15 checks passed
@lgfa29 lgfa29 deleted the f-node-pools-node-register-multiregion branch June 13, 2023 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/node-pools Issues related to node pools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants