Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch node deregistration #5784

Merged
merged 34 commits into from
Jul 10, 2019
Merged

batch node deregistration #5784

merged 34 commits into from
Jul 10, 2019

Conversation

langmartin
Copy link
Contributor

@langmartin langmartin commented Jun 5, 2019

Change node deregistration to operate on a batch of node is, partitioned into maxIdsPerReap slices. The user-facing API submits a batch of 1.

  • Changelog

@langmartin langmartin requested a review from dadgar June 5, 2019 19:33
Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compatibility issue needs addressing first and will follow up with more feedback there.

nomad/structs/structs.go Show resolved Hide resolved
nomad/state/state_store_test.go Show resolved Hide resolved
nomad/util.go Outdated Show resolved Hide resolved
nomad/util.go Outdated Show resolved Hide resolved
Copy link
Contributor

@endocrimes endocrimes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be another raft message type, otherwise great start 👍

command/agent/node_endpoint.go Outdated Show resolved Hide resolved
nomad/util.go Outdated Show resolved Hide resolved
nomad/core_sched.go Outdated Show resolved Hide resolved
Copy link
Contributor

@endocrimes endocrimes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions about backwards compatibility and regression testing

nomad/node_endpoint.go Show resolved Hide resolved
nomad/node_endpoint.go Outdated Show resolved Hide resolved
// DeleteNode deregisters a batch of nodes
func (s *StateStore) DeleteNode(index uint64, nodes []string) error {
if len(nodes) == 0 {
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this potentially return an error? - It probably signifies a programmer error, as an empty node-id should previously have returned a node lookup failed error.

nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
nomad/fsm.go Outdated Show resolved Hide resolved
command/agent/node_endpoint.go Outdated Show resolved Hide resolved
nomad/core_sched.go Outdated Show resolved Hide resolved
nomad/fsm.go Outdated Show resolved Hide resolved
nomad/fsm.go Outdated Show resolved Hide resolved
nomad/fsm.go Outdated
@@ -191,6 +191,8 @@ func (n *nomadFSM) Apply(log *raft.Log) interface{} {
return n.applyUpsertNode(buf[1:], log.Index)
case structs.NodeDeregisterRequestType:
return n.applyDeregisterNode(buf[1:], log.Index)
case structs.NodeDeregisterBatchRequestType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these sorted by their value? If so, probably belong to the bottom?

nomad/fsm.go Outdated Show resolved Hide resolved
@@ -54,6 +54,7 @@ type MessageType uint8
const (
NodeRegisterRequestType MessageType = iota
NodeDeregisterRequestType
NodeDeregisterBatchRequestType // QUESTION does iota make it important to append this list?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this needs to be appended at the end. golang compiler would assign 0 to iota and then increments for subsequent lines. See https://play.golang.org/p/7ZqlecuVJHW and https://golang.org/ref/spec#Iota about iota and constants.

In raft, we store the message type uint8 in messages, so these order here must remain unaltered; otherwise, we will not interpret old raft messages properly on upgrade. Probably belongs to raft notes too.

type NodeDeregisterRequest struct {
NodeID string
WriteRequest
}

// NodeDeregisterBatchRequest is used for Node.DeregisterBatch endpoint
// to deregister a batch of nodes from being schedulable entities.
type NodeDeregisterBatchRequest struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find our naming conventions here odd, Batch seems to appear inconsistently throughout the file: BatchNodeUpdateDrainRequest, JobBatchDeregisterRequest, and NodeDeregisterBatchRequest :(. The same goes for the fsm method handler name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. I'll rename it to avoid introducing a third pattern at least

command/agent/node_endpoint.go Outdated Show resolved Hide resolved
@@ -316,12 +317,20 @@ type NodeRegisterRequest struct {
}

// NodeDeregisterRequest is used for Node.Deregister endpoint
// to deregister a node as being a schedulable entity.
// to deregister a batch of nodes from being schedulable entities.
// Deprecated in 0.9.3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we still use it in http agent nodePurge, we might want to qualify its deprecation and scope it to internal scheduler work.

Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the many back and forth - noticed one more compatibility issue :(.

nomad/core_sched.go Show resolved Hide resolved
@@ -316,12 +317,19 @@ type NodeRegisterRequest struct {
}

// NodeDeregisterRequest is used for Node.Deregister endpoint
// to deregister a node as being a schedulable entity.
// to deregister a batch of nodes from being schedulable entities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment here doesn't need to be updated

func (n *Node) Deregister(args *structs.NodeDeregisterRequest, reply *structs.NodeUpdateResponse) error {
if done, err := n.srv.forward("Node.Deregister", args, args, reply); done {
return n.BatchDeregister(&structs.NodeBatchDeregisterRequest{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can safely forward to Node.BatchDeregister here when servers are running mixed fleet. Consider a 0.9.4 nomad server follower workin against 0.9.3 leader: the leader wouldn't be able to handle the call.

I think it's better to leave the old endpoints untouched and only add new ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, that makes a lot of sense

@langmartin langmartin requested a review from notnoop June 27, 2019 19:55
if done, err := n.srv.forward("Node.Deregister", args, args, reply); done {
return err
}

// If we're handling the request locally on the leader, operate on a batch of size 1
return n.batchDeregister(&structs.NodeBatchDeregisterRequest{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( I don't think this is right yet. Even if we are the leader, it's not safe to use the new raft message type until we confirm that all servers are at least 0.9.4. I would suggest keeping old handling untouched without it depending on batchDeregister (module some function extraction as preferred).

Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with minor stylistic suggestion. Merge away. Thank you so much for following this through.

// deregister takes a raftMessage closure, to support both Deregister and BatchDeregister
func (n *Node) deregister(raftMessage func() (interface{}, uint64, error),
args *structs.NodeBatchDeregisterRequest,
reply *structs.NodeUpdateResponse) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest ordering them in the following args, reply, raftApplyFn. It's a bit more consistent with rest of functions, and it's a bit nicer visually for the lambda function to appear last. Also, calling raftApplyFn or applyRaftMsgFn probably captures meaning better.

@langmartin langmartin force-pushed the b-batch-node-dereg branch 4 times, most recently from 78736ae to 4ced779 Compare July 8, 2019 16:31
@langmartin langmartin merged commit 8da0b91 into master Jul 10, 2019
@langmartin langmartin deleted the b-batch-node-dereg branch July 10, 2019 18:25
@github-actions
Copy link

github-actions bot commented Feb 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants