Drain v2: add controlled draining #4010

Merged
merged 79 commits into from
Mar 22, 2018
79 commits
95b3b6e
drain: initial drainv2 structs and impl
schmichael Jan 24, 2018
587d4e2
testlog: override testlogger with envvar
schmichael Feb 16, 2018
91e8fd0
mock_driver: improve Kill() logging
schmichael Feb 20, 2018
48d637d
RPC, FSM, State Store for marking DesiredTransistion
dadgar Feb 21, 2018
7deabe9
drainer: switch to job based watching
schmichael Feb 22, 2018
832b1d5
switch to new raft DesiredTransition message
schmichael Feb 23, 2018
a466f97
scheduler: migrate non-terminal migrating allocs
schmichael Feb 24, 2018
116c28c
improve drain fsm/statestore tests
schmichael Feb 26, 2018
1773de9
Node.Drain takes strategy
dadgar Feb 23, 2018
2bdeace
Drain cli, api, http
dadgar Feb 23, 2018
762db7c
Fix tests
dadgar Feb 26, 2018
fba20fd
Remove update time
dadgar Feb 26, 2018
a7833bc
Upgrade path
dadgar Feb 27, 2018
5c101de
flag comment
dadgar Feb 27, 2018
dcafa8b
RPC/FSM/State Store for Eligibility
dadgar Feb 27, 2018
0fb9ba7
HTTP and API
dadgar Feb 27, 2018
378c566
node eligibility command
dadgar Feb 27, 2018
d6399cb
Add eligibility to node view
dadgar Feb 27, 2018
451b77d
Unblock evals once eligible
dadgar Feb 27, 2018
a96c337
Fix retaining the drain
dadgar Feb 27, 2018
d65ae92
Small refactor and cleanups
dadgar Feb 27, 2018
7d58209
code review
dadgar Feb 27, 2018
5be3263
refactor drainer into a subpkg
schmichael Feb 27, 2018
9de8908
drainer: drainer should shutdown with server
schmichael Feb 27, 2018
57c0335
Remove unused context
schmichael Feb 27, 2018
f2de735
Restart every time SetEnabled(true) is called
schmichael Feb 27, 2018
678fbe1
drainer: factor job & node watchers out of drainer.go
schmichael Feb 27, 2018
3b25f78
drainer: convert fsm errors to go errors
schmichael Feb 27, 2018
3fe3c6e
Improve DeadlineTime helper
dadgar Mar 1, 2018
3ca9cdf
client: don't monitor health of non-service jobs
schmichael Feb 27, 2018
1f73cd5
drainer: refactor newStopAllocs, applyMigrations
schmichael Mar 1, 2018
4782098
refactor main drainloop into 2 more methods
schmichael Mar 1, 2018
7f98949
Correct defaulting
dadgar Mar 1, 2018
a027016
Fix file names
dadgar Mar 1, 2018
c00c02d
System test runs on mac
dadgar Mar 1, 2018
6026af2
Initial design
dadgar Mar 2, 2018
e566fcd
drain heap
dadgar Mar 2, 2018
da36810
node watcher
dadgar Mar 3, 2018
d45532d
Node's being untracked or having updated deadlines, updates the deadl…
dadgar Mar 3, 2018
0e51b20
job watcher
dadgar Mar 6, 2018
cec2c5a
Drainer
dadgar Mar 6, 2018
c035422
integration test and basic fixes
dadgar Mar 7, 2018
fb40e8b
handle empty node case
dadgar Mar 7, 2018
4b4e234
Comments
dadgar Mar 7, 2018
bd70197
spelling fixes
dadgar Mar 8, 2018
5b36af9
code review
dadgar Mar 8, 2018
d153714
Toggle Drain allows resetting eligibility
dadgar Mar 8, 2018
efb6601
Switch to drainerv2 impl
schmichael Mar 8, 2018
45e7e88
Fix deadline handling
dadgar Mar 12, 2018
ad2f211
Batch drain update
dadgar Mar 9, 2018
5324e56
sharding
dadgar Mar 10, 2018
270699b
fix comment
dadgar Mar 14, 2018
fb6c821
Fix node eligibility test
schmichael Mar 6, 2018
6347bae
Add DesiredTransition.ShouldMigrate to api pkg
schmichael Mar 7, 2018
11d0eae
Monitor node drains until completion in CLI
schmichael Mar 6, 2018
e669e82
Improve drain log messages
schmichael Mar 16, 2018
0a1f1d2
Fix deadline heap triggering
schmichael Mar 9, 2018
8217ebf
drainer: RegisterJob -> RegisterJobs
schmichael Mar 10, 2018
74dc8fd
JobNs -> NamespacedID
schmichael Mar 19, 2018
8ef7863
Deregister garbage collected jobs
schmichael Mar 19, 2018
e003b05
Remove debug prints
schmichael Mar 19, 2018
08c9116
Refactor assertOps into a helper func
schmichael Mar 19, 2018
98935b8
fix race in drain integration tests
dadgar Mar 19, 2018
4efbc34
rpcapi: remove; unused
schmichael Mar 20, 2018
aab1fb7
Fix linting errors
schmichael Mar 20, 2018
2bb1874
api: fix tests to expect default migrate strategy
schmichael Mar 20, 2018
1537061
alloc_runner: watch health for deployed batch jobs
schmichael Mar 20, 2018
9b88749
mock: add BatchJob() helper
schmichael Mar 20, 2018
17161ec
tests: use mock.BatchJob to fix tests
schmichael Mar 20, 2018
8088562
test: don't call t.Fatal from within a goroutine
schmichael Mar 20, 2018
50a94d7
test: try to prevent flakiness on travis
schmichael Mar 20, 2018
b8b1922
test: fix by using mock.BatchJob
schmichael Mar 20, 2018
e8673b1
test: disable drain during fsm test
schmichael Mar 20, 2018
6366938
test: disable node drainer during tests
schmichael Mar 20, 2018
ec09ea6
test: must initialize jobResults with new func
schmichael Mar 20, 2018
b58a22c
remove spurious TODOs and FIXMEs
schmichael Mar 21, 2018
07fe879
test: index no longer guaranteed on job list
schmichael Mar 21, 2018
3496bcf
docs: improve DrainRequest.MarkEligible comment
schmichael Mar 21, 2018
e10883c
eligbile -> eligible
schmichael Mar 21, 2018
15 changes: 15 additions & 0 deletions api/allocations.go
@@ -81,6 +81,7 @@ type Allocation struct {
Metrics *AllocationMetric
DesiredStatus string
DesiredDescription string
DesiredTransition DesiredTransition
ClientStatus string
ClientDescription string
TaskStates map[string]*TaskState
@@ -205,3 +206,17 @@ type RescheduleEvent struct {
// PrevNodeID is the node ID of the previous allocation
PrevNodeID string
}

// DesiredTransition is used to mark an allocation as having a desired state
// transition. This information can be used by the scheduler to make the
// correct decision.
type DesiredTransition struct {
// Migrate is used to indicate that this allocation should be stopped and
// migrated to another node.
Migrate *bool
}

// ShouldMigrate returns whether the transition object dictates a migration.
func (d DesiredTransition) ShouldMigrate() bool {
return d.Migrate != nil && *d.Migrate
}
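
The pointer-typed `Migrate` field gives the transition three states (unset, explicitly false, explicitly true), and `ShouldMigrate` collapses the first two into `false`. A minimal standalone sketch of that behavior follows; the struct and method are copied from the diff above, while `boolToPtr` is a local stand-in for `helper.BoolToPtr` so the example builds without the Nomad module.

```go
package main

import "fmt"

// DesiredTransition mirrors the struct added to api/allocations.go.
type DesiredTransition struct {
	// Migrate is nil when no transition has been requested.
	Migrate *bool
}

// ShouldMigrate reports true only when Migrate is set and true.
func (d DesiredTransition) ShouldMigrate() bool {
	return d.Migrate != nil && *d.Migrate
}

// boolToPtr is a local stand-in for helper.BoolToPtr (assumption: same behavior).
func boolToPtr(b bool) *bool { return &b }

func main() {
	unset := DesiredTransition{}
	no := DesiredTransition{Migrate: boolToPtr(false)}
	yes := DesiredTransition{Migrate: boolToPtr(true)}

	// A nil pointer and an explicit false both mean "do not migrate".
	fmt.Println(unset.ShouldMigrate(), no.ShouldMigrate(), yes.ShouldMigrate())
}
```

Using a value receiver means the zero value `DesiredTransition{}` is safe to call, which is what the second assertion in `TestAllocations_ShouldMigrate` relies on.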
7 changes: 7 additions & 0 deletions api/allocations_test.go
@@ -239,3 +239,10 @@ func TestAllocations_RescheduleInfo(t *testing.T) {
}

}

func TestAllocations_ShouldMigrate(t *testing.T) {
t.Parallel()
require.True(t, DesiredTransition{Migrate: helper.BoolToPtr(true)}.ShouldMigrate())
require.False(t, DesiredTransition{}.ShouldMigrate())
require.False(t, DesiredTransition{Migrate: helper.BoolToPtr(false)}.ShouldMigrate())
}
1 change: 1 addition & 0 deletions api/jobs.go
@@ -559,6 +559,7 @@ type Job struct {
ParameterizedJob *ParameterizedJobConfig
Payload []byte
Reschedule *ReschedulePolicy
Migrate *MigrateStrategy
Meta map[string]string
VaultToken *string `mapstructure:"vault_token"`
Status *string
32 changes: 15 additions & 17 deletions api/jobs_test.go
@@ -12,41 +12,34 @@ import (
     "github.com/hashicorp/nomad/testutil"
     "github.com/kr/pretty"
     "github.com/stretchr/testify/assert"
+    "github.com/stretchr/testify/require"
 )

 func TestJobs_Register(t *testing.T) {
     t.Parallel()
+    require := require.New(t)
+
     c, s := makeClient(t, nil, nil)
     defer s.Stop()
     jobs := c.Jobs()

     // Listing jobs before registering returns nothing
     resp, qm, err := jobs.List(nil)
-    if err != nil {
-        t.Fatalf("err: %s", err)
-    }
+    require.Nil(err)
     assertQueryMeta(t, qm)
-    if n := len(resp); n != 0 {
-        t.Fatalf("expected 0 jobs, got: %d", n)
-    }
+    require.Emptyf(resp, "expected 0 jobs, got: %d", len(resp))

     // Create a job and attempt to register it
     job := testJob()
     resp2, wm, err := jobs.Register(job, nil)
-    if err != nil {
-        t.Fatalf("err: %s", err)
-    }
-    if resp2 == nil || resp2.EvalID == "" {
-        t.Fatalf("missing eval id")
-    }
+    require.Nil(err)
+    require.NotNil(resp2)
+    require.NotEmpty(resp2.EvalID)
     assertWriteMeta(t, wm)

     // Query the jobs back out again
-    resp, qm, err = jobs.List(nil)
-    if err != nil {
-        t.Fatalf("err: %s", err)
-    }
-    assertQueryMeta(t, qm)
+    resp, _, err = jobs.List(nil)
+    require.Nil(err)

// Check that we got the expected response
if len(resp) != 1 || resp[0].ID != *job.ID {
@@ -141,6 +134,7 @@ func TestJobs_Canonicalize(t *testing.T) {
MaxDelay: helper.TimeToPtr(1 * time.Hour),
Unlimited: helper.BoolToPtr(true),
},
Migrate: DefaultMigrateStrategy(),
Tasks: []*Task{
{
KillTimeout: helper.TimeToPtr(5 * time.Second),
@@ -211,6 +205,7 @@ func TestJobs_Canonicalize(t *testing.T) {
MaxDelay: helper.TimeToPtr(1 * time.Hour),
Unlimited: helper.BoolToPtr(true),
},
Migrate: DefaultMigrateStrategy(),
Tasks: []*Task{
{
Name: "task1",
@@ -363,6 +358,7 @@ func TestJobs_Canonicalize(t *testing.T) {
AutoRevert: helper.BoolToPtr(false),
Canary: helper.IntToPtr(0),
},
Migrate: DefaultMigrateStrategy(),
Tasks: []*Task{
{
Name: "redis",
@@ -576,6 +572,7 @@ func TestJobs_Canonicalize(t *testing.T) {
AutoRevert: helper.BoolToPtr(true),
Canary: helper.IntToPtr(1),
},
Migrate: DefaultMigrateStrategy(),
Tasks: []*Task{
{
Name: "task1",
@@ -616,6 +613,7 @@ func TestJobs_Canonicalize(t *testing.T) {
AutoRevert: helper.BoolToPtr(false),
Canary: helper.IntToPtr(0),
},
Migrate: DefaultMigrateStrategy(),
Tasks: []*Task{
{
Name: "task1",
142 changes: 107 additions & 35 deletions api/nodes.go
@@ -3,8 +3,9 @@ package api
 import (
     "fmt"
     "sort"
-    "strconv"
     "time"
+
+    "github.com/hashicorp/nomad/nomad/structs"
 )

// Nodes is used to query node-related API endpoints
@@ -42,10 +43,57 @@ func (n *Nodes) Info(nodeID string, q *QueryOptions) (*Node, *QueryMeta, error)
return &resp, qm, nil
}

-// ToggleDrain is used to toggle drain mode on/off for a given node.
-func (n *Nodes) ToggleDrain(nodeID string, drain bool, q *WriteOptions) (*WriteMeta, error) {
-    drainArg := strconv.FormatBool(drain)
-    wm, err := n.client.write("/v1/node/"+nodeID+"/drain?enable="+drainArg, nil, nil, q)
+// NodeUpdateDrainRequest is used to update the drain specification for a node.
+type NodeUpdateDrainRequest struct {
+    // NodeID is the node to update the drain specification for.
+    NodeID string
+
+    // DrainSpec is the drain specification to set for the node. A nil DrainSpec
+    // will disable draining.
+    DrainSpec *DrainSpec
+
+    // MarkEligible marks the node as eligible for scheduling if removing
+    // the drain strategy.
+    MarkEligible bool
+}
+
+// UpdateDrain is used to update the drain strategy for a given node. If
+// markEligible is true and the drain is being removed, the node will be marked
+// as eligible for scheduling.
+func (n *Nodes) UpdateDrain(nodeID string, spec *DrainSpec, markEligible bool, q *WriteOptions) (*WriteMeta, error) {
+    req := &NodeUpdateDrainRequest{
+        NodeID:       nodeID,
+        DrainSpec:    spec,
+        MarkEligible: markEligible,
+    }
+
+    wm, err := n.client.write("/v1/node/"+nodeID+"/drain", req, nil, q)
+    if err != nil {
+        return nil, err
+    }
+    return wm, nil
+}

+// NodeUpdateEligibilityRequest is used to update the scheduling eligibility of a node.
+type NodeUpdateEligibilityRequest struct {
+    // NodeID is the node to update the scheduling eligibility for.
+    NodeID      string
+    Eligibility string
+}
+
+// ToggleEligibility is used to update the scheduling eligibility of the node
+func (n *Nodes) ToggleEligibility(nodeID string, eligible bool, q *WriteOptions) (*WriteMeta, error) {
+    e := structs.NodeSchedulingEligible
+    if !eligible {
+        e = structs.NodeSchedulingIneligible
+    }
+
+    req := &NodeUpdateEligibilityRequest{
+        NodeID:      nodeID,
+        Eligibility: e,
+    }
+
+    wm, err := n.client.write("/v1/node/"+nodeID+"/eligibility", req, nil, q)
     if err != nil {
         return nil, err
     }
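
`ToggleEligibility` takes a boolean from the caller but sends a string on the wire, so the mapping between the two is worth seeing in isolation. The sketch below reproduces that mapping without importing `nomad/structs`. The constant values `"eligible"` and `"ineligible"` are assumptions, since the diff only references the constants by name, and `eligibilityFor` is a hypothetical helper.

```go
package main

import "fmt"

// Stand-ins for structs.NodeSchedulingEligible and
// structs.NodeSchedulingIneligible. The string values are assumed.
const (
	NodeSchedulingEligible   = "eligible"
	NodeSchedulingIneligible = "ineligible"
)

// NodeUpdateEligibilityRequest mirrors the request type added to api/nodes.go.
type NodeUpdateEligibilityRequest struct {
	NodeID      string
	Eligibility string
}

// eligibilityFor is a hypothetical helper that performs the same
// bool-to-string mapping as ToggleEligibility.
func eligibilityFor(nodeID string, eligible bool) *NodeUpdateEligibilityRequest {
	e := NodeSchedulingEligible
	if !eligible {
		e = NodeSchedulingIneligible
	}
	return &NodeUpdateEligibilityRequest{NodeID: nodeID, Eligibility: e}
}

func main() {
	fmt.Println(eligibilityFor("node-1", true).Eligibility)
	fmt.Println(eligibilityFor("node-1", false).Eligibility)
}
```

Keeping the wire value a string rather than a bool leaves room for additional eligibility states later without changing the request shape.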
@@ -108,25 +156,48 @@ type DriverInfo struct {

// Node is used to deserialize a node entry.
type Node struct {
-    ID                string
-    Datacenter        string
-    Name              string
-    HTTPAddr          string
-    TLSEnabled        bool
-    Attributes        map[string]string
-    Resources         *Resources
-    Reserved          *Resources
-    Links             map[string]string
-    Meta              map[string]string
-    NodeClass         string
-    Drain             bool
-    Status            string
-    StatusDescription string
-    StatusUpdatedAt   int64
-    Events            []*NodeEvent
-    Drivers           map[string]*DriverInfo
-    CreateIndex       uint64
-    ModifyIndex       uint64
+    ID                    string
+    Datacenter            string
+    Name                  string
+    HTTPAddr              string
+    TLSEnabled            bool
+    Attributes            map[string]string
+    Resources             *Resources
+    Reserved              *Resources
+    Links                 map[string]string
+    Meta                  map[string]string
+    NodeClass             string
+    Drain                 bool
+    DrainStrategy         *DrainStrategy
+    SchedulingEligibility string
+    Status                string
+    StatusDescription     string
+    StatusUpdatedAt       int64
+    Events                []*NodeEvent
+    Drivers               map[string]*DriverInfo
+    CreateIndex           uint64
+    ModifyIndex           uint64
+}
+
+// DrainStrategy describes a Node's drain behavior.
+type DrainStrategy struct {
+    // DrainSpec is the user declared drain specification
+    DrainSpec
+
+    // ForceDeadline is the deadline time for the drain after which drains will
+    // be forced
+    ForceDeadline time.Time
+}
+
+// DrainSpec describes a Node's drain behavior.
+type DrainSpec struct {
+    // Deadline is the duration after StartTime when the remaining
+    // allocations on a draining Node should be told to stop.
+    Deadline time.Duration
+
+    // IgnoreSystemJobs allows system jobs to remain on the node even though it
+    // has been marked for draining.
+    IgnoreSystemJobs bool
}

const (
@@ -181,17 +252,18 @@ type HostDiskStats struct {
// NodeListStub is a subset of information returned during
// node list operations.
type NodeListStub struct {
-    Address           string
-    ID                string
-    Datacenter        string
-    Name              string
-    NodeClass         string
-    Version           string
-    Drain             bool
-    Status            string
-    StatusDescription string
-    CreateIndex       uint64
-    ModifyIndex       uint64
+    Address               string
+    ID                    string
+    Datacenter            string
+    Name                  string
+    NodeClass             string
+    Version               string
+    Drain                 bool
+    SchedulingEligibility string
+    Status                string
+    StatusDescription     string
+    CreateIndex           uint64
+    ModifyIndex           uint64
}

// NodeIndexSort reverse sorts nodes by CreateIndex