scheduler: recover from panic
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
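
The recovery pattern described above can be sketched in isolation. This is a minimal illustration, not the scheduler code itself: `Evaluation` and `process` are hypothetical stand-ins for `structs.Evaluation` and a scheduler's `Process` method, and the `work` callback is an assumption added so the example can trigger a panic on demand. The key detail is the named return value `err`, which lets the deferred function overwrite the result after a panic has unwound the call.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct {
	ID string
}

// process is a hypothetical stand-in for a scheduler's Process method.
// The named return value err is what allows the deferred closure to
// turn a recovered panic into an ordinary error for the caller.
func process(eval *Evaluation, work func(*Evaluation) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("processing eval %q panicked scheduler: %v", eval.ID, r)
		}
	}()

	// Any panic raised in here is caught by the deferred recover above.
	return work(eval)
}

func main() {
	// A panicking evaluation comes back as an error instead of
	// crashing the process.
	err := process(&Evaluation{ID: "eval-1"}, func(*Evaluation) error {
		panic("boom")
	})
	fmt.Println("panicking eval:", err)

	// A well-behaved evaluation is unaffected by the deferred recover.
	err = process(&Evaluation{ID: "eval-2"}, func(*Evaluation) error {
		return nil
	})
	fmt.Println("healthy eval:", err)
}
```

Note that `recover` only stops a panic in the goroutine that deferred it, which is why the commit places it at the top-level `Process` methods rather than deeper in the call chain.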
tgross committed Feb 4, 2022
1 parent e9ef2c0 commit 674a1b5
Showing 3 changed files with 18 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .changelog/12009.txt
@@ -0,0 +1,3 @@
```release-note:improvement
scheduler: recover scheduler goroutines on panic
```
9 changes: 8 additions & 1 deletion scheduler/generic_sched.go
@@ -125,7 +125,14 @@ func NewBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state Sta
}

// Process is used to handle a single evaluation
-func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
+func (s *GenericScheduler) Process(eval *structs.Evaluation) (err error) {
+
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler: %v", eval.ID, r)
+		}
+	}()
+
// Store the evaluation
s.eval = eval

8 changes: 7 additions & 1 deletion scheduler/scheduler_system.go
@@ -72,7 +72,13 @@ func NewSysBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state
}

// Process is used to handle a single evaluation.
-func (s *SystemScheduler) Process(eval *structs.Evaluation) error {
+func (s *SystemScheduler) Process(eval *structs.Evaluation) (err error) {
+
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler: %v", eval.ID, r)
+		}
+	}()

// Store the evaluation
s.eval = eval