Improve staleness logic by improving error handling and retrying a subset of errors #32

glaksh100 · 2019-06-18T23:09:18Z

This PR removes dependency on staleness duration. The application is now rolled back based on the error observed (and retried if applicable). It also makes error handling a bit more structured so that we can more easily specify retry-able errors.

Changelog

Adds an error_handler class that wraps the error interface and adds information such as error codes and method info to the error message
Defines a map of retryable error keys (method + error code)
Unifies all error logging in api.go to use the above error_handler
Adds a LastSeenError and a RetryCount to the FlinkApplicationStatus
Changes the logic within shouldRollback() in flink_state_machine.go to check if the LastSeenError can be retried and is within the maxRetries limit
Removes dependency on staleness duration

Note: go generate on Config.go seems to be failing with:

which pflags || (go get github.com/lyft/flytestdlib/cli/pflags)
# github.com/lyft/flytestdlib/promutils
../flytestdlib/promutils/workqueue.go:29:49: cannot use composite literal (type prometheusMetricsProvider) as type workqueue.MetricsProvider in argument to workqueue.SetProvider:
	prometheusMetricsProvider does not implement workqueue.MetricsProvider (missing NewDeprecatedAddsMetric method)
make: *** [gen-config] Error 2

glaksh100 · 2019-06-20T21:11:43Z

Many thanks to @anandswaminathan for helping debug and fix the pflags issue. For anyone who may run into this in their local dev:

make gen-config downloads flytestdlib to ~/src/go/blah... (it does a go get)
All references to flytestdlib are henceforth to your local flytestdlib version and NOT to the one under flinkk8soperator/vendor/github.com/lyft/flytestdlib.
TL;DR run dep ensure on your local version of flytestdlib and the problem goes away.

glaksh100 · 2019-06-20T21:54:08Z

@mwylde @anandswaminathan After various fixes, I think this PR is now in a state to be reviewed. PTAL when you have a chance.

mwylde

This is great! A few comments, but overall my biggest concern is that in the current design we have to be very careful to avoid applications getting stuck. For example, in some local testing of the PR I saw a case where killing the JM during savepointing caused the application to get stuck midway with

ts="2019-06-20T18:01:38-07:00" level=error msg="Check savepoint status failed with response {\"errors\":[\"Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@e0ff9448\"]}" app_name=operator-test-app-ha ns=default phase=Savepointing src="api.go:248"

This is especially dangerous because there's currently no way for a user to interrupt a deploy, so they will have to delete and recreate the application.

pkg/controller/flink/client/api.go

pkg/controller/flink/client/error_handler.go

pkg/controller/flinkapplication/flink_state_machine.go

anandswaminathan · 2019-06-21T18:13:55Z

@glaksh100

One thing I have noticed with both Operator SDK and Controller runtime is that, on occasion, callbacks can fire in a much faster loop, and not only at the resync period. I have tried debugging it, but there appear to be several edge cases where this can happen.

What this can translate to: you might exhaust your maxRetries in a very small interval. You will need to track time, and have logic like minTimeBetweenRetries or something, set to an acceptable level.

glaksh100 · 2019-06-25T21:44:38Z

@mwylde Thanks for the initial and the second review! I'm going to respond to comments individually and resolve ones that are outdated.

glaksh100 · 2019-06-25T22:28:32Z

Things I'm planning to address in the next pass:

Improving the exponential back off with jitter and a bound
Compile-check the error codes
Make the retryhandler parameters configurable and move them to the configMap

glaksh100 · 2019-06-27T23:13:46Z

Changelog: it's perhaps easier to look at the entire diff than at the diffs for the individual commits.

Refactored the retryable/failfast errors to not depend on a map at all. When the error is logged, the error itself dictates what category it belongs to along with its maxRetries.
Increased defaultRetries and also added jitter to the retry delay calculation
Added retry configurations to the configmap (basebackoff/maxbackoff/maxwaitonerror)

glaksh100 · 2019-07-02T17:49:44Z

As of yesterday, I've addressed all comments afaict. PTAL when you have a chance @mwylde @anandswaminathan

glaksh100 · 2019-07-08T17:51:16Z

PTAL @mwylde when you have a chance!

mwylde

👍 This is great!

glaksh100 · 2019-07-10T19:15:09Z

Thanks for the thorough review @mwylde!

glaksh100 requested review from anandswaminathan, kumare3 and mwylde as code owners June 18, 2019 23:09

lrao100 added 18 commits June 20, 2019 14:13

Merge conflicts

a896944

Merge conflicts

0b5c536

Fix lint, add comments

1b72891

Remove references to staleness duration

309c5de

Reset error to empty

92c1049

Revert unintended changes

8d16b51

Revert unintended changes

9a30fa5

Fixes

179380b

Fix typo

551cd23

Fix imports-ed

370d64b

Fix error condition during first deploy

11c0341

Update error handling

fde1dde

Fix integration test and manually update config flags for clean build

25d90c8

Fix integration test

e5b1f56

Fix unit tests with updates

a161d31

Add more unit tests

c542e37

Actually generate

d676d8c

Remove staleness config from integ

1fe6929

glaksh100 force-pushed the improve-staleness-logic branch from a48ec6a to 1fe6929 Compare June 20, 2019 21:16

Fix space

c207e36

mwylde suggested changes Jun 21, 2019

lrao100 added 3 commits June 24, 2019 08:16

First pass at review comments

10c48df

Second pass at review comments

0cd10d4

Improve error codes based on local testing

1808750

mwylde reviewed Jun 25, 2019

lrao100 added 2 commits June 26, 2019 17:38

Improving backoff with jitter and increasing default retries

3f85790

Refactor to make the retry checks simpler

60919ed

lrao100 added 3 commits June 28, 2019 14:13

Use handle() to retry instead of sleeping in the goroutine

3d245d9

Fix deep copy gen

d102c02

Fix lint

d8de801

mwylde reviewed Jul 2, 2019

pkg/controller/flinkapplication/flink_state_machine.go (outdated, resolved)

mwylde reviewed Jul 2, 2019

pkg/controller/flinkapplication/flink_state_machine.go (resolved)

mwylde reviewed Jul 2, 2019

lrao100 added 2 commits July 3, 2019 10:03

Separate methods for retry and failfast methods, add retries for submitjob, default time based application staleness, no error handling in running phase

85081ed

Fix error code on submit job

d421422

lrao100 added 2 commits July 9, 2019 15:15

Simplify error types to 1

9aaa9f2

Fix integration test

057d411

mwylde approved these changes Jul 10, 2019

lrao100 added 3 commits July 10, 2019 10:14

Update local config

29d6ad3

Resolve conflicts

741ab17

Update integ direct mode config

9b59d32

glaksh100 merged commit 8d847d7 into master Jul 10, 2019
