Add taint feature to auto replace tainted node #1581
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
Could you make sure you are running make fmt lint?
d9d7e02 to 7a30f2f
a3affc1 to b20161c
9bee145 to dabc93d
fca2015 to 1edf305
Because the PR build does not have the new CRD yet, the taint-related e2e tests will always fail.
Some taint tests are sensitive to the execution time of the test case. In the PR test environment, tests take longer than in the dev environment, which causes failures in PR tests that are not reproducible in the dev environment.
Mark the flapping node test as a flaky test as well.
Notable changes:
1. When both an exact-match key and a wildcard key match a node taint key, the exact-match key takes precedence and the wildcard key has no effect on that node taint key. Added a test to verify this.
2. Refactor replace_failed_processgroups_test.go.
3. Change ClusterRoleBinding to RoleBinding.
4. Coding style and various cosmetic improvements.
Once we revert the PREVIOUS_FDB_VERSION and remove/change the fmt.Print statements in our e2e test framework, I'm okay with merging those changes to make sure we can move forward. We should note in our docs for this feature that it's currently experimental, along with the requirements (node access) and the use case.
Changes are good. We should focus on finishing the open tasks and add documentation for this feature 👍
Description
Fixes: #507
When a node is tainted with a configured key, the operator marks the pods on that node as NodeTaintDetected. If a pod stays in NodeTaintDetected for long enough, the operator marks the pod as unhealthy and replaces it through its replacement logic.
SREs can configure the taint keys the operator should react to. Each taint key has a taint duration, which defines how long a pod must stay in the tainted state before it is marked for replacement.
Change sets include
This test focuses on testing the taint logic in isolation;
This test focuses on testing the interaction between the taint function and other reconciliation loops.
Type of change
Testing
Unit tests and end-to-end test on EKS.
Do we need to perform additional testing once this is merged, or perform in a larger testing environment?
No.
Documentation
TODO: Update user manual and design doc.
Did you update relevant documentation within this repository?
If this change is adding new functionality, do we need to describe it in our user manual?
If this change is adding or removing subreconcilers, have we updated the core technical design doc to reflect that?
If this change is adding new safety checks or new potential failure modes, have we documented them and how to debug potential issues?
Follow-up
Future work: