Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add taint feature to auto replace tainted node #1581

Merged
merged 47 commits into from
May 11, 2023

Conversation

xumengpanda
Copy link
Contributor

@xumengpanda xumengpanda commented Apr 10, 2023

Description

Fixes: #507

When a node is tainted by a key, operator will mark pods on the node as NodeTaintDetected. If a pods stay in NodeTaintDetected for long enough, operator will mark the pod as unhealthy and replace it with replacement logic.

SRE can configure the tainted keys operator should react to. Each tainted key has a tainted duration which defines how long the pod should stay in tainted mode before it is marked for replacement.

Change sets include

  • Set logger to output to stdout for replacements_test;
  • Add k8s node concept to unit test framework so that we can test node related features (e.g., taint a node, fail a node);
  • Update node and cluster config to propogate its effect to all reconcilation loop;
  • Update NodeTaintDetected condition time with node's tainted time; so that a pod is removed "closely" after configured duration after its tainted time;
  • Add taint tests in update_status_test.
    This test focuses on the testing the taint logic isolately;
  • Add taint test in replace_failed_process_groups_test.
    This test focuses on testing the interaction between taint function and other reconciliation loops
  • Add end-to-end test to taint a node in k8s cluster and verify the feature replaces pods on tainted node as expected.

Type of change

  • New feature (non-breaking change which adds functionality)

Testing

Unit tests and end-to-end test on EKS.

Do we need to perform additional testing once this is merged, or perform in a larger testing environment?
No.

Documentation

TODO: Update user manual and design doc.

Did you update relevant documentation within this repository?

If this change is adding new functionality, do we need to describe it in our user manual?

If this change is adding or removing subreconcilers, have we updated the core technical design doc to reflect that?

If this change is adding new safety checks or new potential failure modes, have we documented and how to debug potential issues?

Follow-up

Future work:

  • Consider to add a separate replacementTimeDuration to replace tainted node faster. Right now, cluster only replaces a failed process after 2 hours. The reaction time is too long for some taint key (e.g., maintenance or network disconnection).
  • Consider to reduce the detection latency of a tainted node. Right now, the detection latency is the latency of operator's reconciliation invocation plus operator's reconciliation's execution time. Since reconciliation may not be invoked for a long time (hour), operator may not detect a 10min taint quickly enough.

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: d8c466e
  • Duration 4:11:13
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 9f4d7fc4be8f898a8ec9c8bb39eb6f1e01d4eb7e
  • Duration 4:11:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Member

@johscheuer johscheuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make sure you are running make fmt lint ?

api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
controllers/update_status.go Outdated Show resolved Hide resolved
controllers/update_status.go Outdated Show resolved Hide resolved
e2e/test_operator/operator_test.go Outdated Show resolved Hide resolved
e2e/test_operator/operator_test.go Outdated Show resolved Hide resolved
internal/replacements/replacements_test.go Outdated Show resolved Hide resolved
@xumengpanda xumengpanda force-pushed the mx/taint-v2 branch 2 times, most recently from d9d7e02 to 7a30f2f Compare April 15, 2023 00:49
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 0f05ff78803c1b5a3a5afcb149e7c01d456ba783
  • Duration 4:11:30
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: d9d7e02563d16b404e289ae93f5c48489062d367
  • Duration 4:11:20
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 7a30f2f55c164389a753c0db7ebc63ee5488158c
  • Duration 4:11:26
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 9b190f87d510d7a32404eb4d74f18cf00a02bcf0
  • Duration 4:11:21
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: a3affc1ef8b6923d9c9911c664c2171e673c8ae8
  • Duration 4:11:10
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b20161ce2fccb1e16f648a8cab833ff7d25d9e05
  • Duration 4:11:16
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

mock-kubernetes-client/client/client.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Show resolved Hide resolved
controllers/admin_client_test.go Outdated Show resolved Hide resolved
e2e/test_operator/operator_test.go Outdated Show resolved Hide resolved
e2e/test_operator/operator_test.go Outdated Show resolved Hide resolved
e2e/test_operator/operator_test.go Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 30011c9f03b650e8cc50b542aa6e415cb03d3f79
  • Duration 4:11:02
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

api/v1beta2/foundationdbcluster_types.go Show resolved Hide resolved
config/rbac/role.yaml Show resolved Hide resolved
config/samples/deployment.yaml Outdated Show resolved Hide resolved
controllers/replace_failed_process_groups.go Show resolved Hide resolved
controllers/update_status.go Show resolved Hide resolved
controllers/replace_failed_process_groups_test.go Outdated Show resolved Hide resolved
controllers/replace_failed_process_groups_test.go Outdated Show resolved Hide resolved
@xumengpanda xumengpanda force-pushed the mx/taint-v2 branch 2 times, most recently from 9bee145 to dabc93d Compare April 19, 2023 02:36
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 9bee1450858503d2bbdc827243db06550801e7fd
  • Duration 0:05:34
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version 9bee1450858503d2bbdc827243db06550801e7fd
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: ca5f1d7d180fa7354a8a407d90a959e58ef8f003
  • Duration 4:11:15
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 7fbc72dd2d1d915778ef726859768f8e98bced5c
  • Duration 4:11:18
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: dabc93df0d14d0021dfbfb26443862075e1ddc5d
  • Duration 4:11:15
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 4b7c8491d2f6adc1bc437569fbe512ad66f42083
  • Duration 4:10:59
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: fca2015effd060842234ec0ec150d730309da426
  • Duration 0:05:40
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version fca2015effd060842234ec0ec150d730309da426
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Because PR build does not have the new CRD yet, the taint related e2e test
will always fail.
some taint tests are senesitive to the execution time of the test case.
In PR test, test takes longer than in dev env, which causes failure in PR test but
not reproducible in dev env.
Change flapping node test as flaky test as well
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 6cd82ded8a61bf92e67d6ffdad6f888d5a3e193e
  • Duration 2:32:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@xumengpanda
Copy link
Contributor Author

2023/05/05 22:03:56 reconciled name=fdb-cluster-57vzpdb4, namespace=mengxu-1drl2po1
• [6.331 seconds]
------------------------------
SSSSSSS
P [PENDING]
Operator when Changing the TLS setting should disable or enable TLS and keep the cluster available [e2e]
/root/src/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:567
------------------------------
SSSSSSSSSSSSSS
P [PENDING]
Operator when setting the empty config to true should stop all running processes [e2e]
/root/src/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:991
------------------------------
S
P [PENDING]
Operator when a process group is set to be blocked for removal should exclude the Pod but not remove the resources [e2e]
/root/src/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1089
------------------------------
[AfterSuite] 
/root/src/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:102
2023/05/05 22:04:12 start cleaning up chaos mesh client with 1 experiment(s)
2023/05/05 22:04:12 Start deleting skifoch4j44491j15l93xatpgqev4oc9
2023/05/05 22:04:12 Chaos skifoch4j44491j15l93xatpgqev4oc9 is deleted.
2023/05/05 22:04:12 finished all tests, start deleting namespace mengxu-1drl2po1
[AfterSuite] PASSED [15.045 seconds]
------------------------------
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
[ReportAfterSuite] PASSED [0.002 seconds]
------------------------------

Ran 5 of 30 Specs in 1379.051 seconds
SUCCESS! -- 5 Passed | 0 Failed | 3 Pending | 22 Skipped
PASS | FOCUSED
FAIL	github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator	1379.082s

Notable change
1. When both exact match key and wildcard key matches a node taint key, the wildcard key
will have no effect on the node taint key. Added test to verify that
2. Refactor replace_failed_processgroups_test.go
3. Change clusterrolebinding to rolebinding
4. Coding style and various cosmetic improvement
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b51b797fdddc299d5f2c1ebc92cd6ff6d633aaec
  • Duration 2:45:29
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 9f4ddc3
  • Duration 2:45:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 5032bc65eb7434bd2dc04de78f98da192ba579bb
  • Duration 2:58:01
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
config/deployment/rbac_role_binding.yaml Outdated Show resolved Hide resolved
config/samples/deployment.yaml Outdated Show resolved Hide resolved
controllers/update_status.go Show resolved Hide resolved
controllers/update_status.go Outdated Show resolved Hide resolved
controllers/update_status_test.go Outdated Show resolved Hide resolved
e2e/Makefile Outdated Show resolved Hide resolved
api/v1beta2/foundationdbcluster_types.go Outdated Show resolved Hide resolved
@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 0ec06c90b5b662ce5260373dceb764fb97647030
  • Duration 2:55:14
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 1cec691
  • Duration 2:25:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Member

@johscheuer johscheuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we revert the PREVIOUS_FDB_VERSION and remove/change the fmt.Print statements in our e2e test framework I'm okay with merging those changes to make sure we can move forward. We should note in our docs for this feature, that it's currently experimental, the requirements (node access) and the use case.

e2e/Makefile Outdated Show resolved Hide resolved
e2e/fixtures/fdb_cluster.go Outdated Show resolved Hide resolved
e2e/fixtures/fdb_cluster.go Outdated Show resolved Hide resolved
Copy link
Member

@johscheuer johscheuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are good. We should focus on finishing the open tasks and add documentation for this feature 👍

@xumengpanda xumengpanda merged commit d38d6bc into FoundationDB:main May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Idea: voluntary node evacuation
3 participants