Check if the database is available before doing any exclusion checks #1758

Merged

Conversation

johscheuer
Member

Description

We want to make sure that the database is available before we do any checks for the exclusion status based on the machine-readable status. If the database is unavailable, we could hit a case where processes are reported but their roles are not. In this case it is better to wait until the database is available again.

Type of change

Please select one of the options below.

  • Bug fix (non-breaking change which fixes an issue)

Discussion

I decided to return all processes as "not excluded" to prevent the operator from doing any additional checks with the exclude command, as the exclude command could make the unavailable database state even worse.
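To illustrate the behaviour described above, here is a minimal sketch and not the code from this PR. It assumes the machine-readable status exposes `client.database_status.available` and, per process, `address`, `excluded` and `roles`. The type `statusSubset` and the helper `fullyExcludedAddresses` are hypothetical names introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusSubset is a hypothetical type that mirrors only the fields of the
// machine-readable status this sketch needs.
type statusSubset struct {
	Client struct {
		DatabaseStatus struct {
			Available bool `json:"available"`
		} `json:"database_status"`
	} `json:"client"`
	Cluster struct {
		Processes map[string]struct {
			Address  string `json:"address"`
			Excluded bool   `json:"excluded"`
			Roles    []struct {
				Role string `json:"role"`
			} `json:"roles"`
		} `json:"processes"`
	} `json:"cluster"`
}

// fullyExcludedAddresses (hypothetical helper) reports for every requested
// address whether the process is excluded and no longer serves any role.
// If the database is unavailable, every address is reported as not excluded,
// because the role information in the status may be missing.
func fullyExcludedAddresses(rawStatus []byte, addresses []string) (map[string]bool, error) {
	var status statusSubset
	if err := json.Unmarshal(rawStatus, &status); err != nil {
		return nil, fmt.Errorf("could not parse status: %w", err)
	}

	// Default: nothing is considered excluded.
	result := make(map[string]bool, len(addresses))
	for _, address := range addresses {
		result[address] = false
	}

	// Database unavailable: skip all further checks and let the caller retry later.
	if !status.Client.DatabaseStatus.Available {
		return result, nil
	}

	for _, process := range status.Cluster.Processes {
		if _, requested := result[process.Address]; !requested {
			continue
		}
		result[process.Address] = process.Excluded && len(process.Roles) == 0
	}

	return result, nil
}
```

With this shape, a caller that sees only "not excluded" results simply waits and checks again in a later reconciliation, instead of falling back to the exclude command while the database is unavailable.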

Testing

Unit tests.

Documentation

Follow-up

@johscheuer johscheuer added the bug Something isn't working label Jul 25, 2023
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 10cbd4f
  • Duration 4:06:34
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 1c0de81
  • Duration 2:29:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: a45e1d1
  • Duration 4:06:34
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer force-pushed the check-if-database-is-available branch from 1c0de81 to 97b05ea Compare July 27, 2023 15:26
@johscheuer johscheuer requested a review from hfu94 July 27, 2023 16:19
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 97b05ea
  • Duration 2:27:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b9532f7
  • Duration 2:28:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 8a9c172
  • Duration 4:06:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 28, 2023
@johscheuer johscheuer reopened this Jul 28, 2023
@johscheuer
Member Author

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 8a9c172
  • Duration 4:06:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Seems like the exclusion was stuck for a bad Pod:

{"level":"info","ts":"2023-07-28T08:43:01Z","logger":"controller","msg":"Getting remaining removals to check for exclusion","namespace":"pr-292-4j8sviah","cluster":"fdb-cluster-dgoffk4t","reconciler":"controllers.removeProcessGroups","processGroupID":"log-3","reason":"missing address"}
{"level":"info","ts":"2023-07-28T08:43:01Z","logger":"controller","msg":"Incomplete exclusion still present in removeProcessGroups step","namespace":"pr-292-4j8sviah","cluster":"fdb-cluster-dgoffk4t","reconciler":"controllers.removeProcessGroups","processGroupID":"log-3","error":"process has no addresses, cannot safely determine if process can be removed"}

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 8a9c172
  • Duration 3:29:29
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 379872c
  • Duration 4:06:22
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 28, 2023
@johscheuer johscheuer reopened this Jul 28, 2023
@johscheuer
Member Author

All the test failures that I saw were related to one Pod being stuck in Pending. As the Pod has no IP address, the operator doesn't know whether it is safe to remove it. I created a radar to make both the test suite and the operator handle those cases better.

@johscheuer johscheuer merged commit d28bd2b into FoundationDB:main Jul 28, 2023
16 checks passed
@johscheuer johscheuer deleted the check-if-database-is-available branch July 28, 2023 15:40
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 379872c
  • Duration 2:21:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

johscheuer added a commit to johscheuer/fdb-kubernetes-operator that referenced this pull request Aug 2, 2023
…oundationDB#1758)

* Check if the database is available before doing any exclusion checks
Labels
bug Something isn't working

4 participants