Try to get the running version based on the reachable coordinators during an upgrade #2098

Merged

johscheuer
Member

Description

The test case "Operator HA Upgrades when no remote storage processes are restarted" sometimes fails: the coordinators report the new, upgraded version, but the cluster is stuck with the following message:

    "messages": [
      {
        "name": "status_incomplete_timeout",
        "description": "Timed out fetching cluster status."
      }
    ]

The idea for this change is to derive the running version from the reachable coordinators. If a quorum of coordinators can be reached with a specific fdbcli version, the operator assumes that the cluster is running that version. This unblocks the operator and allows it to move forward. If the running version is not updated to the new desired version, the reconciler checkClientCompatibility will block, because the operator is not able to check for incompatible clients.
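
The quorum check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the operator's actual implementation: `probeFunc` and `runningVersion` are hypothetical names, and the probe stands in for whatever call contacts a coordinator with a given fdbcli version.

```go
package main

// probeFunc reports whether a coordinator answered when contacted with an
// fdbcli of the given version. Hypothetical: it stands in for the real call.
type probeFunc func(version, coordinator string) bool

// runningVersion tries each candidate version in order and returns the first
// one for which a strict majority (quorum) of coordinators is reachable.
// The boolean is false when no candidate version reaches a quorum.
func runningVersion(candidates, coordinators []string, probe probeFunc) (string, bool) {
	// A quorum is more than half of all coordinators.
	quorum := len(coordinators)/2 + 1

	for _, version := range candidates {
		reachable := 0
		for _, coordinator := range coordinators {
			if probe(version, coordinator) {
				reachable++
			}
		}

		if reachable >= quorum {
			return version, true
		}
	}

	return "", false
}
```

During an upgrade the candidates would be the currently recorded running version and the desired version. Because a quorum is a strict majority, two versions cannot both reach it as long as each coordinator answers only one of them, and a caller that gets `false` back can simply keep the previously known running version.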

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Discussion

Testing

Will be running multiple e2e tests for this change.

Documentation

Must be updated.

Follow-up

@johscheuer johscheuer added the enhancement New feature or request label Jul 5, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: cab2fe7
  • Duration 2:47:58
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 5, 2024
@johscheuer johscheuer reopened this Jul 5, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: cab2fe7
  • Duration 3:08:20
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 5513064
  • Duration 2:46:33
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 8, 2024
@johscheuer johscheuer reopened this Jul 8, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 5513064
  • Duration 2:48:21
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer marked this pull request as ready for review July 8, 2024 12:50
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 3e82d77
  • Duration 2:48:46
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 49abd36
  • Duration 2:54:58
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 8, 2024
@johscheuer johscheuer reopened this Jul 8, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 3:00:37
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 49abd36
  • Duration 4:07:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 9, 2024
@johscheuer johscheuer reopened this Jul 9, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 2:44:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 9, 2024
@johscheuer johscheuer reopened this Jul 9, 2024
Contributor

@nicmorales9 nicmorales9 left a comment

it seems legit but I want to make sure I understand things first

controllers/cluster_controller.go (review thread resolved)
controllers/cluster_controller.go (review thread resolved)
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 2:58:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer
Member Author

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 2:58:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
[Container] 2024/07/09 14:55:01.358154 Running command if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/logs/*.log; export fail_test=true; fi
TESTS FAILED SEE THESE LOGS:

/codebuild/output/src1406809119/src/github.com/FoundationDB/fdb-kubernetes-operator/logs/test_operator_upgrades.log
• [FAILED] [448.057 seconds]
Operator Upgrades one process is marked for removal and is stuck in terminating state [It] Upgrade from 7.1.57 to 7.3.33 [e2e, pr]
/codebuild/output/src1406809119/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/upgrade_test_configuration.go:115

  [FAILED] Timed out after 300.001s.
  Expected
      <*int64 | 0x0>: nil
  not to be nil
  In [It] at: /codebuild/output/src1406809119/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_upgrades/operator_upgrades_test.go:569 @ 07/09/24 13:15:55.085
------------------------------

The failure happened at https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/e2e/test_operator_upgrades/operator_upgrades_test.go#L557-L569, which is unrelated to the PR changes.

@johscheuer johscheuer closed this Jul 9, 2024
@johscheuer johscheuer reopened this Jul 9, 2024
Contributor

@nicmorales9 nicmorales9 left a comment

everything looks right in theory but I still am confused 😆

fdbclient/admin_client.go (review thread resolved)
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 2:34:00
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 10, 2024
@johscheuer johscheuer reopened this Jul 10, 2024
Contributor

@nicmorales9 nicmorales9 left a comment

makes sense now, thanks for sticking with me! I think it would be good to improve the comments as per the one remaining comment but otherwise LGTM

fdbclient/admin_client.go (review thread resolved)
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b5c3674
  • Duration 2:38:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer merged commit 1291561 into FoundationDB:main Jul 10, 2024
36 checks passed
@johscheuer johscheuer deleted the get-running-version-from-coordinators branch July 10, 2024 12:34
3 participants