Print warning instead of error in case of unstable cluster #4315

Merged: 1 commit merged into crc-org:main on Oct 8, 2024

vyasgun commented Aug 12, 2024

Fixes: Issue #4284

Solution/Idea

Since the code doesn't exit, the error messaging might be confusing to users. Changed it to Warn

INFO Operator network is progressing
INFO Operator network is progressing
INFO Operator network is progressing
INFO Operator network is progressing
INFO Operator network is progressing
INFO Operator network is progressing
INFO Operator network is progressing
WARN Cluster is not ready: cluster operators are still not stable after 10m0.695631268s
INFO Adding crc-admin and crc-developer contexts to kubeconfig...
ERRO Cannot update kubeconfig: Head "https://oauth-openshift.apps-crc.testing:443": read tcp 127.0.0.1:60782->127.0.0.1:443: read: connection reset by peer
Started the OpenShift cluster.

The server is accessible via web console at:
  https://console-openshift-console.apps-crc.testing

Log in as administrator:
  Username: kubeadmin
  Password: 3NM8K-C5kvg-YTRW4-FhiUM

Log in as user:
  Username: developer
  Password: developer

Use the 'oc' command line interface:
  $ eval $(crc oc-env)
  $ oc login -u developer https://api.crc.testing:6443

Testing

Run crc start in a situation where the cluster operators do not become ready within the timeout. I reproduced this by cordoning the cluster node and running start again.

- Since the code doesn't exit, the error messaging might be confusing to users

openshift-ci bot commented Aug 12, 2024

@vyasgun: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | 65d4041 | link | false | `/test security` |

@praveenkumar
Member

@vyasgun can you check whether the exit code is non-zero in the following case?

ERRO Cannot update kubeconfig: Head "https://oauth-openshift.apps-crc.testing:443": read tcp 127.0.0.1:60782->127.0.0.1:443: read: connection reset by peer
Started the OpenShift cluster.

@vyasgun
Contributor Author

vyasgun commented Aug 13, 2024

@praveenkumar No. Do we want it to return a non-zero exit code?

@praveenkumar
Member

> @praveenkumar No. Do we want it to return a non-zero exit code?

For any error, yes, we should return non-zero; otherwise we should change it to a warning. But I think that if we are not able to update the kubeconfig file, we should tell the user how they can still access the cluster.

@vyasgun
Contributor Author

vyasgun commented Aug 13, 2024

@praveenkumar In the Start function, only 2 logging.Errorf() statements are used, and neither of them is followed by a non-zero return.
Also, kubeconfig is being updated in other places inside the function and all of them except the last one are returning an error. For example:
https://github.com/crc-org/crc/blob/main/pkg/crc/machine/start.go#L603
https://github.com/crc-org/crc/blob/main/pkg/crc/machine/start.go#L528

I'm not sure if there is a particular reason for these statements and the differences (or if it's just an oversight). Additionally, I think if updating kubeconfig is grounds for a non-zero return, so is an unstable cluster (to indicate a failure in Start). We should return these errors at the end and put any extra processing that might still be required in a defer so it's always executed.
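
To make the proposal concrete, here is a minimal sketch of the "return the errors at the end, keep remaining work in a defer" pattern. It is not the code in start.go: `start`, `waitForClusterStable`, `updateKubeconfig`, and the `finalize` callback are hypothetical names invented for illustration, and `errors.Join` assumes Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real steps in Start. Both fail here so the
// example shows what happens when several non-fatal problems occur.
func waitForClusterStable() error { return errors.New("cluster operators are still not stable") }
func updateKubeconfig() error     { return errors.New("connection reset by peer") }

// start runs every step, remembers failures instead of only logging them,
// and returns them joined at the end so the caller can exit non-zero.
func start(finalize func()) error {
	// Deferred work runs on every return path, including early returns.
	defer finalize()

	var errs []error
	if err := waitForClusterStable(); err != nil {
		errs = append(errs, fmt.Errorf("cluster is not ready: %w", err))
	}
	if err := updateKubeconfig(); err != nil {
		errs = append(errs, fmt.Errorf("cannot update kubeconfig: %w", err))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	finalized := false
	err := start(func() { finalized = true })
	fmt.Println("finalized:", finalized)
	fmt.Println("error:", err)
}
```

With this shape the later steps still run after an earlier failure, and the decision about the exit code moves to the caller instead of being lost in a log line.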

@cfergeau
Contributor

cfergeau commented Sep 5, 2024

> Additionally, I think if updating kubeconfig is grounds for a non-zero return, so is an unstable cluster (to indicate a failure in Start).

We could return a different error code for each of these cases when crc completes. With distinct error codes, callers that want to ignore 'cluster unstable' can do so.

The 'cannot update kubeconfig' message deserves to be made a lot more user-friendly :) It should explain what won't work when this fails (I think this only means kube contexts can't be used, and that an explicit login to the cluster will be needed).

@praveenkumar
Member

> Additionally, I think if updating kubeconfig is grounds for a non-zero return, so is an unstable cluster (to indicate a failure in Start).
>
> We could return a different error code for each of these cases when crc completes. With distinct error codes, callers that want to ignore 'cluster unstable' can do so.
>
> The 'cannot update kubeconfig' message deserves to be made a lot more user-friendly :) It should explain what won't work when this fails (I think this only means kube contexts can't be used, and that an explicit login to the cluster will be needed).

Yes, it should tell the user how to access the cluster (for example, export KUBECONFIG=$HOME/.crc/machine/crc/kubeconfig, or use oc --kubeconfig=$HOME/.crc/machine/crc/kubeconfig). They should be able to debug or check which cluster operator is not in the available state.

@cfergeau
Contributor

cfergeau commented Oct 1, 2024

We probably can get this in, and create follow-up issues for the various improvements that have been discussed during the review?

@praveenkumar
Member

@vyasgun Can you create a follow-up issue for what was discussed here? Once the follow-up issue is created, we can merge this one.

@vyasgun
Contributor Author

vyasgun commented Oct 8, 2024

Created a follow up issue: #4395

@praveenkumar praveenkumar merged commit ab7c92f into crc-org:main Oct 8, 2024
20 of 29 checks passed