OCPBUGS-15200: Filter out shallowly UpdateEffectNone errors from a MultipleErrors message in the Failing condition
#1050
base: master
@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
I would like to test this on a live cluster (edit: and fix the failing CI), so I am putting this PR on hold for the time being.
/hold
/uncc LalatenduMohanty
Approach SGTM 👍
I have not looked at the code closely yet, but one piece to check for possible interaction is #1041, which renders all reconciliation problems (including the … If possible, we'd like to keep …
/hold
I am reworking the PR.
e11d635 to 8b4d632
@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
Title changed from "…UpdateEffectNone errors from the summarized task graph error" to "…UpdateEffectNone errors from the Failing condition"
approach lgtm, some code readability nits + what Trevor says ;)
4528f98 to 411f084
…UpdateEffectNone errors from a MultipleErrors message in the Failing condition
All jobs have
/test all
dfc9b34 to 775a7d8
/unhold
However, I would start the QE pre-merge testing only after the PR is approved and lgtm-ed. The code touches a few sensitive places, and more changes may follow code review.
ONLY a couple of NITs.
…erVersionStatus

This commit adds tests for setting the Failing condition through the `updateClusterVersionStatus` function, to ensure no functionality is lost with the upcoming changes.
…n MultipleErrors in Failing condition

Various errors get propagated to users, such as the summarized task graph error, for example in the form of the message in the Failing condition. However, update errors set with the update effect UpdateEffectNone can confuse users: these primarily informational messages are displayed together with valid update errors that heavily impact the update. This can result in a message such as:

{ "lastTransitionTime": "2023-06-20T13:40:12Z", "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration", "reason": "MultipleErrors", "status": "True", "type": "Failing" }

The Failing condition is not true because of the UpdateEffectNone error ("Cluster operator authentication is updating versions"), but its message still gets displayed.

This commit makes sure that update errors that do not heavily affect the update are, to an extent, removed from the MultipleErrors error in the Failing condition message. The filtered-out errors are still displayed in the logs and in other places, such as the ReconciliationIssues condition.

The original code correctly handles situations where the status failure is an UpdateEffectNone error, and the new changes leave such errors as they are. If the MultipleErrors error contains only UpdateEffectNone errors, it is left unchanged, to preserve the original logic and keep the commit simple. The goal of this commit is to remove unimportant messages from MultipleErrors errors that also contain valid messages in the Failing condition.

The current code contains an override to set the Failing condition when the history is empty or the CVO is reconciling, and this commit keeps that logic functional. This means the filtering is only applied when the history is not empty and the CVO is not reconciling the payload.
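A minimal Go sketch of the filtering rules described in this commit message, under stated assumptions: `UpdateEffectType`, `UpdateError`, `filterNoneEffectErrors`, and `failingMessage` are hypothetical, simplified stand-ins, not the CVO's actual types or functions. The sketch drops `UpdateEffectNone` errors from a multi-error list, leaves the list unchanged when every error is `UpdateEffectNone`, and skips filtering entirely when the history is empty or the CVO is reconciling.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// UpdateEffectType and UpdateError are hypothetical, simplified stand-ins
// for the CVO's update error types; the real definitions differ.
type UpdateEffectType string

const (
	UpdateEffectNone UpdateEffectType = "None"
	UpdateEffectFail UpdateEffectType = "Fail"
)

type UpdateError struct {
	UpdateEffect UpdateEffectType
	Message      string
}

func (e *UpdateError) Error() string { return e.Message }

// filterNoneEffectErrors drops errors whose effect is UpdateEffectNone.
// If every error is UpdateEffectNone, the input is returned unchanged.
func filterNoneEffectErrors(errs []error) []error {
	var kept []error
	for _, err := range errs {
		var uErr *UpdateError
		if errors.As(err, &uErr) && uErr.UpdateEffect == UpdateEffectNone {
			continue
		}
		kept = append(kept, err)
	}
	if len(kept) == 0 {
		return errs
	}
	return kept
}

// failingMessage builds a Failing-condition-style message. Filtering is
// skipped when the history is empty or the CVO is reconciling, so that the
// existing override keeps working.
func failingMessage(errs []error, historyEmpty, reconciling bool) string {
	if !historyEmpty && !reconciling {
		errs = filterNoneEffectErrors(errs)
	}
	switch len(errs) {
	case 0:
		return ""
	case 1:
		return errs[0].Error()
	}
	lines := make([]string, 0, len(errs))
	for _, err := range errs {
		lines = append(lines, "* "+err.Error())
	}
	return "Multiple errors are preventing progress:\n" + strings.Join(lines, "\n")
}

func main() {
	errs := []error{
		&UpdateError{UpdateEffect: UpdateEffectNone, Message: "Cluster operator authentication is updating versions"},
		&UpdateError{UpdateEffect: UpdateEffectFail, Message: `Could not update customresourcedefinition "alertingrules.monitoring.openshift.io" (512 of 993): the object is invalid, possibly due to local cluster configuration`},
	}
	// Regular update: the informational UpdateEffectNone error is dropped.
	fmt.Println(failingMessage(errs, false, false))
	// Reconciling: the override applies and the message is not filtered.
	fmt.Println(failingMessage(errs, false, true))
}
```

Reducing a filtered list to a single error's plain message is a simplification made for this sketch; how the actual code reports the remaining errors may differ.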
Because UpdateError errors are now filtered before the Failing condition is set, the TestCVO_ParallelError test needs to be updated: its errors are rightfully filtered out, as their UpdateEffect is None. This commit also takes the opportunity to change the UpdateEffect of one of the errors so that the filtering is tested there as well.
90cd133 to 7676d46
/retest
/lgtm
/hold
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: Davoska, hongkailiu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
/override ci/prow/e2e-hypershift
The PR doesn't functionally affect the CVO. The PR only modifies the …
/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change
@Davoska: Overrode contexts on behalf of Davoska: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change, ci/prow/e2e-hypershift
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Davoska: all tests passed!
/unhold
Testing the PR might be a little bit interesting/tricky. The PR's goal is to remove unimportant messages from the … More precisely, the PR should remove … These messages show up in the Failing condition only when another, more serious issue is present as well. This means an update must be in progress, the CVO must be having trouble progressing with the update, and some cluster operators must also be updating at that moment. Launching an update may not be enough to exercise the new changes, as the update might proceed without problems. In my previous testing, I set a cluster operator to be permanently degraded. The CVO eventually reached the run-level of that operator and declared an error …
Test scenario: make a cluster operator (authentication) degraded.

Original failure:
- Install a 4.17 cluster and degrade the authentication cluster operator.
- Trigger an upgrade to a version that doesn't contain the PR changes.
- The upgrade proceeds with the error, and the CVO reports the error.

Expected/new failure:
- Trigger an upgrade to a version that contains the PR changes.
- The upgrade is not triggered, and the CVO reports an error.
Various errors get propagated to users, such as the summarized task graph error, for example in the form of the message in the Failing condition. However, update errors set with the update effect `UpdateEffectNone` can confuse users, as these primarily informational messages get displayed together with valid update errors that heavily impact the update. This can result in a message such as:

The Failing condition is not true because of the `UpdateEffectNone` error (`"Cluster operator authentication is updating versions"`), but its message still gets displayed.
This PR makes sure that update errors that do not heavily affect the update are, to an extent, removed from the Failing condition message.
This pull request references https://issues.redhat.com/browse/OCPBUGS-15200