feat: set defaults for ignoredUnrecoverableEvents operator config #1310

mkuznyetsov · 2024-08-22T13:54:23Z

What does this PR do?

Adds the FailedScheduling event to the default ignoredUnrecoverableEvents list in the operator config.

(This PR is an alternative to #1306.)

The relevant docs should also be updated:
https://eclipse.dev/che/docs/stable/administration-guide/configuring-machine-autoscaling/#_when_the_autoscaler_adds_a_new_node

What issues does this PR fix or reference?

#1280

Is it tested? How?

Create a workspace whose resource requests/limits exceed what the cluster can provide (modified samples/plain.yaml):

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: plain-devworkspace
spec:
  started: true
  routingClass: 'basic'
  template:
    components:
      - name: web-terminal
        container:
          image: quay.io/wto/web-terminal-tooling:next
          memoryRequest: 1000Gi
          memoryLimit: 1000Gi
          mountSources: true
          command:
           - "tail"
           - "-f"
           - "/dev/null"

Check the workspace status. The operator keeps trying to start the workspace until it times out after 5 minutes:

$ kdw get dw
NAME                 DEVWORKSPACE ID             PHASE    INFO
plain-devworkspace   workspace8e15dba59ab04607   Failed   DevWorkspace failed to progress past step 'Waiting for workspace deployment' for longer than timeout (5m). Ignored events: Detected unrecoverable event FailedScheduling: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod...
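
For reviewers less familiar with this setting, the status above is what you would expect if event reasons are simply matched against the configured list. The following is a minimal sketch of that check, not the operator's actual code: isIgnoredEvent is a hypothetical helper, and FailedMount is used only as an example of a reason that is not in the list.

```go
package main

import "fmt"

// isIgnoredEvent reports whether an event reason appears in the list of
// ignored unrecoverable events. Hypothetical helper for illustration only;
// the operator's real implementation may differ.
func isIgnoredEvent(reason string, ignored []string) bool {
	for _, r := range ignored {
		if r == reason {
			return true
		}
	}
	return false
}

func main() {
	// Assumed default after this PR: FailedScheduling is ignored.
	ignored := []string{"FailedScheduling"}

	// An ignored event does not fail the workspace immediately; startup
	// keeps going until the progress timeout is reached.
	fmt.Println(isIgnoredEvent("FailedScheduling", ignored)) // true

	// A reason that is not listed would still be treated as unrecoverable.
	fmt.Println(isIgnoredEvent("FailedMount", ignored)) // false
}
```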

PR Checklist

E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
- v8-devworkspace-operator-e2e: DevWorkspace e2e test
- v8-che-happy-path: Happy path for verification integration with Che

Signed-off-by: Mykhailo Kuznietsov <mkuznets@redhat.com>

openshift-ci · 2024-08-22T13:54:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkuznyetsov
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

AObuchow

Thanks for the PR @mkuznyetsov :)
The format CI check is currently failing, so please run make fmt. Make sure goimports is installed first: go install golang.org/x/tools/cmd/goimports@latest

Some thoughts:

I think there are 3 important cases to test:

Is the FailedScheduling event ignored by default? Your current test case covers this.
Can users remove the FailedScheduling event from the ignoredUnrecoverableEvents list? In my testing, this is possible by setting ignoredUnrecoverableEvents to an empty array ([]). However, adding just the bare key ignoredUnrecoverableEvents: with no value won't work. To test this, run kubectl edit dwoc -n $NAMESPACE:

The following works:

apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.49.2.nip.io
    defaultRoutingClass: basic
  workspace:
+    ignoredUnrecoverableEvents: []
    imagePullPolicy: Always
    progressTimeout: 60s
kind: DevWorkspaceOperatorConfig

The following will not work:

apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.49.2.nip.io
    defaultRoutingClass: basic
  workspace:
+    ignoredUnrecoverableEvents:
    imagePullPolicy: Always
    progressTimeout: 60s
kind: DevWorkspaceOperatorConfig

IMO, this behaviour is acceptable.
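
One plausible explanation for the difference (an assumption on my part, not verified against the operator code) is that an empty array decodes to a non-nil, zero-length slice, while a bare key decodes to a nil slice that looks the same as "not set". The sketch below shows this with encoding/json from the standard library; workspaceConfig and decodeEvents are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workspaceConfig is a trimmed, hypothetical stand-in for the workspace
// section of the operator config.
type workspaceConfig struct {
	IgnoredUnrecoverableEvents []string `json:"ignoredUnrecoverableEvents,omitempty"`
}

// decodeEvents decodes a JSON document and returns the events slice as-is,
// so callers can tell a nil slice from an empty one.
func decodeEvents(raw string) ([]string, error) {
	var cfg workspaceConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, err
	}
	return cfg.IgnoredUnrecoverableEvents, nil
}

func main() {
	// "ignoredUnrecoverableEvents: []" in YAML corresponds to an empty JSON array.
	empty, _ := decodeEvents(`{"ignoredUnrecoverableEvents": []}`)
	// A bare "ignoredUnrecoverableEvents:" key in YAML corresponds to JSON null.
	null, _ := decodeEvents(`{"ignoredUnrecoverableEvents": null}`)

	fmt.Println(empty != nil, len(empty)) // true 0
	fmt.Println(null != nil, len(null))   // false 0
}
```

If the config merge only applies the user's value when the slice is non-nil, the empty array would clear the default while the bare key would leave it in place, which matches what I saw.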

What happens when we add an extra ignoredUnrecoverableEvent? Does it merge the user-provided event(s) with the default event list (which contains FailedScheduling)? Or does it overwrite the default list with the user-provided list?

Since the DWOC CR doesn't currently show that the FailedScheduling event is being ignored, I would expect it to overwrite the default list with the user-provided list.

However, merging the default event list with the user-provided list might make sense if we use Kubebuilder annotations to set the default value at the CR level as well.
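
To make the two options concrete, here is a minimal sketch of both behaviours. The function names and the defaultIgnoredEvents variable are hypothetical and only illustrate the semantics being discussed; they are not taken from the operator code.

```go
package main

import "fmt"

// defaultIgnoredEvents is the assumed default list after this PR.
var defaultIgnoredEvents = []string{"FailedScheduling"}

// overwriteEvents returns the user list whenever it is set (non-nil),
// even if it is empty; otherwise it falls back to the defaults.
func overwriteEvents(defaults, user []string) []string {
	if user != nil {
		return user
	}
	return defaults
}

// mergeEvents returns the union of both lists, defaults first,
// without duplicates.
func mergeEvents(defaults, user []string) []string {
	seen := make(map[string]bool, len(defaults)+len(user))
	merged := make([]string, 0, len(defaults)+len(user))
	for _, list := range [][]string{defaults, user} {
		for _, e := range list {
			if !seen[e] {
				seen[e] = true
				merged = append(merged, e)
			}
		}
	}
	return merged
}

func main() {
	user := []string{"FailedMount"} // example user-provided reason

	fmt.Println(overwriteEvents(defaultIgnoredEvents, user)) // [FailedMount]
	fmt.Println(mergeEvents(defaultIgnoredEvents, user))     // [FailedScheduling FailedMount]
}
```

Note that with merge semantics as sketched here, users could no longer remove FailedScheduling by setting an empty list, so the second test case above would stop working.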

AObuchow · 2024-08-22T16:26:35Z

apis/controller/v1alpha1/devworkspaceoperatorconfig_types.go

-	// if a transient cluster issue is triggering false-positives (for example, if
-	// the cluster occasionally encounters FailedScheduling events). Events listed
-	// here will not trigger DevWorkspace failures.
+	// be ignored when deciding to fail a DevWorkspace startup.


I'm not entirely sure we need to mention the cluster auto-scaler in DWO (or rewrite the docs here). It might be better to mention this in the Che Cluster CRD documentation, since the ignoredUnrecoverableEvents can be configured from the Che Cluster CRD.

Instead, I would suggest:

Mentioning "By default, the FailedScheduling event is ignored"

Removing the "(for example, if the cluster occasionally encounters FailedScheduling events)" since this example is no longer valid now that the FailedScheduling event is ignored by default

AObuchow · 2024-08-22T16:40:02Z

apis/controller/v1alpha1/devworkspaceoperatorconfig_types.go

+	// For example, a FailedScheduling event, that occurs when workspace cannot start
+	// due to exceeding available resources, should not fail the workspace startup, if there is
+	// an autoscaler configured on the cluster, and we want to wait until it provisions additional resources.
+	// FailedScheduling event can also occur as a false-positive, as a result of a transient cluster issue.


I suggest experimenting with kubebuilder annotations for the IgnoredUnrecoverableEvents field.

We should try setting the default array value. I think this would be done with +kubebuilder:default:={"FailedScheduling"}

I believe that should be enough to populate the IgnoredUnrecoverableEvents list in the DWOC. Make sure you re-generate the CRDs in a separate commit by running: make update_devworkspace_api update_devworkspace_crds generate_all
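
Roughly, I'd expect the field to look like the sketch below. This is untested: WorkspaceConfig here is a trimmed, hypothetical stand-in for the real type, and the doc comment wording is only a placeholder.

```go
package main

import "fmt"

// WorkspaceConfig is a trimmed, hypothetical stand-in for the real config
// type in devworkspaceoperatorconfig_types.go.
type WorkspaceConfig struct {
	// IgnoredUnrecoverableEvents lists event reasons that should be ignored
	// when deciding to fail a DevWorkspace startup.
	// By default, the FailedScheduling event is ignored.
	// +kubebuilder:default:={"FailedScheduling"}
	// +optional
	IgnoredUnrecoverableEvents []string `json:"ignoredUnrecoverableEvents,omitempty"`
}

func main() {
	// The marker only affects the generated CRD schema. A Go zero value is
	// still a nil slice; the default is filled in by the API server.
	cfg := WorkspaceConfig{}
	fmt.Println(cfg.IgnoredUnrecoverableEvents == nil) // true
}
```

Since the marker only changes the generated schema, any defaulting done in Go code would still need to handle a nil slice.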

Something to note: This entire PR might be dropped and re-implemented in Che-Operator if we can get the kubebuilder approach working. We'd want Che admins to see that the FailedScheduling event is ignored by default, and there would be no advantage to duplicating this code change in both DWO and Che-Operator (unless users who run DWO in isolation want this feature; however, that is not the reason we're currently resolving #1280).

feat: set defaults for ignoredUnrecoverableEvents operator config

4805a16

Signed-off-by: Mykhailo Kuznietsov <mkuznets@redhat.com>

mkuznyetsov requested review from AObuchow, dkwon17 and ibuziuk as code owners August 22, 2024 13:54

mkuznyetsov mentioned this pull request Aug 22, 2024

feat: remove FailedScheduling event from list of unrecoverable worksp… #1306

Closed

3 tasks

AObuchow reviewed Aug 22, 2024

AObuchow mentioned this pull request Sep 6, 2024

Improve documentation for Webhook deployment configuration #1312

Merged

3 tasks

mkuznyetsov mentioned this pull request Sep 9, 2024

feat: set default ignoredUnrecoverableEvents eclipse-che/che-operator#1897

Merged

10 tasks

mkuznyetsov closed this Sep 18, 2024
