[SURE-5515] Fleet is not upgrading properly on windows nodes in a cluster with windows workers #993

Closed
1 task done
slickwarren opened this issue Sep 20, 2022 · 5 comments · Fixed by #1505

Comments

@slickwarren

slickwarren commented Sep 20, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After upgrading Rancher from 2.6.8 to 2.6.9-rc1 (Fleet v0.3.11 -> v0.4.0-rc3) with an existing RKE2 cluster that has a Windows node, Fleet fails to upgrade because it keeps trying to run on the Windows node.

Expected Behavior

Fleet should upgrade successfully and keep working in a cluster with Windows nodes.

Steps To Reproduce

  1. Deploy a Windows cluster on Rancher v2.6.8, k8s v1.23.10+rke2r1
  2. Upgrade Rancher to 2.6.9-rc1
  3. Wait for Fleet in the downstream Windows cluster to finish upgrading

Environment

- Architecture:
- Fleet Version: v0.3.11
- Cluster: rke2
  - Provider: vsphere
  - Options: with windows worker pool
  - Kubernetes Version: v1.23.10+rke2r1

Logs

MountVolume.SetUp failed for volume "kube-api-access-9w88q" : chown c:\var\lib\kubelet\pods\bfdcae51-9200-440e-89c9-e55b9fb69ef7\volumes\kubernetes.io~projected\kube-api-access-9w88q\..2022_09_20_16_52_32.1316767762\token: not supported by windows

Anything else?

No response

@slickwarren slickwarren added this to the v2.6.9 milestone Sep 20, 2022
@Jono-SUSE-Rancher Jono-SUSE-Rancher modified the milestones: v2.6.9, v2.7.0 Sep 20, 2022
@slickwarren slickwarren changed the title Fleet is trying to deploy on windows nodes in a cluster with windows workers Fleet is not upgrading properly on windows nodes in a cluster with windows workers Sep 20, 2022
@mattfarina mattfarina removed this from the v2.7.0 milestone Oct 5, 2022
@manno manno added this to the v2.7.1 milestone Nov 3, 2022
@kkaempf kkaempf added the JIRA Must shout label Feb 14, 2023
@kkaempf kkaempf changed the title Fleet is not upgrading properly on windows nodes in a cluster with windows workers [SURE-5515] Fleet is not upgrading properly on windows nodes in a cluster with windows workers Feb 14, 2023
@kkaempf kkaempf modified the milestones: v2.7.2, 2023-Q2-v2.7x Mar 6, 2023
@sowmyav27 sowmyav27 added the QA/S label Mar 21, 2023
@slickwarren
Author

Still reproducible on v2.7.2-rc7.

@aiyengar2
Contributor

See rancher/rancher#39372 (comment) for more details

@manno
Member

manno commented May 3, 2023

I can confirm this by looking at the fleet-agent Deployment on any provisioned cluster (here seen from Rancher 2.7.0): securityContext.runAsUser is set, nodeSelector is not set, and affinity does not contain anything Windows-related.

So the Deployment rendered from the Helm chart doesn't schedule on Windows:

% helm template -n cattle-fleet-system --set apiServerURL=abc fleet-agent charts/fleet-agent
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - key: "cattle.io/os"
          value: "linux"
          effect: "NoSchedule"
          operator: "Equal"

But the Deployment built in https://github.com/rancher/fleet/blob/master/pkg/agent/manifest.go#L178-L209 does.
Apparently the toleration introduced in 2db5cb44 is ineffective?

How do we fix this? Do we just add the nodeSelector, or do we add to the built-in affinity rules?

      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux

Adding to affinity could also be done as a workaround via the cluster config.
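
On the toleration question above: in Kubernetes, a toleration only permits a pod to land on a node that carries a matching taint. It does not keep the pod away from untainted nodes, whereas a nodeSelector restricts the pod to nodes whose labels match. Here is a minimal sketch of that distinction, using a deliberately simplified model of the scheduler. The Node and PodSpec classes, can_schedule, and the example nodes and taints are hypothetical illustrations, not Fleet or Kubernetes code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    # Each taint is a (key, value, effect) tuple.
    taints: list = field(default_factory=list)


@dataclass
class PodSpec:
    node_selector: dict = field(default_factory=dict)
    # Each toleration is a (key, value, effect) tuple, i.e. operator "Equal".
    tolerations: list = field(default_factory=list)


def can_schedule(pod, node):
    """Simplified check: taints must be tolerated AND selector labels must match."""
    # A NoSchedule taint blocks the pod unless the pod tolerates it.
    taints_ok = all(
        taint in pod.tolerations
        for taint in node.taints
        if taint[2] == "NoSchedule"
    )
    # Every nodeSelector entry must match a node label exactly.
    selector_ok = all(
        node.labels.get(key) == value
        for key, value in pod.node_selector.items()
    )
    return taints_ok and selector_ok


linux_taint = ("cattle.io/os", "linux", "NoSchedule")

# Assumed example nodes: a tainted Linux node and an untainted Windows node.
linux = Node("linux-1", {"kubernetes.io/os": "linux"}, [linux_taint])
windows = Node("win-1", {"kubernetes.io/os": "windows"})

toleration_only = PodSpec(tolerations=[linux_taint])
with_selector = PodSpec(
    node_selector={"kubernetes.io/os": "linux"},
    tolerations=[linux_taint],
)

for pod_name, pod in [("toleration only", toleration_only),
                      ("with nodeSelector", with_selector)]:
    for node in (linux, windows):
        print(pod_name, "->", node.name, can_schedule(pod, node))
```

Under this model the toleration-only pod is accepted by the untainted Windows node, which would be consistent with the behavior reported in this issue. The variant with a nodeSelector is accepted only by the Linux node.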

@manno
Member

manno commented May 4, 2023

Additional QA

Problem

Using Fleet in an environment with mixed nodes (Linux + Windows) shows failures on the Windows nodes, because the Fleet agent cannot run on Windows.

Solution

Add a nodeSelector to the fleet-agent Deployment so that it only selects nodes labeled kubernetes.io/os: linux.
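
To illustrate what that change amounts to in the rendered manifest, here is a rough sketch that adds the selector to a Deployment represented as a plain dictionary. The helper name with_linux_node_selector and the trimmed-down example manifest are assumptions for illustration only. The actual fix lives in Fleet's Go code and is not reproduced here.

```python
import copy

LINUX_SELECTOR = {"kubernetes.io/os": "linux"}


def with_linux_node_selector(deployment):
    """Return a copy of the Deployment whose pod template selects Linux nodes."""
    patched = copy.deepcopy(deployment)
    pod_spec = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    # Keep any selector entries that are already present and add the OS label.
    selector = dict(pod_spec.get("nodeSelector") or {})
    selector.update(LINUX_SELECTOR)
    pod_spec["nodeSelector"] = selector
    return patched


# Trimmed-down stand-in for the fleet-agent Deployment (not the real manifest).
agent = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "fleet-agent", "namespace": "cattle-fleet-system"},
    "spec": {"template": {"spec": {"containers": [{"name": "fleet-agent"}]}}},
}

patched = with_linux_node_selector(agent)
print(patched["spec"]["template"]["spec"]["nodeSelector"])
```

The input manifest is left untouched, so the original and patched versions can be compared side by side.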

Testing

Engineering Testing

Manual Testing

I don't have access to a cluster with windows nodes.

Automated Testing

QA Testing Considerations

Fleet should still start on the linux nodes.

Regressions Considerations

@sbulage
Contributor

sbulage commented May 30, 2023

This is my observation:
I checked a cluster with 2 Linux nodes and 1 Windows node.
Label on Windows node: beta.kubernetes.io/: windows.
Label on Linux node: kubernetes.io/os: Linux

I see no fleet-agent pod scheduled on the Windows node.
