
Guide for adding windows node: RBAC config not found #261

Closed
twity1337 opened this issue Dec 18, 2022 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


twity1337 commented Dec 18, 2022

Describe the bug
In the guide for adding Windows nodes (link), in the section "Getting started: Adding a Windows Node to Your Cluster", there is a dead link to the RBAC file.
https://raw.githubusercontent.com/kubernetes-sigs/sig-windows-tools/master/kubeadm/flannel/kube-flannel-rbac.yml -> Results in 404

Therefore, step 5 does not succeed and flannel (in the pod "kube-flannel-ds-windows-...") fails with the following error:

Starting flannel
I1219 00:36:55.650017    6060 alivpc_windows.go:22] AliVpc is not supported on this platform
I1219 00:36:55.651310    6060 awsvpc_windows.go:22] AWS VPC is not supported on this platform
I1219 00:36:55.651310    6060 gce_windows.go:45] GCE is not supported on this platform
I1219 00:36:55.652126    6060 ipip_windows.go:23] ipip is not supported on this platform
I1219 00:36:55.652126    6060 ipsec_windows.go:20] ipsec is not supported on this platform
I1219 00:36:55.652126    6060 tencentvpc_windows.go:22] TencentVpc is not supported on this platform
I1219 00:36:55.652126    6060 udp_windows.go:22] udp is not supported on this platform
I1219 00:36:55.653902    6060 main.go:456] Searching for interface using 172.18.80.101
I1219 00:36:55.660936    6060 main.go:533] Using interface with name Ethernet and address 172.18.80.101
I1219 00:36:55.660936    6060 main.go:550] Defaulting external address to interface address (172.18.80.101)
E1219 00:36:55.737636    6060 main.go:251] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-windows-amd64-qbw26': pods "kube-flannel-ds-windows-amd64-qbw26" is forbidden: User "system:serviceaccount:kube-system:flannel" cannot get resource "pods" in API group "" in the namespace "kube-system"
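
(The "forbidden" error above means the flannel service account is missing its RBAC rules. A minimal check for this, not part of the original report, using the service account name taken from the error message:)

# Ask the API server whether the flannel service account may read pods:
kubectl auth can-i get pods \
  --as=system:serviceaccount:kube-system:flannel \
  -n kube-system
# Prints "no" while the RBAC manifest is missing, and "yes" once it has been applied.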

If somebody could point me to a complete, working guide for setting up flannel on Windows, I would highly appreciate it.

To Reproduce
Steps to reproduce the behavior:

  • Follow the steps in the guide for adding Windows nodes.

Expected behavior
A clear and concise description of what you expected to happen.

Kubernetes (please complete the following information):

  • Windows Server version: 2019 Version 1809
  • Kubernetes Version: 1.25.3
  • CNI: 0.2.0
@twity1337 changed the title from "Guid for adding windows node: RBAC config not found" to "Guide for adding windows node: RBAC config not found" on Dec 18, 2022

Mik4sa commented Dec 19, 2022

@twity1337 (Author) commented:

Thank you very much.
Furthermore I think the line

curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml | sed 's/KUBE_PROXY_VERSION/v1.25.3/g' | kubectl apply -f -

should read (with KUBE_PROXY_VERSION replaced by VERSION in the sed pattern)

curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml | sed 's/VERSION/v1.25.3/g' | kubectl apply -f -

Right?
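
(A quick way to confirm which placeholder the downloaded manifest actually uses before substituting; this is a sketch, not part of the original comment:)

# Download the manifest once and look for the literal placeholder token:
curl -sL https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml \
  | grep -n 'VERSION'
# Whichever token shows up (VERSION or KUBE_PROXY_VERSION) is what the sed pattern has to match.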


Mik4sa commented Dec 19, 2022

I just used the yaml files from master and not from the latest release

@twity1337 (Author) commented:

What have you done to make it work in the end?
I followed the guide, but in the end kube-proxy fails either with the message hcs::CreateComputeSystem kube-proxy: The directory name is invalid. (on Windows Server 2019) or with error creating endpoint hcnCreateEndpoint failed in Win32: IP address is either invalid or not part of any configured subnet(s). (on Windows Server 2022).

It does not matter what kind of CNI config for NAT I apply at C:\etc\cni\net.d\0-containerd-nat.json. (I am assuming I have to create it myself on the node, since it wasn't deployed automatically.)
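
(For reference, a minimal sketch of what such a nat CNI config typically looks like, based on the example in containerd's Windows setup documentation; the interface name "Ethernet" and the 172.19.0.0/16 subnet and gateway are placeholder assumptions, not values from this thread:)

{
    "cniVersion": "0.2.0",
    "name": "nat",
    "type": "nat",
    "master": "Ethernet",
    "ipam": {
        "subnet": "172.19.0.0/16",
        "routes": [ { "GW": "172.19.0.1" } ]
    },
    "capabilities": { "portMappings": true, "dns": true }
}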


Mik4sa commented Dec 19, 2022

As far as I remember, that was everything. So, apart from the two things mentioned above, I followed the guide without any extras.
Since I'm using k8s 1.26 I also had to take PR #259 into account.

Also, I didn't change any CNI config; I stayed with the defaults.


twity1337 commented Dec 19, 2022

Okay, thanks then. Actually, I'm facing a lot of issues setting up a Kubernetes cluster under Windows with containerd as the runtime.
Maybe the fact that I run the nodes in Hyper-V VMs is causing some trouble with the container runtime.

@fabi200123 (Contributor) commented:

@twity1337 I have updated the guide in this repo and tested it for both flannel and Calico. It works fine for me now on v1.25.3. Let me know if you still have any problems with this. Also, the scripts are going to be updated soon in #262


Mik4sa commented Dec 21, 2022

@twity1337 I noticed that I still had some problems running some pods on that node. So I created my own fork and tried to fix all of those things. Maybe you want to have a look, and maybe this helps you solve your problems (though I never had the errors you currently seem to have).
I updated the guide, check it out: https://github.com/Mik4sa/sig-windows-tools/blob/flannel-hostprocess/guides/guide-for-adding-windows-node.md

Note: I had to create my own images for flannel-hostprocess and kube-proxy, which are referenced in the updated .yaml files as well. Keep that in mind before executing them.

@twity1337 (Author) commented:

@Mik4sa Thanks for sharing that link; unfortunately, I don't have access to it.

How do you both run the worker node? I'm running it on Hyper-V, so I'm wondering if there are any known issues with running the Windows worker node on Hyper-V.
So my (development) cluster used to look like this:

  • Controlplane - Debian Bullseye, Hyper-V VM
  • Worker Node - Windows Server 2019, Hyper-V VM
    both running on a Windows Server 2019 machine as the Hyper-V host, where I don't have full admin rights.

I'm currently trying to set up everything on bare metal, while facing some other issues with my local setup. However, I'll keep you updated.


Mik4sa commented Jan 5, 2023

That's because all my required changes were merged and I deleted my fork. You should now be able to follow the master branch as it is right now.

I have one control plane with Ubuntu 22.04 and a worker node with Windows Server 2022. Both are real machines, no VMs.

Note: About one year ago I tested this on Hyper-V on my Windows 10 (or 11?) machine. Back then I used Kubernetes 1.23.x and Docker (with non-process images). That worked so far


twity1337 commented Jan 9, 2023

Thanks for your detailed answer, @Mik4sa.
I want to avoid using Docker, because it is already deprecated.

However, after setting everything up on my private physical machines with an evaluation release of Windows Server 2019, I was able to get it halfway running.
At least the pod gets created now. However, since the step of copying the C:\run\flannel\subnet.env file to the worker node is only mentioned in an "Info-Note" box, I overlooked it before joining the node. After running "kubeadm reset", deleting the node object on the control plane, deleting the created files on the C: drive of the worker node, copying the subnet file, and rejoining the node with "kubeadm join", I still get an error:

root@controlplane:/# kubectl logs -n kube-system kube-proxy-windows-c5hrw 
Write files so the kubeconfig points to correct locations


    Directory: C:\var\lib


Mode                LastWriteTime         Length Name                                                                  
----                -------------         ------ ----                                                                  
d-----       09.01.2023     03:44                kube-proxy                                                            
Finding sourcevip
Cannot index into a null array.
At C:\C\9a841a2e8684bdbdc81630803f1b4e51dc2b9bb025b039df935b228c232e5888\kube-proxy\start.ps1:19 char:9
+         $subnet = $hnsNetwork.Subnets[0].AddressPrefix
+         ~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : NullArray

As you can see, the HNS network doesn't seem to have a subnet configured, so the access in the PowerShell script fails. What am I missing?
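
(One way to inspect what HNS actually created on the worker node; a sketch that assumes the hns.psm1 helper module from the microsoft/SDN repository is available on the node, which is not mentioned in the original comment:)

# On the Windows worker node, in an elevated PowerShell session:
Import-Module .\hns.psm1                 # helper module from the microsoft/SDN repo (assumption)
Get-HnsNetwork | Select-Object Name, Type, Subnets
# A working flannel setup should show a network with at least one subnet;
# an empty Subnets list matches the NullArray error in the log above.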

Let me investigate a little more whether Hyper-V is really the cause of those error messages that I got in the first place. After isolating the error I might open an issue, probably in the containerd repo.


Mik4sa commented Jan 9, 2023

What's the content of the two sourcevip files on your worker node?

Note: When I was experimenting with resetting and rejoining my worker node, I had to carefully revert everything that was done by the Install and Prepare scripts. Otherwise I later got errors in various situations, such as complaints that things already exist.


twity1337 commented Jan 9, 2023

What's the content of the two sourcevip files on your worker node?

What files are you talking about? The directory "C:\sourcevip" on my worker node is empty.


Mik4sa commented Jan 9, 2023

Then this is your problem. There should be two files, sourceVip.json and sourceVipRequest.json. You might want to check why these two are missing.

Edit: Oh, I'm sorry. These files get created after you resolve your current problem.


Mik4sa commented Jan 11, 2023

Interestingly, I now got the same error you described at the beginning of this issue. It started after we rebooted the control plane (Linux) and the worker node (Windows). I did it simultaneously, so I can't say which one, if not both, was the cause. I'm going to have a look at it tomorrow or the day after. Maybe I'll find something that helps you as well.

@twity1337 (Author) commented:

Okay, so I managed to make it work now (on bare metal, Windows Server 2022). For some reason, the RBAC file was not applied, and therefore the HNS network and the pod itself were not created by the "kube-flannel-ds-windows" pod.
Btw: Manually copying the subnet.env was not required in my case, since it was created automatically. (Different from what the guide proposes.)
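
(For context, C:\run\flannel\subnet.env is a small env-style file written by flannel. A sketch of its typical shape, with placeholder values chosen to roughly match the 10.244.x.x pod addresses shown further down; the exact values on a given node will differ:)

FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.6.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true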


However, pod networking still doesn't seem to be fully functional:
The guide for scheduling Windows containers in Kubernetes (from the Kubernetes website) lists several verification steps after deploying a basic webserver application.

It seems my Windows pods don't have outbound connectivity and are only reachable from the control plane node.

$ kubectl get service
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP        2d20h
win-webserver   NodePort    10.96.173.206   <none>        80:30040/TCP   32m

$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE                            NOMINATED NODE   READINESS GATES
win-webserver-585f6c9dc6-5f4xn   1/1     Running   0          40m    10.244.6.2    win-server2022           <none>           <none>
win-webserver-585f6c9dc6-bmfls   1/1     Running   0          40m    10.244.6.3    win-server2022           <none>           <none>

$ kubectl get nodes -o wide
NAME                            STATUS   ROLES           AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                    KERNEL-VERSION      CONTAINER-RUNTIME
kube-controlplane               Ready    control-plane   2d21h   v1.26.0   192.168.0.39   <none>        Ubuntu 22.04.1 LTS                          5.15.0-56-generic   containerd://1.6.14
win-server2022                  Ready    <none>          73m     v1.26.0   192.168.0.20   <none>        Windows Server 2022 Datacenter Evaluation   10.0.20348.1487     containerd://1.6.8

According to the guide:

  • Node-to-pod:
    • curl 10.244.6.2 -> fail
  • Pod-to-Pod:
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- curl 10.244.6.3 -> success
  • Service-to-pod:
    • curl 10.96.173.206 -> fail (is Cluster-IP = Virtual-Service IP?)
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- curl 10.96.173.206 -> fail
  • Service discovery:
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- nslookup kubernetes.default -> fail
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- nslookup win-webserver.default -> fail
  • Inbound connectivity:
    • curl 192.168.0.20:30040 -> used to work, but after rejoining the cluster it doesn't anymore
  • Outbound connectivity:
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- curl 142.251.36.238 -> success
    • kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- curl www.google.com -> fail

So I think something might be wrong with DNS, which is strange, because flannel should take care of this.
Did you face this issue anywhere already? Also, what am I missing about the service-to-pod connection?
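
(A few checks for the DNS suspicion above; a sketch, not part of the original comment, reusing the pod name from the listing above:)

# Is CoreDNS running, and what is the cluster DNS service IP?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system get svc kube-dns
# Compare that IP with the DNS server the Windows pod actually received:
kubectl exec pods/win-webserver-585f6c9dc6-5f4xn -- ipconfig /all
# If the addresses differ, or the DNS service IP is unreachable from the pod, the Windows
# kube-proxy (which programs the service VIPs) is a more likely culprit than flannel itself.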

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2023
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2023
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 10, 2023
@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
