
Kubernetes 1.12 and flannel does not work out of the box #1044

Closed · outcoldman opened this issue Sep 28, 2018 · 32 comments

@outcoldman (Contributor)

Seems like new behavior with kubeadm: after I created a master, I see two taints on the master node:

Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
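
(For reference, the taints can be inspected with something like the following; master1 is the node name from the environment section below:)

kubectl describe node master1 | grep -A1 Taints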

But https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml has a toleration only for

      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

I added a toleration to kube-flannel.yml to solve the issue:

      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoSchedule
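
A minimal sketch of the full workaround, assuming the manifest is edited locally before applying:

curl -sLO https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml
# add the node.kubernetes.io/not-ready toleration shown above, then:
kubectl apply -f kube-flannel.yml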

Expected Behavior

The docs should work with flannel out of the box
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/

Current Behavior

The flannel pods are never scheduled: they do not tolerate the node.kubernetes.io/not-ready:NoSchedule taint, and the node cannot become Ready without a working CNI plugin, so the cluster deadlocks.
Possible Solution

Maybe instead it should use a toleration without a key? With operator: Exists and no key, the toleration matches every NoSchedule taint, so flannel would be scheduled regardless of which taints kubeadm adds:

tolerations:
- effect: NoSchedule
  operator: Exists

Steps to Reproduce (for bugs)

  1. Bootstrap master node with kubeadm
  2. Apply https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml as suggested in the docs.

Context

Your Environment

  • Flannel version: v0.10.0
  • Backend used (e.g. vxlan or udp):
  • Etcd version:
  • Kubernetes version (if used): 1.12
  • Operating System and version: Linux master1 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux, "Ubuntu 16.04.5 LTS"
  • Link to your project (optional):
@geerlingguy

I can confirm as well—on 1.11.3 the configuration applies correctly. On 1.12.0 it does not.

@OnnoSAP commented Sep 28, 2018

Using the toleration without a key worked for me. Would this be the solution?

@caseydavenport

Using the toleration without a key worked for me. Would this be the solution?

That sounds fine to me - flannel should probably tolerate all NoSchedule taints, since it's a critical piece of infrastructure.

Anyone want to submit a PR?

@outcoldman (Contributor, Author)

@caseydavenport I have submitted PR against master https://github.com/coreos/flannel/pull/1045/files

But it would be good to have the same fix on the v0.10.0 tag, considering that a lot of places reference this path: https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml

Considering that this is just a configuration change, maybe make a v0.10.1 release and update the Kubernetes documentation?

@jmyung commented Sep 29, 2018

Thanks @outcoldman, it helps :)

alanpeng added a commit to wise2c-devops/breeze that referenced this issue Sep 30, 2018
schu added a commit to schu/kubedee that referenced this issue Sep 30, 2018
There seems to be an issue and deadlock with Flannel on v1.12 clusters
where Flannel pods don't start on unready nodes and nodes don't become
ready w/o Flannel / container networking.

Issue upstream, albeit with kubeadm:

flannel-io/flannel#1044

Follow up on commit or revert.
@adhipati-blambangan

Thanks @outcoldman! It works like a charm. ;)

@mauilion commented Oct 1, 2018

Flannel should probably set

tolerations:
- operator: Exists

as the default toleration set. This will ensure that the flannel DaemonSet tolerates all taints.
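
A sketch of where this would sit in kube-flannel.yml (the DaemonSet pod template spec); with no key and no effect given, the toleration matches every taint:

spec:
  template:
    spec:
      tolerations:
      - operator: Exists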

@ReSearchITEng commented Oct 5, 2018

For anyone willing to test the flannel fix for 1.12:
kubectl -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

@NerdyShawn

For anyone willing to test the flannel fix for 1.12:
kubeadm -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

# trying on a pi2 b+ master

HypriotOS/armv7: root@piNode01 in ~
$ kubeadm -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

Error: unknown command "apply" for "kubeadm"
Run 'kubeadm --help' for usage.
error: unknown command "apply" for "kubeadm"

@cablespaghetti commented Oct 5, 2018 via email

@rberg2 commented Oct 5, 2018

Hello,
I can confirm this fixes flannel on my 1.12 test cluster. https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

@NerdyShawn

So that got me closer, but still no dice. Here is the Docker output; the apiserver container seems unhealthy. Sorry, I'm struggling with text formatting, so here is a screenshot:

[screenshot: kubectl_apply_1 12]

@cablespaghetti

Hi @NerdyShawn,

I don't think you've got kubectl configured correctly to connect to your cluster. As it seems @rberg2 has managed to get this working, maybe it would be good to continue this on one of the support channels like Slack rather than in this issue.

@ReSearchITEng commented Oct 5, 2018

Sorry, it was a typo, it's kubectl.

For those interested: k8s 1.12 deployment with all the goodies (ingress, dashboard, optional vsphere*, etc.) automated with Ansible is maintained here: github.com/ReSearchITEng/kubeadm-playbook/
The above has been scripted there as well.

@tallaxes commented Oct 6, 2018

@ReSearchITEng, confirmed working (1.12.1).
The link to the Ansible playbook is broken.

@hegdedarsh

Hello,
Even with the tolerations it still fails. I used the below link to run flannel:

https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

Please find the output of the pods:

[user@darshan-p-hegde-89ca8c531 ~]$ kubectl get pods -n kube-system
NAME                                                                 READY   STATUS              RESTARTS   AGE
coredns-576cbf47c7-9r27x                                             0/1     ContainerCreating   0          6m
coredns-576cbf47c7-qc4tm                                             0/1     ContainerCreating   0          6m
etcd-darshan-p-hegde-89ca8c531.mylabserver.com                       1/1     Running             0          4m54s
kube-apiserver-darshan-p-hegde-89ca8c531.mylabserver.com             1/1     Running             0          5m2s
kube-controller-manager-darshan-p-hegde-89ca8c531.mylabserver.com    1/1     Running             0          5m2s
kube-flannel-ds-amd64-gm5z7                                          0/1     CrashLoopBackOff    5          4m56s
kube-proxy-mbtcj                                                     1/1     Running             0          6m
kube-scheduler-darshan-p-hegde-89ca8c531.mylabserver.com             1/1     Running             0          5m13s

I have described the flannel pod and the output is below:

Name:               kube-flannel-ds-amd64-gm5z7
Namespace:          kube-system
Priority:           0
PriorityClassName:
Node:               darshan-p-hegde-89ca8c531.mylabserver.com/172.31.42.12
Start Time:         Sun, 07 Oct 2018 06:37:31 +0000
Labels:             app=flannel
                    controller-revision-hash=6697bf5fc6
                    pod-template-generation=1
                    tier=node
Annotations:
Status:             Running
IP:                 172.31.42.12
Controlled By:      DaemonSet/kube-flannel-ds-amd64
Init Containers:
  install-cni:
    Container ID:  docker://b085e4a7d80b26730dc795d4a72b8a278ddc4ba71e5c463bfcd0172b793de349
    Image:         quay.io/coreos/flannel:v0.10.0-amd64
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:88f2b4d96fae34bfff3d46293f7f18d1f9f3ca026b4a4d288f28347fcb6580ac
    Port:
    Host Port:
    Command:
      cp
    Args:
      -f
      /etc/kube-flannel/cni-conf.json
      /etc/cni/net.d/10-flannel.conflist
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 07 Oct 2018 06:37:33 +0000
      Finished:     Sun, 07 Oct 2018 06:37:33 +0000
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /etc/cni/net.d from cni (rw)
      /etc/kube-flannel/ from flannel-cfg (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-llwn4 (ro)
Containers:
  kube-flannel:
    Container ID:  docker://a8096a56009a0566b53e4b0aac09430b75120979e63dbe32eb8ed91053666a77
    Image:         quay.io/coreos/flannel:v0.10.0-amd64
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:88f2b4d96fae34bfff3d46293f7f18d1f9f3ca026b4a4d288f28347fcb6580ac
    Port:
    Host Port:
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 07 Oct 2018 06:43:46 +0000
      Finished:     Sun, 07 Oct 2018 06:43:48 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-flannel-ds-amd64-gm5z7 (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-llwn4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:
  cni:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-flannel-cfg
    Optional:  false
  flannel-token-llwn4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flannel-token-llwn4
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/arch=amd64
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age                    From                                                Message
  ----     ------     ----                   ----                                                -------
  Normal   Scheduled  6m57s                  default-scheduler                                   Successfully assigned kube-system/kube-flannel-ds-amd64-gm5z7 to darshan-p-hegde-89ca8c531.mylabserver.com
  Normal   Pulling    6m57s                  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  pulling image "quay.io/coreos/flannel:v0.10.0-amd64"
  Normal   Pulled     6m55s                  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Successfully pulled image "quay.io/coreos/flannel:v0.10.0-amd64"
  Normal   Created    6m55s                  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Created container
  Normal   Started    6m55s                  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Started container
  Normal   Started    6m5s (x4 over 6m53s)   kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Started container
  Normal   Pulled     5m11s (x5 over 6m54s)  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Container image "quay.io/coreos/flannel:v0.10.0-amd64" already present on machine
  Normal   Created    5m11s (x5 over 6m53s)  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Created container
  Warning  BackOff    105s (x23 over 6m48s)  kubelet, darshan-p-hegde-89ca8c531.mylabserver.com  Back-off restarting failed container

Please find the output of the coredns pods:

Warning FailedCreatePodSandBox 7m50s kubelet, darshan-p-hegde-89ca8c531.mylabserver.com Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5f6770d9dfcb53738a0dd428b86e815d4d85e9b71a76d17b10b1f764f102fb61" network for pod "coredns-576cbf47c7-9r27x": NetworkPlugin cni failed to set up pod "coredns-576cbf47c7-9r27x_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 7m49s kubelet, darshan-p-hegde-89ca8c531.mylabserver.com Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "009e9e0099f993086300649a89995a28a0fdf1a128863f7a71e3ff1973788c26" network for pod "coredns-576cbf47c7-9r27x": NetworkPlugin cni failed to set up pod "coredns-576cbf47c7-9r27x_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 7m48s kubelet, darshan-p-hegde-89ca8c531.mylabserver.com Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ea0ddaf5c411dd026cfd23366e49424526b7cc547652ca262a346f4c800f0c04" network for pod "coredns-576cbf47c7-9r27x": NetworkPlugin cni failed to set up pod "coredns-576cbf47c7-9r27x_kube-system" network: open /run/flannel/subnet.env: no such file or directory

@outcoldman (Contributor, Author)

@hegdedarsh it's possibly a different problem, but I would suggest using the released version https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml, modifying the tolerations, and giving it a try.

@pkeuter commented Oct 11, 2018

This fixes the issue for me. Thanks for the PR!

@telecodani

Adding the toleration in the Flannel yaml works for me also. Tested on v1.12.1 Kubernetes. Thanks.

@benn0r commented Oct 11, 2018

I am using the YAML file recommended in this issue. But for me, NodePort and externalIPs no longer work unless the request comes from the same node the pods are located on. If I try to telnet via the master IP, I get a timeout.
This started with the upgrade to Kubernetes 1.12.

Is this a problem with flannel?

@sarlacpit commented Oct 11, 2018

I am on a fresh install of k8s 1.12 and have just tried downloading v0.10.0; the tolerations seem to exist already. So I applied the YAML:

      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

It tried creating the flannel pod, but it came up with 'Error' and eventually 'CrashLoopBackOff'.
I'm still very new to k8s; let me know if there is any debug output I can provide.

@bitva77 commented Nov 29, 2018

Just here to say that using https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml with the tolerations set as below works on Kubernetes 1.12.3 with a kubeadm install:

      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoSchedule

@sa9226 commented Dec 11, 2018

Thanks, it worked for me after applying the above changes to the flannel config on v1.12.3.

@vmendi commented Jan 26, 2019

There hasn't been a release of flannel for a year and we need to upgrade to Kubernetes 1.12.

Are there plans to have a new release anytime soon? If not, it's not a problem, we can always branch and fix it ourselves.

Thanks

@rajatchopra (Contributor)

There hasn't been a release of flannel for a year and we need to upgrade to Kubernetes 1.12.

Are there plans to have a new release anytime soon? If not, it's not a problem, we can always branch and fix it ourselves.

Thanks

There is a release planned soon. Can we have a PR that updates kube-flannel.yml with the correct tolerations?

@vmendi commented Jan 26, 2019

Thanks!

Wasn't it fixed here? 13a990b

@cablespaghetti commented Jan 26, 2019 via email

@vmendi commented Jan 30, 2019

I can confirm that with the latest release, v0.11.0, flannel works with Kubernetes 1.12.5 out of the box :)

@dlipovetsky

Thanks!

Wasn't it fixed here? 13a990b

Yes, although you must know the commit to fetch the fixed manifest. Typically, I obtain the manifest by using the tag, e.g. for v0.10.0, I use

https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml

Of course, the manifest does not include the fix, since it is the manifest that existed when v0.10.0 was released.

I humbly ask the maintainers to consider making fixes like this easier to find. 🙂

(In my experience, a common way to make such fixes easy to find is to cherry-pick them to a release branch. I realize the flannel repo does not use release branches. I don't have insight into why that's the case.)

@dlipovetsky

For anyone who wants to patch the v0.10.0 DaemonSet to tolerate all taints with the NoSchedule effect:

kubectl -nkube-system patch ds kube-flannel-ds --patch='{"spec":{"template":{"spec":{"tolerations":[{"effect":"NoSchedule","operator":"Exists"}]}}}}'
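
To verify the patch landed, something like the following should print the new toleration list (note that a strategic merge patch replaces the tolerations array rather than appending to it):

kubectl -n kube-system get ds kube-flannel-ds -o jsonpath='{.spec.template.spec.tolerations}'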

@willemm commented Jul 10, 2019

I strongly disagree that flannel should tolerate all taints, because there are nodes it certainly should not run on (e.g. Windows nodes).
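
(One possible middle ground, sketched here rather than taken from any shipped manifest: keep the blanket toleration but pin the DaemonSet to Linux nodes with a node selector; on 1.12-era clusters the label is beta.kubernetes.io/os, kubernetes.io/os on newer ones:)

spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      tolerations:
      - operator: Exists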

stale bot commented Jan 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 26, 2023
stale bot closed this as completed Feb 16, 2023