
overlay network cannot be applied when host is behind a proxy #136

Closed
senthilrch opened this issue Nov 26, 2018 · 53 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

@senthilrch

Environment

Host OS: RHEL 7.4
Host Docker version: 18.09.0
Host go version: go1.11.2
Node Image: kindest/node:v1.12.2

kind create cluster

[root@localhost bin]# kind create cluster
Creating cluster 'kind-1' ...
 ✓ Ensuring node image (kindest/node:v1.12.2) 🖼
 ✓ [kind-1-control-plane] Creating node container 📦
 ✓ [kind-1-control-plane] Fixing mounts 🗻
 ✓ [kind-1-control-plane] Starting systemd 🖥
 ✓ [kind-1-control-plane] Waiting for docker to be ready 🐋
 ✗ [kind-1-control-plane] Starting Kubernetes (this may take a minute) ☸
FATA[07:20:43] Failed to create cluster: failed to apply overlay network: exit status 1

The code below in pkg/cluster/context.go tries to extract the Kubernetes version using the kubectl version command in order to download the version-specific Weave net.yaml. The code is not working:

        // TODO(bentheelder): support other overlay networks
        if err = node.Command(
                "/bin/sh", "-c",
                `kubectl apply --kubeconfig=/etc/kubernetes/admin.conf -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version --kubeconfig=/etc/kubernetes/admin.conf | base64 | tr -d '\n')"`,
        ).Run(); err != nil {
                return kubeadmConfig, errors.Wrap(err, "failed to apply overlay network")
        }

Why is the output of the kubectl version command base64-encoded?

@alejandrox1
Contributor

Hi @senthilrch, thank you for filing this issue.
Could you run kind create cluster --loglevel debug and put the output of this command in a gist ?

As for your question about installing Weave Net, you can read more about it here.

@BenTheElder
Member

Yep, as @alejandrox1 noted, the base64 encoding is from their guide. The reason for this is to pass it as an HTTP query parameter to weave so that their site can serve the appropriate weave version based on your Kubernetes version.

In the future we might use fixed weave versions, but this is the correct and normal way to install it per their upstream documentation.

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

It would be this verbatim, but we need to specify the admin kubeconfig location.
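To make the encoding step concrete, here is an illustrative sketch (the version strings are made up): the multi-line, quote-laden `kubectl version` output cannot be embedded in a URL as-is, so it is flattened into a single URL-safe token.

```shell
# Illustrative sketch: why the weave URL base64-encodes `kubectl version`.
# The raw output spans multiple lines and contains spaces, quotes, and
# braces, none of which can go into a URL query parameter directly.
# base64 maps it to a single safe token; tr -d '\n' strips the line
# wraps that base64 inserts into long output.
version_output='Client Version: version.Info{Major:"1", Minor:"12"}
Server Version: version.Info{Major:"1", Minor:"12"}'
token="$(printf '%s' "$version_output" | base64 | tr -d '\n')"
url="https://cloud.weave.works/k8s/net?k8s-version=${token}"
echo "$url"
```

Weave's server then decodes the token and serves a manifest matched to that Kubernetes version.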


Regarding the failure, is this happening reliably, or did it just happen once?

@senthilrch
Author

senthilrch commented Nov 27, 2018 via email

@senthilrch
Author

https://gist.github.com/senthilrch/70eb56cfeee38e311c13f6898791121a

The host in which I am creating the kind cluster is behind a proxy. Perhaps that's the reason it fails. Will kind honor http_proxy and https_proxy env variables set on the host?

@BenTheElder
Member

Ah, that's almost definitely it!

kind does nothing special regarding proxies; the rest of the bringup only works because everything else (besides the overlay network config and its images) is pre-packed into the node image and doesn't need to go out to the internet.

We can either try to get these packed into the image ahead of time (which is probably quite doable, and possibly desirable, but maybe a little tricky), or we can try to make this step respect proxy information on the host machine.

It looks like http_proxy and HTTPS_PROXY are mostly a convention that curl and a few other tools happen to follow to varying degrees; we'd probably need to configure the docker daemon on the "nodes" to respect this as well.

Both approaches are probably worth doing. I'll update this issue to track.
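A rough sketch of the second approach (not kind's actual code; the function and variable names here are invented for illustration): read whichever casing of the proxy variables the host has set, then hand the values to the node container.

```shell
# Sketch only: resolve the host's proxy settings, preferring the
# uppercase spelling over the lowercase one.
resolve_proxy() {
  # $1 = uppercase var name, $2 = lowercase var name
  upper="$(eval "printf '%s' \"\${$1}\"")"
  lower="$(eval "printf '%s' \"\${$2}\"")"
  printf '%s' "${upper:-$lower}"
}

http_proxy_val="$(resolve_proxy HTTP_PROXY http_proxy)"
https_proxy_val="$(resolve_proxy HTTPS_PROXY https_proxy)"

# The values would then be injected at node creation time, e.g.:
#   docker run -e HTTP_PROXY="$http_proxy_val" -e HTTPS_PROXY="$https_proxy_val" ...
```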

@BenTheElder BenTheElder changed the title FATA[07:20:43] Failed to create cluster: failed to apply overlay network: exit status 1 overlay network cannot be applied when host is behind a proxy Nov 27, 2018
@BenTheElder
Member

/kind bug
/priority important-soon

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 27, 2018
@metalmatze

Within the last 1-2 weeks Kind broke for me with the same error (I believe).

 ✓ [control-plane] Creating the kubeadm config file ⛵ 
DEBU[16:26:27] Running: /usr/bin/docker [docker exec --privileged kind-1-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf] 
DEBU[16:26:52] Running: /usr/bin/docker [docker exec --privileged -t kind-1-control-plane cat /etc/kubernetes/admin.conf] 
DEBU[16:26:53] Running: /usr/bin/docker [docker exec --privileged kind-1-control-plane /bin/sh -c kubectl apply --kubeconfig=/etc/kubernetes/admin.conf -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version --kubeconfig=/etc/kubernetes/admin.conf | base64 | tr -d '\n')"] 
ERRO[16:28:25] failed to apply overlay network: exit status 1 ) ☸ 
 ✗ [control-plane] Starting Kubernetes (this may take a minute) ☸
ERRO[16:28:25] failed to apply overlay network: exit status 1 
DEBU[16:28:25] Running: /usr/bin/docker [docker ps -q -a --no-trunc --filter label=io.k8s.sigs.kind.cluster --format {{.Names}}\t{{.Label "io.k8s.sigs.kind.cluster"}} --filter label=io.k8s.sigs.kind.cluster=1] 
DEBU[16:28:25] Running: /usr/bin/docker [docker rm -f -v kind-1-control-plane] 
⠈⠁ [control-plane] Pre-loading images 🐋 Error: failed to create cluster: failed to apply overlay network: exit status 1

I didn't change anything on my system and simply do a git pull origin master && go install every now and then. I'm running on ArchLinux if that's helpful information.

@BenTheElder
Member

@metalmatze is it possible that you're behind a proxy as well? we've not fixed that yet.

@metalmatze

I don't think so.
At home I have the same issues, also running Arch but an entirely different system.

@BenTheElder
Member

hmm. I don't think we've made any functional changes to this step in that time frame. FWIW making this step not depend on the internet is very high on my todo 😕

other known issues I've seen that can cause similar problems:

  • low disk / memory (kubelet will evict workloads, even core components if these get too low)
  • btrfs underlying /var/lib/docker causes issues

@metalmatze

Pulling the latest master now fixed KinD for me again. I'm not entirely sure what happened; I can't see any changes related to my problem. I'm on the same machine and the same WiFi as when I first reported this. Additionally, my machine was suspended most of the weekend and I didn't run any updates during that time (like updating Docker, for example).
302bb7d...4a348e0

@BenTheElder
Member

BenTheElder commented Jan 14, 2019 via email

@alexmt

alexmt commented Jan 23, 2019

I'm facing the same issue. In my case apply overlay network fails because cloud.weave.works is not resolvable from kind-1-control-plane container. Any help would be very appreciated.

@alexmt

alexmt commented Jan 24, 2019

Upgraded docker from 18.09 to 18.09.1 and problem went away 🎉.

@BenTheElder
Member

Huh, I wonder if there was a regression in docker somehow. What docker distribution are you using?

@metalmatze

Interesting. For me it works since 10 days ago and I just checked that I'm on Docker 18.09.1 as well. I should have checked the version when it didn't work.
FWIW I looked at the Arch packages for Docker and the timeline of their releases pretty much adds up with that suspicion!

18.09.1 was pushed on Jan 10th:
https://git.archlinux.org/svntogit/community.git/commit/?h=packages/docker&id=0b11ffde10bf10ab1b08a459c12927ff02abf6d3

@alexmt

alexmt commented Jan 24, 2019

@BenTheElder, I was running the dind container docker:18.09-dind in Kubernetes. After I changed the image to docker:18.09.1-dind, the issue was resolved.

@BenTheElder
Member

Thanks for confirming, I'm going to file another issue to create a "known issues" section in our docs and highlight this as one of the first ones!

@neolit123
Member

It looks like http_proxy and HTTPS_PROXY are mostly a convention that curl and a few others happen to follow to varying degrees, we'd probably need to also set the docker daemon on the "nodes" to respect this as well.

+1
yes, i think given the containerization, passing the http(s)_proxy env. vars to the kind nodes might be necessary.

adding the option to pre-bake the overlay network and also provide air-gapped support will help users that don't want their kind cluster to talk to the internet. for the rest we might have to still expose the proxy env vars.

@neolit123
Member

I was running the dind container docker:18.09-dind in Kubernetes. After I changed the image to docker:18.09.1-dind, the issue was resolved.

i wonder what was fixed.

@BenTheElder
Member

so docker itself supports HTTP_PROXY / HTTPS_PROXY https://docs.docker.com/network/proxy/ 🤔
we could just blindly pass through these values from the host at node creation time... 🤔
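For reference, the client-side piece of that Docker feature is a `proxies` section in the docker client config; a sketch of its shape (the proxy address is a placeholder, and a temp directory stands in for `~/.docker` so nothing real is overwritten):

```shell
# The docker client injects these as env vars into every container it
# creates (see https://docs.docker.com/network/proxy/). Normally this
# file lives at ~/.docker/config.json; proxy.example.com is a placeholder.
conf_dir="$(mktemp -d)"
cat > "$conf_dir/config.json" <<'EOF'
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.example.com:3128",
      "httpsProxy": "http://proxy.example.com:3128",
      "noProxy": "localhost,127.0.0.1"
    }
  }
}
EOF
```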

@neolit123
Member

it makes sense, especially if it's a fix.

@matthyx

matthyx commented Feb 4, 2019

@BenTheElder I really need this as my company has a corporate proxy... will you be working on it, or should I jump in?
👍

@floreks

floreks commented Feb 7, 2019

@BenTheElder Security context is set to allow privileged execution. I am using official docker:dind as a base image and docker itself is running in the container. I did not have to mount anything when running it locally and it was working correctly. Only when running in a k8s environment there is an issue.

Here is my test yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: test-floreks
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  namespace: test-floreks
  name: dind
spec:
  selector:
    matchLabels:
      app: dind
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: dind
    spec:
      containers:
      - name: dind
        image: floreks/dind-with-kind:v1.0.0
        securityContext:
          privileged: true

@BenTheElder
Member

so our actual podspec is ~ the contents of the pod_spec field in this prowjob (a few things get added for git checkout, environment variables...):

apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    prow.k8s.io/job: ci-kubernetes-kind-conformance
  creationTimestamp: null
  labels:
    created-by-prow: "true"
    preset-bazel-remote-cache-enabled: "true"
    preset-bazel-scratch-dir: "true"
    preset-dind-enabled: "true"
    preset-service-account: "true"
    prow.k8s.io/id: bc7c7a72-2b06-11e9-8fd7-0a580a6c037c
    prow.k8s.io/job: ci-kubernetes-kind-conformance
    prow.k8s.io/type: periodic
  name: f8f7ed86-2b0d-11e9-bfc2-0a580a6c0297
spec:
  agent: kubernetes
  cluster: default
  job: ci-kubernetes-kind-conformance
  namespace: test-pods
  pod_spec:
    containers:
    - args:
      - --job=$(JOB_NAME)
      - --root=/go/src
      - --repo=k8s.io/kubernetes=master
      - --repo=sigs.k8s.io/kind=master
      - --service-account=/etc/service-account/service-account.json
      - --upload=gs://kubernetes-jenkins/logs
      - --scenario=execute
      - --
      - ./../../sigs.k8s.io/kind/hack/ci/e2e.sh
      env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: E2E_GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: TEST_TMPDIR
        value: /bazel-scratch/.cache/bazel
      - name: BAZEL_REMOTE_CACHE_ENABLED
        value: "true"
      - name: DOCKER_IN_DOCKER_ENABLED
        value: "true"
      image: gcr.io/k8s-testimages/kubekins-e2e:v20190205-d83780367-master
      name: ""
      resources:
        requests:
          cpu: "2"
          memory: 9000Mi
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /lib/modules
        name: modules
        readOnly: true
      - mountPath: /sys/fs/cgroup
        name: cgroup
      - mountPath: /etc/service-account
        name: service
        readOnly: true
      - mountPath: /bazel-scratch/.cache
        name: bazel-scratch
      - mountPath: /docker-graph
        name: docker-graph
    dnsConfig:
      options:
      - name: ndots
        value: "1"
    volumes:
    - hostPath:
        path: /lib/modules
        type: Directory
      name: modules
    - hostPath:
        path: /sys/fs/cgroup
        type: Directory
      name: cgroup
    - name: service
      secret:
        secretName: service-account
    - emptyDir: {}
      name: bazel-scratch
    - emptyDir: {}
      name: docker-graph
  type: periodic
status:
  startTime: "2019-02-07T19:24:22Z"
  state: triggered

@BenTheElder
Member

#275 just merged to pass through HTTPS_PROXY and HTTP_PROXY from the host to the nodes, thanks @pablochacin!

We should be getting the 0.2 release soon with this change, but right now you can obtain it by building from the current master branch sources.

Hopefully this should resolve the issue. I am finalizing the design for handling CNIs as well, and plan to bring it up at the next meeting.

We've additionally uncovered #284 which may affect some configurations.

@matthyx

matthyx commented Feb 11, 2019

Thanks @BenTheElder I now have a new issue:
Error: failed to create cluster: failed to init node with kubeadm: exit status 1

You can find the full debug log here.

@pablochacin
Contributor

@matthyx, from the log I see that the proxy has been set to http://127.0.0.1:3129/. This is localhost on the host machine, but inside the kind node container this address is the container's own loopback (not the host's). Therefore, you should set your proxy to an address which is reachable from the kind node container.
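A quick way to see the distinction (an illustrative helper, not part of kind): classify proxy URLs by whether they point at loopback, which from inside a container is the container itself.

```shell
# Illustrative helper: a loopback proxy URL can never work from inside a
# node container, because 127.0.0.1/localhost there is the container's
# own loopback, not the host's.
check_proxy_url() {
  case "$1" in
    *://127.0.0.1:*|*://127.0.0.1/*|*://localhost:*|*://localhost/*)
      echo "unreachable-from-container" ;;
    *)
      echo "may-be-reachable" ;;
  esac
}

check_proxy_url "http://127.0.0.1:3129/"   # the failing setting from the log
check_proxy_url "http://172.17.0.1:3129/"  # e.g. the docker bridge gateway
```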

@matthyx

matthyx commented Feb 11, 2019

Ok, I feel so stupid indeed... so after setting the proxy to something reachable from containers, I get the following (full log here):
ERRO[11:30:05] failed to apply overlay network: exit status 1

@BenTheElder BenTheElder added this to the 1.0 milestone Feb 11, 2019
@pablochacin
Contributor

@matthyx, you set the proxy to 172.17.0.1, which is the address of the docker bridge on your host machine, so I guess you are running your proxy locally on the host. What I've found is that by default on my machine the firewall is set so that no traffic is allowed from inside docker containers to the host. I had to turn off the firewall to make it work. I'm pretty sure this is the default behavior of docker.

My suggestion is that you either test it disabling your firewall (at your own risk ;-) ) or try with a proxy running on a public address.

@matthyx

matthyx commented Feb 12, 2019

@pablochacin thanks for the suggestion, I have just checked and my proxy works from inside docker, as confirmed by a small Dockerfile like:

FROM ubuntu
ENV http_proxy=http://172.17.0.1:3129/
RUN apt update

@pablochacin
Contributor

I'm not following here, @matthyx. This is a Dockerfile, right? It is applied at build time, while the issue you have is at run time; I'm not sure the two situations are comparable. What I suggest is to start a container with ubuntu and, from inside the container, try an update:

$> docker run -ti --rm ubuntu bash 
root@b85870b5faa1:/# export http_proxy=http://172.17.0.1:3129/
root@b85870b5faa1:# apt-get update

@matthyx

matthyx commented Feb 12, 2019

Yes, this works:

$ docker run -ti --rm ubuntu bash
root@29cd8a005505:/# export http_proxy=http://172.17.0.1:3129/
root@29cd8a005505:/# apt-get update
Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]      
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [339 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages [186 kB]
Get:7 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [152 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [11.3 MB]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [3451 B]
Get:10 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]   
Get:11 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [679 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6955 B]
Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [10.7 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [932 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [3650 B]
Fetched 15.5 MB in 1s (10.8 MB/s)                          
Reading package lists... Done

However, I have reached out to IT, and it seems our corporate proxy (which requires a local cntlm for AD authentication) uses an old protocol for the man-in-the-middle... and for this reason we cannot upgrade our Docker past 18.06.1-ce
Do you think we could be hitting the same issue here?

@matthyx

matthyx commented Feb 13, 2019

Some updates on this. I have the privilege to work with extremely bright people here, and the problem seems to lie in TLS negotiation (although not 1.3): our proxy policy hasn't been updated in a while, and none of the algorithms proposed by the Go TLS client are supported atm...

We're working with network and security to update this policy, and I will keep you posted if that solves our problem!

@hjacobs

hjacobs commented Feb 17, 2019

Just to confirm the problem from my side persists even after Docker upgrade (I don't have any HTTP proxy):

I get this error with Docker 18.06.1 from the official Ubuntu 18.04 LTS repository:

kind create cluster --image=kindest/node:v1.13.3@sha256:d1af504f20f3450ccb7aed63b67ec61c156f9ed3e8b0d973b3dee3c95991753c --retain
Creating cluster 'kind-1' ...
 ✓ Ensuring node image (kindest/node:v1.13.3) 🖼
 ✓ [control-plane] Creating node container 📦 
 ✓ [control-plane] Fixing mounts 🗻 
 ✓ [control-plane] Starting systemd 🖥 
 ✓ [control-plane] Waiting for docker to be ready 🐋 
 ✓ [control-plane] Pre-loading images 🐋 
 ✓ [control-plane] Creating the kubeadm config file ⛵ 
ERRO[11:41:36] failed to apply overlay network: exit status 1 ) ☸ 
 ✗ [control-plane] Starting Kubernetes (this may take a minute) ☸
ERRO[11:41:36] failed to apply overlay network: exit status 1 
Error: failed to create cluster: failed to apply overlay network: exit status 1

The problem persists for me after upgrading to docker-ce 18.09.23-0ubuntu-bionic (I followed the Docker CE instructions):

Creating cluster 'kind-1' ...
 ✓ Ensuring node image (kindest/node:v1.13.3) 🖼
 ✓ [control-plane] Creating node container 📦 
 ✓ [control-plane] Fixing mounts 🗻 
 ✓ [control-plane] Starting systemd 🖥 
 ✓ [control-plane] Waiting for docker to be ready 🐋 
 ✓ [control-plane] Pre-loading images 🐋 
 ✓ [control-plane] Creating the kubeadm config file ⛵ 
ERRO[11:55:19] failed to add default storage class: exit status 1 
 ✗ [control-plane] Starting Kubernetes (this may take a minute) ☸
ERRO[11:55:19] failed to add default storage class: exit status 1 
Error: failed to create cluster: failed to add default storage class: exit status 1

curl for the overlay network install works for me (but kubectl version fails as API server is already down):

root@kind-1-control-plane:/# curl --location https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version --kubeconfig=/etc/kubernetes/admin.conf | base64 | tr -d '\n')
The connection to the server 172.17.0.2:6443 was refused - did you specify the right host or port?
apiVersion: v1
kind: List
items:
... (cut for brevity)

@BenTheElder
Member

BenTheElder commented Feb 18, 2019

@hjacobs, can you update with go get -u sigs.k8s.io/kind? The cluster name suggests that you're on an old version (or one of the previous releases). I suspect your API server is getting evicted, which we've patched around in #293.

@BenTheElder
Member

the next release will contain this fix, but in the meantime it can be installed from the current source 😬

@BenTheElder
Member

This should actually be fixed now; additionally, new node images do not require pulling the overlay image at all.

@BenTheElder
Member

#322
And
#331

@matthyx

matthyx commented Feb 23, 2019

I will test on Monday since I don't have our corporate proxy at home... thanks for the update!

@matthyx

matthyx commented Feb 25, 2019

@BenTheElder doesn't seem to work better... I did go get -u sigs.k8s.io/kind to update to latest, and then kind create cluster --loglevel debug which resulted in the same failure.

You can read the debug logs here.

@BenTheElder
Member

hey @matthyx, can you run with kind create cluster --retain --loglevel debug and then run kind export logs after?

I suspect this is something else in your environment; with the latest source, zero internet connectivity should be required after pulling the "node" image (which I and one other user have been able to verify).

@matthyx

matthyx commented Feb 25, 2019

hey @matthyx, can you run with kind create cluster --retain --loglevel debug and then run kind export logs after?

I suspect this is something else in your environment; with the latest source, zero internet connectivity should be required after pulling the "node" image (which I and one other user have been able to verify).

Should I open another issue once I have the logs?

@BenTheElder
Member

that would be good, thanks!

@matthyx

matthyx commented Feb 25, 2019

I think this is good now... looking at the logs before sending them, I have noticed that:

I0225 07:49:12.803064     726 checks.go:430] validating if the connectivity type is via proxy or direct
	[WARNING HTTPProxy]: Connection to "https://172.17.0.3" uses proxy "http://127.0.0.1:3129/". If that is not intended, adjust your proxy settings
I0225 07:49:12.803104     726 checks.go:466] validating http connectivity to first IP address in the CIDR
	[WARNING HTTPProxyCIDR]: connection to "10.96.0.0/12" uses proxy "http://127.0.0.1:3129/". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration

And so I decided to give it a try by unsetting all my *_proxy env variables, and suddenly it worked! I can finally enjoy kind on my pro workstation.
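For anyone who does need the proxy for outbound traffic, the kubeadm warnings above hint at a middle ground: keep the proxy but exclude the cluster-internal ranges. A sketch (the service CIDR comes from the warning output; the bridge subnet is the docker default and may differ on your machine, and whether kind forwards NO_PROXY into the nodes depends on the kind version):

```shell
# Keep proxying internet traffic, but bypass the proxy for loopback, the
# docker bridge subnet (docker default 172.17.0.0/16), and the service
# CIDR (10.96.0.0/12 per the kubeadm warning above).
export NO_PROXY="127.0.0.1,localhost,172.17.0.0/16,10.96.0.0/12"
export no_proxy="$NO_PROXY"   # some tools only read the lowercase form
```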

Thanks a lot @pablochacin and @BenTheElder !

@akutz
Contributor

akutz commented Mar 1, 2019

Hi @BenTheElder,

I'm also interested in helping with this issue as it relates to support for air gapped testing. I'm currently in-flight back to Austin, and I thought I'd get some Kind-based dev-work done. However, without a good internet connection things are just not working. I finally got past the above error, but now, due to a flaky network, the node is never ready due to the inability to initialize CNI.

@BenTheElder
Member

hey @akutz -- on the latest code in master airgapped clusters should work, the CNI does not need to be pulled, is it possible you're using an older version?

stg-0 added a commit to stg-0/kind that referenced this issue Jun 19, 2023
[0.17.0-0.1] DOC Restructuration and review