kubespray 2.18.0 calico fails without local-loadbalancer #8864

Closed
Talangor opened this issue May 24, 2022 · 16 comments
Labels
kind/bug, lifecycle/rotten

Comments

@Talangor

Talangor commented May 24, 2022

Hi guys, and thank you for your hard work.
I previously installed a Kubernetes cluster with kubespray and the Weave CNI without any problem (kubespray 2.18.0).
Since we need BGP functionality, we decided to move to the Calico CNI. For a week I have tried the default configuration as well as the config you see today, and tested Kubernetes 1.23.6 down to 1.22.2, with no success.
I have been searching and found that if I run the localhost load balancer everything works as expected, but I don't want to use a local (nginx, haproxy) load balancer.
Is it mandatory to set use_localhost_as_kubeapi_loadbalancer: true?

Environment:

  • Cloud provider or hardware configuration:
    bare-metal installation

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.4.0-113-generic x86_64
    NAME="Ubuntu"
    VERSION="20.04.4 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.4 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal

  • Version of Ansible (ansible --version):
    ansible [core 2.12.5]
    config file = /home/ubuntu/kubespray-v2.18.1/ansible.cfg
    configured module search path = ['/home/ubuntu/kubespray-v2.18.1/library']
    ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
    ansible collection location = /home/ubuntu/.ansible/collections:/usr/share/ansible/collections
    executable location = /usr/local/bin/ansible
    python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
    jinja version = 2.11.3
    libyaml = True

  • Version of Python (python --version):
    Python 3.8.10

Kubespray version (commit) (git rev-parse --short HEAD):
85bd1ee
2.18.1
Network plugin used:
netplan

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible:
ansible-playbook -i inventory/pre-production/hosts.yaml --become -u sadmin -K cluster.yml

Output of ansible run:

calico kube controller log:

All pods that need Calico to create a network for them fail with the log below:

Thanks in advance for taking the time.

@Talangor Talangor added the kind/bug label May 24, 2022
@cristicalin
Contributor

cristicalin commented May 24, 2022

use_localhost_as_kubeapi_loadbalancer: true is only needed when using Calico with eBPF; if you don't set it, kubespray defaults to false. The sample config specifically states that this setting is there for Cilium, but it is needed for Calico in eBPF mode as well; otherwise it's not needed.

Could you be more precise about the error you are seeing when this is set to false?
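
A minimal sketch of the two settings under discussion, assuming the usual sample-inventory layout (the file paths below are assumptions; the variable names are the ones already used in this thread):

```yaml
# group_vars/k8s_cluster/k8s-net-calico.yml (assumed location)
calico_bpf_enabled: false          # default iptables dataplane

# group_vars/all/all.yml (assumed location)
# Only needed together with calico_bpf_enabled: true (or Cilium's kube-proxy
# replacement); with the iptables dataplane it can stay at its default of false.
# use_localhost_as_kubeapi_loadbalancer: true
```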

@Talangor
Author

Talangor commented May 24, 2022

@cristicalin thanks for the fast response.
I did reinstall with calico_bpf_enabled: false, and the use_localhost_as_kubeapi_loadbalancer line is commented out; I thought you meant disabling BPF.
Should I reinstall with use_localhost_as_kubeapi_loadbalancer: false? As far as I can tell it's disabled by default.
Result: no change, except the API URL changed to the api-service IP address.

Warning FailedCreatePodSandBox 18m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "10a355bb378aa245368c5c9ac05f3f4045e6aeadc8af25167c8a1a808b70782d": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
Warning FailedCreatePodSandBox 15m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b14793930f295689742ca2112f49174c7634ec121283924955e83925fc2d5898": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
Warning FailedCreatePodSandBox 13m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "20e2ec4dd4d96b68f45d37dfe587e76bf3e19341b353aff57bd28545b2b467c4": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5ff44b8fd7abd7ef5c9aedf2f3aad5407b4c1b1b82e8e171dde1319b4620695a": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
Warning FailedCreatePodSandBox 8m16s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f581e717e804dc43e5f6c0c814efe5270227846edc317a546f2d0e847f0d0dcf": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
Warning FailedCreatePodSandBox 30s (x3 over 5m30s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5bf26f8e2ac33cf2ce5c01105745670d6e19ea652bb18e27aaab32b7f8cae70": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable

@Talangor
Author

Talangor commented May 25, 2022

The thing is, the default configuration available in kubespray does this too.
My past experience with kubespray was that I could deploy a Kubernetes cluster with the default yaml files and get it to work, but this time I can't get it to work.
I suspected RBAC or a compatibility issue between the Calico and Kubernetes versions, but since it works with use_localhost_as_kubeapi_loadbalancer I discarded that thought.
Excuse me for my lack of knowledge.
Just a thought: isn't this issue due to kube-proxy and kubeadm refusing to serve the API due to policy?

For more info (clarification):

  • **calico kube controller log:**
    W0525 07:23:09.449630 1 reflector.go:436] pkg/mod/github.com/projectcalico/k8s-client-go@v0.21.9-0.20220104180519-6bd7ec39553f/tools/cache/reflector
    2022-05-25 07:23:09.449 [INFO][1] watchercache.go 97: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/n
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 245: Failed to create watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https:/
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:09.450 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    2022-05-25 07:23:10.446 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:10.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
    2022-05-25 07:23:10.450 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:10.451 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:10.451 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    E0525 07:23:10.801148 1 reflector.go:138] pkg/mod/github.com/projectcalico/k8s-client-go@v0.21.9-0.20220104180519-6bd7ec39553f/tools/cache/reflector
    2022-05-25 07:23:11.448 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/assignment/" error=G
    2022-05-25 07:23:11.454 [INFO][1] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.o
    2022-05-25 07:23:11.708 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.233.0.1:443/
    2022-05-25 07:23:11.708 [ERROR][1] main.go 226: Failed to verify datastore error=Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformat
    2022-05-25 07:23:11.708 [ERROR][1] main.go 257: Failed to reach apiserver error=Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformati
    2022-05-25 07:23:12.449 [WARNING][1] runconfig.go 161: unable to get KubeControllersConfiguration(default) error=Get "https://10.233.0.1:443/apis/crd.proj
    2022-05-25 07:23:12.455 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"

  • **calico pods**
    Warning Unhealthy 33m (x2 over 33m) kubelet Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
    Warning Unhealthy 33m kubelet Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503

  • **other pods**
    Warning FailedCreatePodSandBox 14m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "812cde7dc2338ecec5a205dd437943bba4a6cb21e761ae53d4be1a693acd6814": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable
    Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d771f1ba4b37cb3d5be2ab5b2441be59242facbbda25c5976a271c4359c4e53e": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
    Warning FailedCreatePodSandBox 9m31s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f0c1377a528f09735df62a331928ef62305182398f1468fc1be8fb2a4bc1a781": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
    Warning FailedCreatePodSandBox 106s (x3 over 6m46s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "594499d0afbb43fcdcc18cfb00d3a9eae92572857322bf91f55c2114e1562911": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": Service Unavailable

@Talangor
Author

@cristicalin
I tried to install with Weave and this happened: #8881
What's happening here? Am I so far off? I'm sure you guys have tested the code, but it's really strange.

@Talangor
Author

Talangor commented May 29, 2022

It seems that the proxy settings are propagated down to some process calling the [https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default] API.

Finally I added NO_PROXY for all the private subnets (e.g. 10.233.0.0/16, 10.233.64.0/16) and that fixed the issue.

I suggest putting the cluster domain (.cluster.local) and the network CIDRs in the default no_proxy configuration.
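
For reference, a sketch of what such exceptions could look like in the inventory's group_vars/all/all.yml (the proxy URL is a placeholder; the CIDRs match the values used later in this thread):

```yaml
# group_vars/all/all.yml (sketch)
http_proxy: "http://proxy.example.com:3128"     # placeholder
https_proxy: "http://proxy.example.com:3128"    # placeholder
# Keep the in-cluster ranges and the cluster domain out of the proxy, otherwise
# the CNI plugin's calls to https://10.233.0.1:443 are sent through the proxy:
no_proxy: "10.233.0.0/18,10.233.64.0/18,.cluster.local,localhost,127.0.0.1"
```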

@cristicalin
Contributor

> It seems that the proxy settings are propagated down to some process calling the [https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default] API.
>
> Finally I added NO_PROXY for all the private subnets (e.g. 10.233.0.0/16, 10.233.64.0/16) and that fixed the issue.
>
> I suggest putting the cluster domain (.cluster.local) and the network CIDRs in the default no_proxy configuration.

So this comes from setting http_proxy in your environment? Unfortunately we don't have a CI test case for this scenario, so it's difficult to catch when it breaks. Personally, my environments don't require a proxy, so it's not a part of the code I see often.

If you want to push a PR with the code you changed, we are happy to review and include it.

@Talangor
Author

Talangor commented May 30, 2022

I added it like this in group_vars/all/all.yml:
no_proxy: "node01,node02,node03,node04,node05,localhost,127.0.0.0,127.0.1.1,127.0.1.1,10.233.0.0/18,10.233.64.0/18,.cluster.local,local.home"

But if we want it to use variables, maybe it should be something like this:

in roles/kubespray-defaults/defaults/main.yaml
no_proxy: "{{ no_proxy | default ('{{ kube_service_addresses }}, {{ kube_pods_subnet }}, .{{ cluster_name }}') }}"
NO_PROXY: "{{ no_proxy | default ('{{ kube_service_addresses }}, {{ kube_pods_subnet }}, .{{ cluster_name }}') }}"

in inventory/sample/group_vars/all/all.yml
# Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy
# Make sure you add kube_service_addresses, kube_pods_subnet and cluster_name
no_proxy: "{{ kube_service_addresses }}, {{ kube_pods_subnet }}, {{ cluster_name }}"

Unfortunately, I haven't had access to my test lab for some time now, so it's best if you could review it; if not, I'll add this to my todo list and test it later on.
I'm truly sorry; I should test before giving suggestions, but I'm helpless right now. Maybe it helps a bit, though.
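
One caveat on the templated suggestion above: as far as I can tell, Jinja does not reliably expand nested `{{ ... }}` placeholders inside a quoted `default()` argument, and having no_proxy reference itself in its own role default risks a recursive templating error. A sketch of the same idea without those pitfalls (an illustration only, not necessarily what Kubespray ends up merging):

```yaml
# roles/kubespray-defaults/defaults/main.yml (sketch)
# Role defaults have the lowest variable precedence, so a no_proxy set in the
# inventory overrides this value automatically; no self-referencing default() needed.
no_proxy: "{{ kube_service_addresses }},{{ kube_pods_subnet }},.{{ cluster_name }}"
```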

@Talangor
Author

Talangor commented May 31, 2022

@cristicalin
Update: fortunately, I had the opportunity to test this code, and it's working as expected.

Talangor added a commit to Talangor/kubespray that referenced this issue Jun 6, 2022

**What type of PR is this?**

/kind feature

**What this PR does / why we need it**:
Sets the kube CIDRs in the no_proxy environment variable.

**Which issue(s) this PR fixes**:
Fixes kubernetes-sigs#8864

**Special notes for your reviewer**:
The default configuration does not include no_proxy settings; if one uses the default config and sets proxy settings, pods cannot connect to the API service.
PS: it's my first time creating a PR; I'll include my code below.

**Does this PR introduce a user-facing change?**:
```release-note
NONE
```
**roles/kubespray-defaults/defaults/main.yaml**
`no_proxy: "{{ no_proxy | default ('{{ kube_service_addresses }},{{ kube_pods_subnet }},{{ cluster_name }}') }}"`
`NO_PROXY: "{{ no_proxy | default ('{{ kube_service_addresses }},{{ kube_pods_subnet }},{{ cluster_name }}') }}"`

**inventory/sample/group_vars/all/all.yml**
`## Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy`
`## Make sure you add kube_service_addresses, kube_pods_subnet and cluster_name below or pods cannot connect to API service`
`no_proxy: "{{ kube_service_addresses }}, {{ kube_pods_subnet }}, {{ cluster_name }}"`
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Aug 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 28, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 28, 2022
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vyom-soft

/reopen

@k8s-ci-robot
Contributor

@vyom-soft: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vyom-soft

Hello, I am seeing the following error:

Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               72s   default-scheduler  Successfully assigned kube-system/kube-proxy-jhf8d to node5
  Warning  FailedCreatePodSandBox  12s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1f81d650b5e9f17d4a01973d52b53417352babd39290e1267770d6b2141f6a8b": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.233.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: i/o timeout

@Talangor
Author

Hi @vyom-soft,
are you using a proxy in your deployment?
If so, you should either set the correct exceptions for your cluster or use an offline installation and avoid the proxy entirely.
In my case, when I was using a proxy, my cluster sent its entire traffic through it, and that caused numerous problems.
