
1.17 alpha versions causing regression for kiam? #8562

Closed
jhohertz opened this issue Feb 14, 2020 · 22 comments

Comments

@jhohertz
Contributor

1. What kops version are you running? The command kops version will display
this information.

Any of the 1.17 alphas so far.

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Seen in 1.17.0-rc.2 through 1.17.3. Works without issue on kops/k8s 1.15- and 1.16-built clusters. The ONLY change is the bump to 1.17.x.

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Try to install kiam via its included Helm chart onto a kops 1.17.x-built cluster.

5. What happened after the commands executed?

The kiam-agent DaemonSet pods crashloop.

6. What did you expect to happen?

No crashloop.

7. Please provide your cluster manifest.

Will follow up with this if asked for. The main thing applicable here is that we are using CoreDNS.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

From the agent logs w/ gRPC debugging enabled:

kubectl -n kube-system logs kiam-agent-5z8tt
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-02-14T18:13:41Z"}
INFO: 2020/02/14 18:13:41 parsed scheme: "dns"
INFO: 2020/02/14 18:13:46 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
WARNING: 2020/02/14 18:13:46 grpc: failed dns A record lookup due to lookup kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
INFO: 2020/02/14 18:13:46 ccResolverWrapper: got new service config: 
INFO: 2020/02/14 18:13:46 ccResolverWrapper: sending new addresses to cc: []
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-02-14T18:13:46Z"}

9. Anything else we need to know?

Bug also posted with kiam folks here: uswitch/kiam#378

@jhohertz
Contributor Author

That kubernetes issue just linked is likely at the root of all this.

@jhohertz
Contributor Author

Update: By some accounts this seems to be specific to using the flannel/canal CNI with the vxlan backend, and further testing seems to support that.

@jhohertz
Contributor Author

So the problem clearly isn't with kops itself; however, it might be worthwhile to warn users in the documentation, or even have kops treat the flannel/canal CNI with the vxlan backend as an invalid configuration on 1.17 versions, as it's going to result in more odd reports like this one. :)

@johngmyers
Member

What, specifically, are the invalid configurations?

@jhohertz
Contributor Author

  1. Using the Canal CNI (as it is fixed to use vxlan in kops; not sure it works with other backends)

  2. Using the Flannel CNI in its default "vxlan" configuration. Superficial testing shows the problem doesn't seem to exist with the "udp" backend; however, most people using Flannel who are working around the issue seem to suggest the "host-gw" backend, which is not currently usable via kops. (See the spec sketch below for what these configurations look like.)

See the flannel issue for more info here: flannel-io/flannel#1243
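
For illustration, a minimal sketch of the cluster spec stanzas these two configurations correspond to (field names reflect my reading of the kops networking spec, so treat this as an assumption and check it against your kops version):

spec:
  networking:
    # Affected: flannel with its default vxlan backend
    flannel:
      backend: vxlan
    # Also affected: canal, which kops fixes to the vxlan backend
    # canal: {}

Per the superficial testing above, switching the flannel backend to "udp" appeared to avoid the problem, while "host-gw" is not currently selectable through kops.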

@johngmyers
Member

It seems there's not enough information to identify a particular bad configuration. It looks like the issue is still being triaged and is likely a bug in Flannel and/or Canal. There's time before kops 1.17 is released for the bug(s) to be fixed. If it later turns out to be a more permanent situation, we could add an api validation check then.

@jhohertz
Contributor Author

See comment above for what constitutes a non-working configuration, which I've detailed as requested.

The bug is in Flannel (which Canal uses), and I've linked the issue involved. Yes, it's possible that there will be a fix made available, but I'm not holding my breath as the project seems to be trending towards dormancy.

@johngmyers
Member

johngmyers commented Feb 22, 2020

So you're proposing kops should disallow a CNI of Canal or Flannel with Backend of vxlan for Kubernetes versions equal to or greater than 1.17?

@justinsb
Member

Thanks for reporting @jhohertz. The current theory is that it's related to the kernel version: some kernels have bugs in the computation of the checksums, which can be worked around by turning off offload of that computation.

Which image (AMI) are you using (or are you using the default kops image)?
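
For anyone who wants to try this by hand, a minimal sketch of that workaround as a one-off command (assuming the default flannel vxlan interface name flannel.1; adjust the name if your VNI differs, and run it as root on each node):

# Disable TX checksum offload on the flannel vxlan interface
ethtool -K flannel.1 tx-checksum-ip-generic off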

@jhohertz
Contributor Author

We're currently using the latest Flatcar stable release.

I am currently looking at trying to patch in the ethtool thing for testing.

@jhohertz
Contributor Author

jhohertz commented Apr 27, 2020

I may have found hints as to "what's different between 1.16 and 1.17".

A netlink library dependency was bumped, and within that bump there are specific changes to vxlan and the handling of checksums. It looks like it should really have only added IPv6 UDP checksum support, but... after searching around for what's different between 1.16 and 1.17, this kind of stands out.

Comment on flannel issue: flannel-io/flannel#1243 (comment)

Perhaps this will help folks find out what's going on? (Or possibly prove to be a red herring...)

That update also includes new ethtool-related code.

@johngmyers
Member

Is someone able to write up a release note for Kops 1.17? I would prefer we not hold up 1.17 indefinitely for a new version of Flannel.

@johngmyers
Member

Can this be closed now that #9074 has been merged and cherrypicked to 1.17?

@jhohertz
Contributor Author

Probably? Any way you could cut another beta with this in place for wider testing?

@johngmyers
Member

/close

@k8s-ci-robot
Contributor

@johngmyers: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakman
Member

hakman commented May 11, 2020

@jhohertz I think the next release will be more of an RC or final. Not sure anything else can be done to improve things with Flannel until a new release comes.

@paalkr

paalkr commented May 13, 2020

Is there a kops 1.17.0 build available with this fix included? We have encountered kiam issues when testing kops 1.17.0-beta.2 with flannel networking, which we need for our Windows worker nodes to join.

@olemarkus
Member

No release yet. It will go into the next one.

@jhohertz
Contributor Author

Just a note of warning: this nightmare may also have just landed in 1.16, as of k8s 1.16.10. Still investigating, but it's behaving the exact same way.

@paalkr

paalkr commented May 27, 2020

We run flannel on a non-standard port, so for us the suggested fix won't help. But it's already easy to address this flannel issue today, using a custom hook in the cluster manifest.

Replace 4096 with 1 if you run with the standard flannel setup.

spec:
  hooks:
  - name: flannel-4096-tx-checksum-offload-disable.service
    # Temporary fix until https://github.com/kubernetes/kops/pull/9074 is released
    roles:
    - Node
    - Master
    useRawManifest: true
    manifest: |
      [Unit]
      Description=Disable TX checksum offload on flannel.4096
      After=sys-devices-virtual-net-flannel.4096.device
      After=sys-subsystem-net-devices-flannel.4096.device
      After=docker.service
      [Service]
      Type=oneshot
      ExecStart=/sbin/ethtool -K flannel.4096 tx-checksum-ip-generic off
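
To confirm a hook like this took effect, one option (a sketch; adjust the interface name to match your VNI) is to check the offload settings on a node:

# Should report "tx-checksum-ip-generic: off" once the hook has run
ethtool -k flannel.4096 | grep tx-checksum-ip-generic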

@jhohertz
Contributor Author

I guess that was a bit dramatic of me. 😄 It just bothered me that I couldn't explain why, though looking at the .10 patch, an iptables version bump (which also showed up between 1.16.0 and 1.17.0) may be the only networking-related thing in it.

I'm aware of that workaround, but thank you for mentioning it anyway.
