Interface MTU change on AWS instances supporting Jumbo packets #2443

Closed
dghubble opened this issue May 27, 2018 · 9 comments

Comments

@dghubble
Member

dghubble commented May 27, 2018

Issue Report

Bug

Between Container Linux releases, the default interface MTU used on AWS instances supporting Jumbo packets has changed/regressed from 9001 to 1500. This causes connectivity problems between pods across node boundaries. CNI providers don't expect the MTU to change over time node-by-node. It also means Jumbo packets aren't being used. I received pages shortly after the OS auto-update occurred.

Container Linux Version

Stable 1745.4.0
Stable 1745.3.1

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.4.0
VERSION_ID=1745.4.0
BUILD_ID=2018-05-24-2146
PRETTY_NAME="Container Linux by CoreOS 1745.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS

Reproduction Steps

Launch an EC2 instance that supports Jumbo packets. A t2.small will suffice. With stable 1688.5.3, the interface comes up with MTU 9001.

$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000                                                                                                
    link/ether 06:06:9f:dc:08:10 brd ff:ff:ff:ff:ff:ff

Now try with stable 1745.3.1 or 1745.4.0. The MTU has changed to 1500, as though Jumbo frame support isn't available.

$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:6d:6b:43:05:ec brd ff:ff:ff:ff:ff:ff
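
One way to confirm what the AWS DHCP server is actually offering (not part of the original report, and it assumes you can capture a DHCP exchange, e.g. across a lease renewal or an interface bounce) is to watch for option 26, the Interface MTU, in the decoded traffic:

$ sudo tcpdump -ni eth0 -vv 'udp port 67 or udp port 68'
# look for option 26 (Interface MTU) in the DHCPOFFER/DHCPACK; on instances
# that support Jumbo frames it is expected to carry 9001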
@lucab

lucab commented May 27, 2018

I'm not too familiar with such network details, but I guess that AWS is pushing the MTU parameter via DHCP and UseMTU= should be picking it up.
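
For reference (this is an illustration, not the unit actually shipped on Container Linux): DHCP-provided MTU handling is controlled by UseMTU= in the [DHCP] section of the matching .network unit, roughly:

$ cat /etc/systemd/network/50-dhcp-mtu.network
[Match]
Name=eth0

[Network]
DHCP=yes

[DHCP]
UseMTU=true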

Can you please:

  • check for relevant log entries with journalctl -u systemd-networkd
  • post the output of networkctl status -a
  • diff /run/systemd/netif/leases/* and /run/systemd/netif/links/* for relevant changes between the current and previous stable (see the sketch below).
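
A sketch of one way to capture those three items on each release so they can be compared (the /tmp paths are just examples):

$ v=$(. /etc/os-release; echo $VERSION)
$ mkdir -p /tmp/netdebug-$v
$ sudo journalctl -u systemd-networkd --no-pager > /tmp/netdebug-$v/networkd.log
$ networkctl status -a > /tmp/netdebug-$v/networkctl.txt
$ sudo cp -r /run/systemd/netif/leases /run/systemd/netif/links /tmp/netdebug-$v/
# after collecting the same data on the other release:
$ diff -r /tmp/netdebug-1688.5.3 /tmp/netdebug-1745.4.0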

@dghubble
Member Author

Seems like a good guess. It looks like it's attempting to set the MTU and now failing. On 1688.5.3,

$ journalctl -u systemd-networkd
May 27 21:35:00 localhost systemd-networkd[678]: Enumeration completed
May 27 21:35:00 localhost systemd[1]: Started Network Service.
May 27 21:35:00 localhost systemd-networkd[678]: lo: Configured
May 27 21:35:00 localhost systemd-networkd[678]: eth0: IPv6 successfully enabled
May 27 21:35:00 localhost systemd-networkd[678]: eth0: Gained carrier
May 27 21:35:00 localhost systemd-networkd[678]: eth0: DHCPv4 address 10.0.2.95/20 via 10.0.0.1
May 27 21:35:00 localhost systemd-networkd[678]: Not connected to system bus, not setting hostname.
May 27 21:35:01 localhost systemd-networkd[678]: eth0: Gained IPv6LL
May 27 21:35:01 localhost systemd-networkd[678]: eth0: Configured

and on 1745.4.0,

May 27 21:05:56 localhost systemd-networkd[675]: Enumeration completed
May 27 21:05:56 localhost systemd[1]: Started Network Service.
May 27 21:05:56 localhost systemd-networkd[675]: lo: Configured
May 27 21:05:56 localhost systemd-networkd[675]: eth0: IPv6 successfully enabled
May 27 21:05:56 localhost systemd-networkd[675]: eth0: Gained carrier
May 27 21:05:56 localhost systemd-networkd[675]: eth0: DHCPv4 address 10.0.44.198/20 via 10.0.32.1
May 27 21:05:56 localhost systemd-networkd[675]: Not connected to system bus, not setting hostname.
May 27 21:05:56 localhost systemd-networkd[675]: eth0: Could not set MTU: Invalid argument
May 27 21:05:58 localhost systemd-networkd[675]: eth0: Gained IPv6LL
May 27 21:06:02 localhost systemd-networkd[675]: eth0: Configured

I haven't found anything notable or different in the /run/systemd/netif locations.
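
One extra check that might help localize the failure (a suggestion, not something tried above): set the MTU by hand on the affected release, so systemd-networkd is out of the picture:

$ sudo ip link set dev eth0 mtu 9001
# if this also fails with an EINVAL-style error, the kernel/driver is rejecting
# the value and the DHCP/networkd side is likely fine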

@lucab

lucab commented May 28, 2018

Thanks. I'm observing this on the latest alpha too, and it seems to have started with the systemd v238 bump (unrelated to the kernel version). I forwarded this to systemd/systemd#9102 with some additional details, but I haven't been able to locate where the real issue lies. Waiting for more insight from upstream.

@bgilbert
Contributor

Per systemd/systemd#9102 (comment), this looks like a kernel regression. RHBZ.

@ajeddeloh

ajeddeloh commented May 30, 2018

Interestingly, this doesn't impact all instance types: m4.large does not exhibit it, but t2.small does. I cannot reproduce it with QEMU.

@bgilbert
Contributor

Reverting coreos/linux@f599c64 fixes the problem.

This should be fixed in alpha 1786.2.0, beta 1772.2.0, and stable 1745.5.0, due shortly. Thanks for the report, @dghubble!

@dghubble
Member Author

👋 Hooray, thanks Ben and company!

@whereisaaron

Thanks for fixing this. I've been chasing load balancer PMTU problems and tracked them back to this. I saw this issue when updating to CoreOS-stable-1745.3.1-hvm. Suddenly all my T2 instances had an MTU of 1500, whereas other instance types like M* and R* had an MTU of 9001. This caused PMTU problems for external applications that were load-balanced across both types of instance.

I am guessing that while T2 and other instance types all support jumbo frames, T2 instances default to 1500 whereas the others default to 9001 (before DHCP overrides that).
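
For anyone triaging the same symptom, one way to spot the mismatch from an affected host (assuming tracepath from iputils is available; the peer address is a placeholder) is to compare the local interface MTU with the path MTU discovered toward another node:

$ ip link show eth0 | grep -o 'mtu [0-9]*'
$ tracepath -n 10.0.44.198
# a pmtu below the interface MTU on only part of the fleet is consistent with
# mixed 1500/9001 nodes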

@bgilbert
Contributor

netdev thread.
