This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

AWS: Low timeout for NVMe devices #2484

Closed
johanneswuerbach opened this issue Jul 30, 2018 · 9 comments

@johanneswuerbach

johanneswuerbach commented Jul 30, 2018

Issue Report

Bug

Container Linux Version

1800.5.0

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?

AWS us-east-1 c5.2xlarge

Expected Behavior

The NVMe I/O operation timeout should be configured according to the AWS recommendations: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Actual Behavior

The NVMe timeout is left at the kernel default of 30 seconds.

$ cat /sys/module/nvme_core/parameters/io_timeout
30
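
The same check can be scripted so it is easy to run across a fleet. This is only a minimal sketch: the sysfs path is the one shown above, the target value is the one that appears on the kernel command line later in this thread, and the function names `read_io_timeout` and `timeout_is_low` are hypothetical.

```python
from pathlib import Path

# Target taken from the kernel command line posted later in this thread.
TARGET_IO_TIMEOUT = 4294967295


def read_io_timeout(param_path="/sys/module/nvme_core/parameters/io_timeout"):
    """Return the current NVMe I/O timeout in seconds, or None if the file is missing."""
    path = Path(param_path)
    if not path.exists():
        # nvme_core is not loaded, or this is not a Linux host.
        return None
    return int(path.read_text().strip())


def timeout_is_low(current, target=TARGET_IO_TIMEOUT):
    """True when a timeout was read and it is below the target."""
    return current is not None and current < target


if __name__ == "__main__":
    current = read_io_timeout()
    print(f"io_timeout={current} low={timeout_is_low(current)}")
```

With the value reported above (30), `timeout_is_low(30)` is `True`.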

Other Information

Might be related to #2371

bgilbert changed the title from "AWS: Low timeout for" to "AWS: Low timeout for NVMe devices" on Jul 31, 2018
@mariusgrigoriu

Where in the CoreOS code-base would this value be updated?

@Alalk

Alalk commented Aug 8, 2018

@johanneswuerbach I'm testing whether the settings from the AWS docs fix this.

I was able to get the nvme_core parameters onto the kernel command line at boot. I had to edit /usr/share/oem/grub.cfg on a running EC2 instance, then create an AMI from it and launch a new instance from that AMI.

I'm still not seeing the drives get picked up, although I'm testing on an i3 bare-metal instance.

[    0.000000] Command line: BOOT_IMAGE=/coreos/vmlinuz-a mount.usr=/dev/mapper/usr verity.usr=PARTUUID=7130c94a-213a-4e5a-8e26-6cce9662f132 rootflags=rw mount.usrflags=ro consoleblank=0 root=LABEL=ROOT console=ttyS0,115200n8 nvme_core.io_timeout=4294967295 nvme_core.max_retries=10 coreos.oem.id=ec2 modprobe.blacklist=xen_fbfront net.ifnames=0 verity.usrhash=398d83dd5252c42312d7ff4b49d0b854072cfcc03657d72e7c792ae24d60077e
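
To confirm that the parameters actually reached the kernel, the command line above can be parsed for `nvme_core.*` entries. This is a rough sketch; the helper name `parse_module_params` is hypothetical, and on a live machine the input would come from `/proc/cmdline` rather than a pasted string.

```python
def parse_module_params(cmdline, module="nvme_core"):
    """Collect `module.param=value` tokens from a kernel command line into a dict."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            # Split only on the first "=" so values containing "=" stay intact.
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params


# Shortened copy of the command line from the boot log above.
sample = (
    "root=LABEL=ROOT console=ttyS0,115200n8 "
    "nvme_core.io_timeout=4294967295 nvme_core.max_retries=10 coreos.oem.id=ec2"
)
print(parse_module_params(sample))
# {'io_timeout': '4294967295', 'max_retries': '10'}
```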


@Alalk

Alalk commented Aug 9, 2018

@johanneswuerbach
I got this working. The GRUB config needs to be changed so that these nvme_core defaults are set. Note that this only works on kernel versions above 4.15. I will see if I can get a pull request into CoreOS for the fix.
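
The kernel-version constraint can be expressed as a small helper that picks the largest usable timeout. This is a sketch under assumptions: the two limits (255 and 4294967295) are the values quoted in this thread, the helper name `max_io_timeout` is hypothetical, and treating 4.15 itself as supporting the larger value follows my reading of the AWS documentation rather than the "above 4.15" wording here, so adjust the comparison if that turns out to be wrong.

```python
import re

MAX_TIMEOUT_OLD = 255          # limit quoted in this thread for kernels below 4.15
MAX_TIMEOUT_NEW = 4294967295   # value used on the command line earlier in this thread


def max_io_timeout(kernel_release):
    """Return the largest io_timeout assumed valid for a release like '4.14.63-coreos'."""
    match = re.match(r"(\d+)\.(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {kernel_release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    # Assumption: 4.15 itself already accepts the larger value.
    return MAX_TIMEOUT_NEW if version >= (4, 15) else MAX_TIMEOUT_OLD
```

On a running machine the release string could be taken from `platform.release()` in the standard library.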

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018
updating the grub config to use the nvme defaults required by aws. Should solve the failure to pass status checks. (eventually)
This only works on kernel version above 4.15. (core timeout max is 255 for below 4.15)

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12
@r7vme

r7vme commented Sep 24, 2018

We are also hitting this issue. Setting the timeout (255 seconds) and retries (10) seems to help.

Would love to have this "out-of-the-box" in AWS images.

@bgilbert
Contributor

Closing as duplicate of #2464.

@bgilbert
Contributor

Reopening per #2464 (comment).

bgilbert reopened this on Mar 14, 2019
@dm0-

dm0- commented May 7, 2019

The NVMe timeout has been changed for EC2 in the current alpha. It will promote to stable in mid-June.

@pms1969

pms1969 commented Jul 8, 2019

I'm running a mix of 2079.6 and 2135.4 and neither has the correct setting... What version is meant to have this fix?

@bgilbert
Contributor

bgilbert commented Jul 8, 2019

It's in 2135.0.0 and above, but only for new installs. Machines that are upgraded from older releases retain their previous settings.
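
For machines upgraded from older releases, the setting therefore has to be added by hand. One possible approach, following the /usr/share/oem/grub.cfg edit described earlier in this thread, is sketched below. It assumes the OEM grub.cfg accepts extra kernel arguments through a `set linux_append="..."` line, which should be verified against the Container Linux documentation for your release. The helper name `add_kernel_args` is hypothetical, and the argument values are the ones posted earlier (use 255 for the timeout on kernels below 4.15).

```python
NVME_ARGS = "nvme_core.io_timeout=4294967295 nvme_core.max_retries=10"


def add_kernel_args(grub_cfg, args=NVME_ARGS):
    """Return grub.cfg text with a `set linux_append` line for `args`, unless already present."""
    if args in grub_cfg:
        # Already configured; return the text unchanged so the edit is idempotent.
        return grub_cfg
    if grub_cfg and not grub_cfg.endswith("\n"):
        grub_cfg += "\n"
    # Assumption: the OEM grub.cfg honours `linux_append` for extra kernel arguments.
    return grub_cfg + f'set linux_append="$linux_append {args}"\n'
```

The function only transforms text. Writing the result back to /usr/share/oem/grub.cfg requires root, and the new arguments take effect after a reboot.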
