This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

CoreOS 1855.4.0 AWS EBS Mount Lockup #2511

Open
pctj101 opened this issue Oct 17, 2018 · 17 comments

Comments

@pctj101

pctj101 commented Oct 17, 2018

Issue Report

Bug

Ignition crashes system if storage.filesystem is specified

CT Input

storage:
  filesystems:
    - name: data
      mount:
        device: /dev/sdb
        format: ext4
        wipe_filesystem: true
        label: DATA

Convert to userdata
ct < test.ct

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Container Linux Version

CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
ct v0.9.0

Environment

AWS

Expected Behavior

At minimum format my block device

Actual Behavior

System does not boot
Can't login, so can't get logs
Screenshot
https://www.evernote.com/l/AE__MLODCjJN_p8vv9G_LkqC2nBnb6BbAqI

Reproduction Steps

  1. Create EC2 instance, attach 80GB EBS to /dev/sdb, add user data, boot and crash

Other Information

Worked before on older CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

Manually booting without CT/Ignition allows manual format/mounting of /dev/sdb (mounting by label is also no problem)

@pctj101 pctj101 changed the title CoreOS 1855.4.0 AWS EBS CoreOS 1855.4.0 AWS EBS Mount Lockup Oct 17, 2018
@ajeddeloh

Thanks for the report. This probably isn't an Ignition bug but rather a kernel bug since Ignition didn't change between 1855.3.0 and 1855.4.0. Can you repro on alpha?

@pctj101
Author

pctj101 commented Oct 17, 2018

Will check tomorrow :)

@pctj101
Author

pctj101 commented Oct 18, 2018

@ajeddeloh - Same issue on alpha:
CoreOS-alpha-1925.0.0-hvm (ami-01d20d68c856200cc)

Also please note previous working version was much older:
CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

@enieuw

enieuw commented Oct 19, 2018

This happens for me as well on the latest-generation instances. Switching an instance from t2 to t3 results in the system hanging on a systemd unit that is waiting for the device /dev/xvdg.

Perhaps this has something to do with the switch to NVMe device names that t3 instances use.

@enieuw

enieuw commented Oct 19, 2018

I fetched one of the logs from the machines; I see lots of these messages:

[*     ] (1 of 3) A start job is running for dev-xvdg.device (4s / 1min 30s)
[**    ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[***   ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[ ***  ] (2 of 3) A start job is running for Ignition (disks) (10s / no limit)
[  *** ] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[   ***] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[     *] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (13s / no limit)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[ ***  ] (1 of 3) A start job is running for dev-xvdg.device (10s / 1min 30s)
[***   ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[**    ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[*     ] (2 of 3) A start job is running for Ignition (disks) (16s / no limit)
[**    ] (3 of 3) A start job is running for…mapper-usr.device (16s / no limit)
[***   ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[   24.010121] systemd-networkd[242]: eth0: Configured
[ ***  ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (13s / 1min 30s)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[    **] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[     *] (2 of 3) A start job is running for Ignition (disks) (19s / no limit)

It eventually times out:

[  101.111108] systemd[1]: Timed out waiting for device dev-xvdg.device.
[FAILED] Failed to start Ignition (disks).
See 'systemctl status ignition-disks.service' for details.
[  101.154243] ignition[415]: disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/xvdg]: device unit dev-xvdg.device timeout
[  101.159042] systemd[1]: dev-xvdg.device: Job dev-xvdg.device/start failed with result 'timeout'.

@pctj101
Author

pctj101 commented Oct 19, 2018

@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

@enieuw

enieuw commented Oct 19, 2018

@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

It takes a while but eventually they show up under "Instance Settings -> Get system log".

Which instance type are you running by the way?

@pctj101
Author

pctj101 commented Oct 19, 2018

For this debug session I was running t2/t3/m5 (can't remember the exact size)

@enieuw

enieuw commented Oct 19, 2018

Creating a VM as a t2 instance and then changing the instance type to t3 works; I can actually see the symlinks working:

Container Linux by CoreOS stable (1855.4.0)
core@ip-10-14-30-4 ~ $ systemctl status dev-xvdg.device
● dev-xvdg.device - Amazon Elastic Block Store
   Follow: unit currently follows state of sys-devices-pci0000:00-0000:00:1f.0-nvme-nvme1-nvme1n1.device
   Loaded: loaded
   Active: active (plugged) since Fri 2018-10-19 06:49:00 UTC; 1min 8s ago
   Device: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1

Oct 19 06:49:00 ip-10-14-30-4 systemd[1]: Found device Amazon Elastic Block Store.
core@ip-10-14-30-4 ~ $ date
Fri Oct 19 06:50:14 UTC 2018
core@ip-10-14-30-4 ~ $ ls -al /dev/xvdg
lrwxrwxrwx. 1 root root 7 Oct 19 06:49 /dev/xvdg -> nvme1n1
core@ip-10-14-30-4 ~ $

Creating a fresh t3 instance results in the hanging behaviour.

@pctj101
Author

pctj101 commented Oct 19, 2018

Ah, I'm betting that's because Ignition doesn't run on the second boot when the instance type is changed.

@enieuw

enieuw commented Oct 19, 2018

Yeah, most likely. It doesn't trigger the wait for the systemd unit, and so booting continues.

If I specify /dev/nvme1n1 in my Ignition config, it does boot properly. Perhaps the call to systemd is made before udev has created the aliases added by #2399.
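In CT form, that workaround looks something like this (a sketch based on the config from the original report; note the NVMe index is an assumption, since nvme1n1 depends on volume attach order):

```yaml
storage:
  filesystems:
    - name: data
      mount:
        # NVMe name as enumerated by the kernel on nitro-based (t3/m5) instances;
        # the index (nvme1n1) depends on attach order.
        device: /dev/nvme1n1
        format: ext4
        wipe_filesystem: true
        label: DATA
```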

@lucab

lucab commented Oct 19, 2018

@enieuw I think you are waiting for coreos/bootengine#149 to do that.

@pctj101
Author

pctj101 commented Oct 22, 2018

Okay, it looks like a mismatch between assigning the EBS volume to /dev/sdb in the AWS console and /dev/xvdb appearing in Linux.

ap-northeast-1
t2.micro
CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
Root device /dev/xvda
Block devices /dev/xvda /dev/sdb

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Results in:

disks: createFilesystems: op(1): [started]  waiting for devices [/dev/sdb]
disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/sdb]: device unit dev-sdb.device timeout
disks: failed to create filesystems: failed to wait on filesystems devs: device unit dev-sdb.device timeout

Updating the config from sdb -> xvdb finishes the boot.
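For reference, the working CT config differs from the failing one only in the device path:

```yaml
storage:
  filesystems:
    - name: data
      mount:
        device: /dev/xvdb   # was /dev/sdb, as entered in the AWS console
        format: ext4
        wipe_filesystem: true
        label: DATA
```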

Is there already a ticket for sdb vs. xvdb? I think on some systems (I can't remember which) /dev/sdb shows up instead.

@pctj101
Author

pctj101 commented Oct 23, 2018

As a follow-on thought, it seems EBS volumes (add-on disks on AWS) sometimes show up as /dev/sdb and sometimes as /dev/xvdb. That makes Ignition configs fail when the names are mismatched, and makes it difficult to use the same config across different servers.

Is there any guidance on /dev/sdb vs. /dev/xvdb going forward in CoreOS? Following such guidance might have prevented this ticket.

@lucab

lucab commented Oct 23, 2018

@pctj101 this is an unfortunate choice on AWS side, see #2399 (comment). Their volumes/instances/names grid is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html

@pctj101
Author

pctj101 commented Oct 23, 2018

@lucab - Yes, I too have seen my EC2 launch spec and the CoreOS device path mismatch. I think it's related to this item on the same page you linked:

Depending on the block device driver of the kernel, the device could be attached with a different name than you specified. For example, if you specify a device name of /dev/sdh, your device could be renamed /dev/xvdh or /dev/hdh.

So it seems the kernel configuration (and thus CoreOS) also has some interaction here. It's not just "it's AWS" but "it's AWS and how CoreOS interacts with it", which is why I'm raising this question. :)

Anyway, yes, I read the other thread you linked. I can share that for device mapping I've abandoned Ignition and resorted to a series of shell scripts to format and mount things properly (despite instance-type changes). I'm not sure that's the long-term way to do it, but I'm pretty sure the discussion either way is lengthy and comes with plenty of ideology. :)
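A minimal sketch of such a fallback script (the helper name `resolve_ebs_device` is hypothetical; it assumes the volume appears under one of the two legacy names, with NVMe instances covered by the /dev/xvd* compatibility symlinks udev creates):

```shell
#!/bin/bash
# Hypothetical helper (not part of Ignition): given the device letter used
# in the AWS console (e.g. "b" for /dev/sdb), print whichever legacy name
# the kernel actually exposed. The second argument overrides the device
# directory, which makes the function testable outside AWS.
resolve_ebs_device() {
  local letter="$1"
  local devdir="${2:-/dev}"
  local cand
  for cand in "$devdir/sd$letter" "$devdir/xvd$letter"; do
    if [ -e "$cand" ]; then
      printf '%s\n' "$cand"
      return 0
    fi
  done
  return 1   # neither name exists yet
}
```

On a t2 instance with the volume attached as /dev/sdb, `resolve_ebs_device b` would print /dev/xvdb once the device node exists; a boot-time script could loop on it before formatting.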

When it comes to AWS totally changing device paths for NVMe, even I have trouble justifying automagic resolution in Ignition.

It's definitely a usability discussion rather than a bug discussion.

@seh

seh commented Dec 4, 2018

Possibly related: #2481.
