This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

CoreOS 1855.4.0 AWS EBS Mount Lockup #2511

Open
pctj101 opened this issue Oct 17, 2018 · 17 comments

Comments

@pctj101

pctj101 commented Oct 17, 2018

Issue Report

Bug

Ignition crashes system if storage.filesystem is specified

CT Input

storage:
  filesystems:
    - name: data
      mount:
        device: /dev/sdb
        format: ext4
        wipe_filesystem: true
        label: DATA

Convert to userdata
ct < test.ct

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Container Linux Version

CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
ct v0.9.0

Environment

AWS

Expected Behavior

At minimum format my block device

Actual Behavior

System does not boot
Can't login, so can't get logs
Screenshot
https://www.evernote.com/l/AE__MLODCjJN_p8vv9G_LkqC2nBnb6BbAqI

Reproduction Steps

  1. Create EC2 instance, attach 80GB EBS to /dev/sdb, add user data, boot and crash

Other Information

Worked before on older CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

Manually booting without CT/Ignition allows manual format/mounting of /dev/sdb (mounting by label is also no problem)

@pctj101 pctj101 changed the title CoreOS 1855.4.0 AWS EBS CoreOS 1855.4.0 AWS EBS Mount Lockup Oct 17, 2018
@ajeddeloh

Thanks for the report. This probably isn't an Ignition bug but rather a kernel bug since Ignition didn't change between 1855.3.0 and 1855.4.0. Can you repro on alpha?

@pctj101
Author

pctj101 commented Oct 17, 2018

Will check tomorrow :)

@pctj101
Author

pctj101 commented Oct 18, 2018

@ajeddeloh - Same issue on alpha:
CoreOS-alpha-1925.0.0-hvm (ami-01d20d68c856200cc)

Also please note previous working version was much older:
CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

@enieuw

enieuw commented Oct 19, 2018

This happens for me as well on the latest-generation instances. Switching an instance from t2 to t3 results in the system hanging on a systemd unit that is waiting for the device /dev/xvdg.

Perhaps this has something to do with the switch to NVMe device names that t3 instances use.

@enieuw

enieuw commented Oct 19, 2018

I fetched one of the logs from the machines; I see lots of these messages:

[*     ] (1 of 3) A start job is running for dev-xvdg.device (4s / 1min 30s)
[**    ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[***   ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[ ***  ] (2 of 3) A start job is running for Ignition (disks) (10s / no limit)
[  *** ] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[   ***] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[     *] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (13s / no limit)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[ ***  ] (1 of 3) A start job is running for dev-xvdg.device (10s / 1min 30s)
[***   ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[**    ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[*     ] (2 of 3) A start job is running for Ignition (disks) (16s / no limit)
[**    ] (3 of 3) A start job is running for…mapper-usr.device (16s / no limit)
[***   ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[   24.010121] systemd-networkd[242]: eth0: Configured
[ ***  ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (13s / 1min 30s)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[    **] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[     *] (2 of 3) A start job is running for Ignition (disks) (19s / no limit)

It eventually times out:

[  101.111108] systemd[1]: Timed out waiting for device dev-xvdg.device.
[FAILED] Failed to start Ignition (disks).
See 'systemctl status ignition-disks.service' for details.
[  101.154243] ignition[415]: disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/xvdg]: device unit dev-xvdg.device timeout
[  101.159042] systemd[1]: dev-xvdg.device: Job dev-xvdg.device/start failed with result 'timeout'.

@pctj101
Author

pctj101 commented Oct 19, 2018

@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

@enieuw

enieuw commented Oct 19, 2018

@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

It takes a while but eventually they show up under "Instance Settings -> Get system log".

Which instance type are you running by the way?

@pctj101
Author

pctj101 commented Oct 19, 2018

For this debug session I was running t2/t3/m5 (can't remember the exact size)

@enieuw

enieuw commented Oct 19, 2018

Creating a VM as a t2 instance and then changing the instance type to t3 works; I can actually see the symlinks working:

Container Linux by CoreOS stable (1855.4.0)
core@ip-10-14-30-4 ~ $ systemctl status dev-xvdg.device
● dev-xvdg.device - Amazon Elastic Block Store
   Follow: unit currently follows state of sys-devices-pci0000:00-0000:00:1f.0-nvme-nvme1-nvme1n1.device
   Loaded: loaded
   Active: active (plugged) since Fri 2018-10-19 06:49:00 UTC; 1min 8s ago
   Device: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1

Oct 19 06:49:00 ip-10-14-30-4 systemd[1]: Found device Amazon Elastic Block Store.
core@ip-10-14-30-4 ~ $ date
Fri Oct 19 06:50:14 UTC 2018
core@ip-10-14-30-4 ~ $ ls -al /dev/xvdg
lrwxrwxrwx. 1 root root 7 Oct 19 06:49 /dev/xvdg -> nvme1n1
core@ip-10-14-30-4 ~ $

Creating a fresh t3 instance results in the hanging behaviour.

@pctj101
Author

pctj101 commented Oct 19, 2018

Ah, I'm betting that's because Ignition doesn't run on the second boot when the instance type is changed.

@enieuw

enieuw commented Oct 19, 2018

Yeah, most likely. It doesn't trigger the wait for the systemd unit, and so booting continues.

If I specify /dev/nvme1n1 in my Ignition config, it does boot properly. Perhaps the call to systemd is made before udev has created the aliases added by #2399.
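In CT form, that workaround looks something like this (a sketch based on the config from the original report; note the NVMe index is an assumption, since nvme1n1 depends on volume attach order):

```yaml
storage:
  filesystems:
    - name: data
      mount:
        # NVMe name as enumerated by the kernel on nitro-based (t3/m5) instances;
        # the index (nvme1n1) depends on attach order.
        device: /dev/nvme1n1
        format: ext4
        wipe_filesystem: true
        label: DATA
```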

@lucab

lucab commented Oct 19, 2018

@enieuw I think you are waiting for coreos/bootengine#149 to do that.

@pctj101
Author

pctj101 commented Oct 22, 2018

Okay, it looks like a mismatch between assigning the EBS volume to /dev/sdb in the AWS console and /dev/xvdb appearing in Linux.

ap-northeast-1
t2.micro
CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
Root device /dev/xvda
Block devices /dev/xvda /dev/sdb

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Results in:

disks: createFilesystems: op(1): [started]  waiting for devices [/dev/sdb]
disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/sdb]: device unit dev-sdb.device timeout
disks: failed to create filesystems: failed to wait on filesystems devs: device unit dev-sdb.device timeout

Updating the config from sdb -> xvdb finishes the boot.
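For reference, the working CT config differs from the failing one only in the device path:

```yaml
storage:
  filesystems:
    - name: data
      mount:
        device: /dev/xvdb   # was /dev/sdb, as entered in the AWS console
        format: ext4
        wipe_filesystem: true
        label: DATA
```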

Is there already a ticket for sdb vs. xvdb? I think on some systems (I can't remember which) /dev/sdb shows up instead.

@pctj101
Author

pctj101 commented Oct 23, 2018

As a follow-on thought, it seems EBS volumes (add-on disks on AWS) sometimes show up as /dev/sdb and sometimes as /dev/xvdb. That makes Ignition configs fail when the names are mismatched, and makes it difficult to use the same config across different servers.

Is there any guidance on /dev/sdb vs. /dev/xvdb going forward in CoreOS? Following such guidance might have prevented this ticket.

@lucab

lucab commented Oct 23, 2018

@pctj101 this is an unfortunate choice on AWS side, see #2399 (comment). Their volumes/instances/names grid is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html

@pctj101
Author

pctj101 commented Oct 23, 2018

@lucab - Yes, I too have seen my EC2 launch spec and the CoreOS device path mismatch. I think it's related to this item on the same page you linked:

Depending on the block device driver of the kernel, the device could be attached with a different name than you specified. For example, if you specify a device name of /dev/sdh, your device could be renamed /dev/xvdh or /dev/hdh.

So it seems the kernel configuration (and thus CoreOS) also has some interaction here. It's not just "it's AWS" but "it's AWS and how CoreOS interacts with it", which is why I'm raising this question. :)

Anyway, yes, I read the other thread you linked. I can share that for device mapping I've abandoned Ignition and resorted to a series of shell scripts to format and mount things properly (despite instance-type changes). I'm not sure that's the long-term way to do it, but I'm pretty sure the discussion either way is lengthy and comes with plenty of ideology. :)
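A minimal sketch of such a fallback script (the helper name `resolve_ebs_device` is hypothetical; it assumes the volume appears under one of the two legacy names, with NVMe instances covered by the /dev/xvd* compatibility symlinks udev creates):

```shell
#!/bin/bash
# Hypothetical helper (not part of Ignition): given the device letter used
# in the AWS console (e.g. "b" for /dev/sdb), print whichever legacy name
# the kernel actually exposed. The second argument overrides the device
# directory, which makes the function testable outside AWS.
resolve_ebs_device() {
  local letter="$1"
  local devdir="${2:-/dev}"
  local cand
  for cand in "$devdir/sd$letter" "$devdir/xvd$letter"; do
    if [ -e "$cand" ]; then
      printf '%s\n' "$cand"
      return 0
    fi
  done
  return 1   # neither name exists yet
}
```

On a t2 instance with the volume attached as /dev/sdb, `resolve_ebs_device b` would print /dev/xvdb once the device node exists; a boot-time script could loop on it before formatting.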

When it comes to AWS totally changing device paths for NVMe, even I have trouble justifying automagic resolution in Ignition.

It's definitely a usability discussion rather than a bug discussion.

@seh

seh commented Dec 4, 2018

Possibly related: #2481.
