Latest dependency bump of OpenGCS, Alpine, kernel, runc doesn't work #47
@rn / @jhowardmsft any comments on this one?
Do you have an isolated repro without Compose in the picture, i.e. a simple docker run statement? If so, please run the daemon in debug mode (dockerd -D --experimental) and provide the daemon debug output. Can you also provide a link to the initrd.img and kernel files you are using?
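A minimal repro loop along those lines might look like the following sketch; the image name and run command are placeholders (assumptions), not taken from this thread:

```shell
# Sketch, assuming an LCOW-enabled daemon on Windows. Run the daemon
# in debug + experimental mode, capturing its output to a file:
dockerd -D --experimental > daemon-debug.log 2>&1 &

# Then, from another console, a bare docker run with no Compose in
# the picture. alpine / uname here are placeholder stand-ins for
# whatever image and command actually hang:
docker run --rm alpine uname -a
```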
Sorry if it wasn't clear, but yes, there is a simple repro. It just hangs like this on the given kernel image:
Binaries are available at https://puppet.box.com/s/17zos8hvr6mc0wsunwo7iu7in6b3irss

Initial Daemon startup
Container request
To compare against the previous image, here is the same debug output:

Initial Daemon Start
Container request (up to point at which new image hangs)
Container Request, continued: success
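When comparing two daemon traces like these, one low-tech way to find the divergence point is to diff the good and hanging logs and look at the first line unique to the good trace. A self-contained sketch; the file names and the sample contents below are invented stand-ins for the real captured logs:

```shell
mkdir -p /tmp/lcow-triage
# Stand-ins for the two captured daemon debug traces:
printf 'step1\nstep2\nstep3\n' > /tmp/lcow-triage/good.log
printf 'step1\nstep2\n'        > /tmp/lcow-triage/hang.log

# comm needs sorted input; real daemon logs would need `sort` first.
# The first line unique to the good trace (-13 suppresses lines
# unique to file1 and lines common to both) marks where the hanging
# run stopped making progress.
comm -13 /tmp/lcow-triage/hang.log /tmp/lcow-triage/good.log | head -n 1
# prints: step3
```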
Hmmm. Those binaries seem to work just fine on my RS5 machine (slightly older build than yours - UBR 253 rather than your 437). So I think it's more environmental than a fundamental issue. The docker run part, showing the kernel version:
And the utility VM side showing the commit that was used to build:
Is it a 100% repro, or intermittent? How long does it appear to hang for? Does it eventually time out? Are you able to pinpoint which of the two files is the culprit (i.e. switch in just initrd.img or just the kernel)?
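That "which file is the culprit" question boils down to a swap matrix: hold one of the two boot files at the known-good version and replace only the other. A sketch; the install path below is the default location the Docker daemon reads LCOW boot files from on my understanding, but treat it (and the old/new directory names) as assumptions to adjust for your setup:

```shell
# Run from an elevated Git Bash shell on the Windows host.
LCOW='/c/Program Files/Linux Containers'   # assumed default path

# Test 1: new initrd.img with the old kernel
cp new/initrd.img "$LCOW/initrd.img"
cp old/kernel     "$LCOW/kernel"
# restart the daemon, then try: docker run --rm alpine uname -a

# Test 2: old initrd.img with the new kernel
cp old/initrd.img "$LCOW/initrd.img"
cp new/kernel     "$LCOW/kernel"
# restart the daemon and re-run; whichever combination hangs
# identifies the culprit file.
```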
I can reproduce the issue 100% on two separate machines (with the same Windows kernel) - one running locally and another Azure hosted VM. Both are VMs rather than bare metal if the nested virtualization makes any difference.
That said, I haven't verified whether the hang ever times out, though I do know it hangs for several minutes at least. Let me run a few quick tests to investigate further. Are there any lower-level logs I can get from the event log or elsewhere for the HCS bits?
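On the event-log side, the Hyper-V compute service does write to its own channels; a sketch of pulling recent entries with wevtutil, with the caveat that the exact channel name is an assumption and may differ by Windows build:

```shell
# Dump the 20 most recent entries (newest first) from the assumed
# HCS admin channel, in readable text form:
wevtutil qe Microsoft-Windows-Hyper-V-Compute-Admin /c:20 /rd:true /f:text

# If that channel name doesn't exist on your build, enumerate
# candidates instead:
wevtutil el | grep -i compute
```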
OK, I can at least confirm that the kernel appears to be the issue. If I drop in only the newer initrd.img, containers still start.
However, if I drop in only the newer kernel, the hang reproduces.
After a period of 4 minutes, I do get a log entry immediately after the entry for
Hmmm, this is going to be difficult to debug, as getting the dmesg output from the kernel in the v1 HCS schema (which Docker currently uses) is (a) broken, and unlikely to be fixed as we're aggressively moving to the v2 schema via the containerd runtime, and (b) might need internal/non-public tools to grab the serial console output, unless https://github.com/jstarks/npiperelay could be used as an alternative. Can you try, in the first instance, building uvmboot from https://github.com/microsoft/hcsshim/tree/master/internal/tools/uvmboot to see if a utility VM with that kernel will even boot on your machines? That tool uses the v2 schema, so we would be closer to getting the dmesg output that way. @kevpar might have a good example of how to use uvmboot with an arbitrary initrd and kernel.
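Building uvmboot is the usual Go toolchain dance; a sketch, with the caveat that the flags for pointing it at an arbitrary kernel and initrd should be taken from the tool's own help output rather than from here, since they are version-dependent:

```shell
# Fetch and build the tool from the hcsshim repository. Add
# GOOS=windows if cross-compiling from a non-Windows machine.
git clone https://github.com/microsoft/hcsshim
cd hcsshim
go build ./internal/tools/uvmboot

# Inspect the supported options before running; do not assume flag
# names for supplying a custom kernel / initrd.img:
./uvmboot.exe --help
```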
I have the same issue.
It works if I omit the 17 April changes and build from this commit:
It definitely appears to be a kernel issue. The initrd.img from the link above works fine. I'm running an internally built kernel on 4.19.24, which is fine; the LinuxKit 4.19.27 must be missing some config option. @rn, where is the config the kernel was built from? Were there any obvious changes recently? I'm pretty certain I have had a 4.19 LinuxKit kernel working.
@jhowardmsft the kernel config is at https://github.com/linuxkit/linuxkit/blob/master/kernel/config-4.19.x-x86_64. I don't think there were any significant changes to the config file recently.
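If a config option did change between builds, diffing the two revisions of config-4.19.x-x86_64 would surface it quickly. A self-contained sketch, with invented sample fragments standing in for the real files (in practice you would fetch each revision with git):

```shell
# Stand-ins for two revisions of the kernel config; real ones could
# be extracted with `git show <sha>:kernel/config-4.19.x-x86_64`.
printf 'CONFIG_FOO=y\nCONFIG_BAR=y\n' > /tmp/config-working
printf 'CONFIG_FOO=y\n'               > /tmp/config-broken

# diff exits non-zero when the files differ, hence `|| true`:
diff /tmp/config-working /tmp/config-broken || true
# prints: 2d1
#         < CONFIG_BAR=y
```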
4.19.27 actually works.

OK, it's a kernel problem, so I used the latest LCOW master and just changed the kernel image. The last image that works (doesn't hang) is

So it must be a change in between.

PS: I also tested
Thanks for the triage. The LinuxKit changes themselves are very unlikely candidates. More likely is that some code in the Linux kernel itself changed. I did a quick scan through the changes from |
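That kind of scan can be reproduced against the stable kernel tree; a sketch, where the two tags and the path filters are assumptions standing in for the last-working and first-failing versions identified by the triage:

```shell
# Fetch the stable 4.19 series, then list commits between two tags.
git clone --branch linux-4.19.y \
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux

GOOD=v4.19.27   # placeholder: last-working tag, per your own triage
BAD=v4.19.28    # placeholder: first-failing tag

# Narrow to subsystems plausibly relevant to an HCS utility VM
# (Hyper-V drivers, vsock, virtio); widen or drop the paths as needed.
git log --oneline "$GOOD".."$BAD" -- drivers/hv net/vmw_vsock drivers/virtio
```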
Ah, OK, thanks, but shouldn't master be reverted to a working kernel (4.19.28) until this can be resolved?
👍 It would be great if someone with commit rights could revert to a working kernel for now. Note: I was also having problems with Docker nightly builds no longer working (filed separately as moby/moby#39227). I will revisit that ticket once I've got a known-good latest LCOW image to run on.
I'm not sure if the latest merged PR #45 was intended to be consumed publicly, but the kernel image resulting from building at that SHA will not launch containers in my environment.
I left some comments already at #45 (comment) with more specifics of my environment / build process / etc.
I bring this up because of the problems I'm seeing with DNS resolution in LCOW + Alpine that I raised in microsoft/opengcs#303.
Maybe that's an issue that has already been addressed by the dependency bumps, but I can't use them at the moment.
To recap the linked comments, this is what my Docker environment looks like:
Thanks!