Disk corruption without zfs_vdev_disk_classic=1 for a single virtual machine. #16279
Comments
Well, all that did was push the errors off for a couple of hours instead of them happening immediately.
Do you happen to have the first lines of the kernel panic? Looks like they may have been cut off.
It looks like I cut it off. Here's a full one I think:
I get it for multiple CPUs. Here's one of the others:
It looks like that one got messed up being written to the syslog. There's a lot of them, but it looks like the same set of several errors repeating. And then I start seeing these:
The pool:
When this happens I get hundreds of read/write/checksum errors until the pool faults out. When I shut it down, load up the VMWare virtual machine with the same raw disks, and run a scrub, no errors are found. I've snapshotted the Proxmox VM and I'm going to try it out after updating to FC24, which is what another of our old systems is running, along with a similar zpool off the same JBOD that isn't having problems.
After updating it to FC24 with kernel 4.11.12-100.fc24.x86_64, I'm no longer seeing disk errors or reported corruption. The system has been stable for 18 hours.
I started looking into this last night, but have nothing to report yet. Your last comment is interesting. Can you clarify the specific RHEL/FC versions and kernel versions you tried, including the extended kernel versioning gunk that these kernels have, and whether or not they failed? At least, I'd like in/out version ranges.

I don't have a solid theory, just a few smells. Red Hat backports heavily from newer kernels into their shipped ones; for example, the "3.10" kernel that ships with EL7 actually has some stuff from 5.x pulled back into it. This sort of thing sometimes requires us to go to some lengths inside OpenZFS to keep things running well, because version numbers stop being a good indicator of when a particular feature or behaviour changed. For a couple of months, the new BIO submission code (…). The crash output you're showing has similar shapes, which is making me wonder if either 4.5 is not the right place to draw the line, or if RHEL/FC kernels are modified in such a way that they didn't get the changed behaviour until 4.11/FC24, or something of that kind. Or maybe a different problem entirely! But yeah, that's why specific versions would help. Thanks :)
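To illustrate the backporting point with a toy example: this is only a sketch, not OpenZFS code, and HAVE_NEW_BIO_BEHAVIOUR is a made-up macro standing in for the kind of configure-time feature probe that is used instead of a bare version check.

```c
/*
 * Illustrative only: why gating on kernel version numbers misleads on
 * distro kernels that backport behaviour, versus probing for the feature.
 * HAVE_NEW_BIO_BEHAVIOUR is hypothetical; a real build would define such a
 * macro from a configure-time compile test against the kernel headers.
 */
#include <stdio.h>
#include <linux/version.h>   /* LINUX_VERSION_CODE, KERNEL_VERSION */

int main(void)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 5, 0)
    /* Version gating assumes vanilla kernel history; a distro "3.10" or
     * "4.11" kernel may or may not actually behave like this. */
    puts("version check: expecting the newer behaviour");
#else
    puts("version check: expecting the older behaviour");
#endif

#ifdef HAVE_NEW_BIO_BEHAVIOUR
    /* A feature probe keys off what the kernel actually provides, so it
     * survives backports and withheld changes alike. */
    puts("feature probe: newer behaviour present");
#else
    puts("feature probe: newer behaviour absent");
#endif
    return 0;
}
```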
I keep forgetting that RH back-ports features and fixes. That can make diagnosis tricky. The machine that's having problems started with the Fedora 22 kernel (4.4 series). All of our CentOS 7, CentOS 8, Rocky Linux 8, and Rocky Linux 9 systems are OK, as are the Fedora 18 ones. I think those are all of the OS versions we are running with ZFS and pass-through disks.

The VM I'm having problems with has run for years under VMWare ESXi 5.5. I only had problems after migrating to Proxmox 8.2.2, so it has to either have something to do with an interaction between the guest OS and QEMU, or something between that and Proxmox itself. The disk errors I saw were only in the virtual machine's logs, and nothing showed up in the logs of the host server. Even though the virtual machine reported disk corruption, mounting the disks in another VM and scrubbing them revealed no corruption or errors.

Interestingly, one of the search results that came up when I was looking for similar issues was one I posted years ago, when a memory issue on NUMA systems was discovered and fixed.
@angstymeat how's it travelling since your last comment?
I upgraded the system to a newer kernel and it hasn't repeated. Normally, I would have tried to diagnose what the specific issue was, but I didn't have the time since we were trying to get off of VMWare as quickly as possible.
System information
Describe the problem you're observing
I'm migrating our virtual machines from VMWare ESXi to Proxmox 8.2.2. It has gone smoothly except for a single virtual machine, which is one of our older systems running proprietary software, which is why it is still running Fedora Core 23. I have four disks connected over a SAS JBOD that are passed through directly to the VM, the same configuration they had under VMWare.
This VM immediately began exhibiting disk corruption, reporting numerous read, write, and checksum errors. I immediately stopped it and booted up the VMWare version using the same disks and scrubbed the pool. No errors were reported.
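For reference, the kind of verification described above uses the standard zpool commands; "tank" below is a placeholder pool name:

```
zpool status -v tank   # per-vdev READ/WRITE/CKSUM error counters
zpool scrub tank       # re-read and checksum-verify all data in the pool
zpool status tank      # scrub progress and result appear here
zpool clear tank       # reset the error counters once the pool checks out
```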
I have at least a dozen other virtual machines that I have migrated to Proxmox, also using ZFS, most of them running the latest version, and none of them exhibit this issue. The VM configuration (hardware type, CPU type, etc.) is the same between all of them (memory size & CPU count vary).
Some of them are Fedora Core 18. Some are CentOS 7, some are CentOS 8. None of them have this issue.
The VM was originally FC22 when I migrated, and, thinking it was a kernel issue, I updated it to FC23 (the kernel went from 4.4 to 4.8); however, the issue persisted.
While searching I came across #15533, which exhibited the same symptoms, but I'm not running on top of LUKS or anything. When I applied zfs_vdev_disk_classic=1, the errors went away. Again, none of my other VMs need this option set. Other than the kernel versions, I can't figure out what is different or why this is happening. We either use older kernels like 3.10 under CentOS 7, or newer ones like 4.11 and above (FC24, CentOS 8, etc.).
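For anyone wanting to try the same workaround, a minimal sketch of how it is usually made persistent, assuming the parameter is set at zfs module load time (the file path is just the conventional modprobe.d location):

```
# /etc/modprobe.d/zfs.conf
# Use the classic BIO submission path for vdevs instead of the new one.
options zfs zfs_vdev_disk_classic=1
```

This takes effect the next time the zfs module is loaded, typically after a reboot; if zfs loads from an initramfs, that image may need to be regenerated as well.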
Describe how to reproduce the problem
Currently, I can get this to occur regularly using this particular zpool on this particular machine under Proxmox 8.2.2, but not under VMWare. I boot the machine and start running our software, which performs many small reads and writes in multiple threads (it is collecting seismic data from multiple sources) to memory-mapped files.
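A hypothetical reproducer along those lines might look like the sketch below; it is not the actual software, and the thread count, record size, and /tank/repro path are all assumptions:

```c
/*
 * Hypothetical reproducer sketch: several threads doing many small writes
 * into their own memory-mapped files on the affected pool, with periodic
 * msync() to push dirty pages toward the disks. Build with: cc -pthread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS   8
#define FILE_SIZE  (64 * 1024 * 1024)   /* 64 MiB per file */
#define RECORD     512                  /* small record size */
#define PASSES     1000

static void *writer(void *arg)
{
    long id = (long)arg;
    char path[64];
    /* Assumed mount point; the directory must already exist on the pool. */
    snprintf(path, sizeof(path), "/tank/repro/mmap-%ld.dat", id);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0) {
        perror(path);
        if (fd >= 0)
            close(fd);
        return NULL;
    }

    char *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return NULL;
    }

    for (int pass = 0; pass < PASSES; pass++) {
        /* Scatter small records across the mapping. */
        for (size_t off = 0; off + RECORD <= FILE_SIZE; off += RECORD * 128)
            memset(map + off, (pass + id) & 0xff, RECORD);
        msync(map, FILE_SIZE, MS_ASYNC);
    }

    munmap(map, FILE_SIZE);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```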
Include any warning/errors/backtraces from the system logs
Under FC22 I would see many errors in the system logs about losing connection to the disks. However, I did not hold onto those errors while I was debugging. These errors did not appear on the Proxmox host, only in the VM.
Under FC23 I get the following: