OSBuild without compression yields GRUB failures #3728
I've been investigating why a seemingly innocuous change (changing compression on the OSBuild-generated qemu qcow2) would cause disk images to not boot [1]. I think I have found the issue. I first wanted to make 100% sure that the files got written out over the virtiofs mount before the VM got shut down, so I added a `umount $workdir` to the process. But this ended up with a `umount: /srv/: target is busy.` error. When the supermin VM gets run we `cd "${workdir}"` at the end of supermin-init-prelude.sh. This has the effect of giving all spawned processes (including PID 1/init) a cwd of /srv/:

```
bash-5.2# lsof /srv
COMMAND   PID USER  FD  TYPE DEVICE SIZE/OFF     NODE NAME
init        1 root cwd   DIR   0,26     4096 10485829 /srv
kthreadd    2 root cwd   DIR   0,26     4096 10485829 /srv
pool_work   3 root cwd   DIR   0,26     4096 10485829 /srv
kworker/R   4 root cwd   DIR   0,26     4096 10485829 /srv
...
```

Which means it's unlikely that the virtiofs mount ever gets cleanly unmounted. Let's rework things here so that the actual work gets spawned in a subshell, preventing `init` from having a cwd on the virtiofs mount. We also add an `umount` of the cache qcow2 (if it exists) and the virtiofs mount to strengthen our chances of a clean unmount.

[1] coreos#3728
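The subshell fix can be illustrated outside of supermin: when the chdir happens in a child process, only the child's cwd lands on the work directory, and the parent (standing in for PID 1) never pins the mount. A minimal sketch of the idea (hypothetical paths, not the actual supermin-init-prelude.sh code):

```python
import os
import subprocess
import tempfile

# Stand-in for the virtiofs workdir (/srv in the real setup).
workdir = tempfile.mkdtemp()

parent_cwd_before = os.getcwd()

# The "subshell": chdir happens in a child process, so only the child's
# cwd lands on the workdir. The child prints its cwd to prove it ran there.
child = subprocess.run(
    ["python3", "-c",
     "import os, sys; os.chdir(sys.argv[1]); print(os.getcwd())",
     workdir],
    capture_output=True, text=True, check=True,
)

# The parent's cwd is untouched, which is what lets a later
# `umount $workdir` succeed instead of failing with "target is busy".
assert os.path.realpath(child.stdout.strip()) == os.path.realpath(workdir)
assert os.getcwd() == parent_cwd_before
print("parent cwd unchanged")
```

The same reasoning applies to any long-lived process: mounts stay busy as long as some process has a cwd (or open fd) inside them, so keeping the chdir confined to a short-lived child is what makes the clean unmount possible.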
We previously did this in a different way (2a8d1e6) but then had to revert it (39fdd61) because it caused images to not boot [1]. The root cause appears to have been the virtiofs mount not being unmounted cleanly from the supermin VM. That is now fixed, so let's switch back to not compressing, since we rely on our outer compression [2].

[1] coreos#3728
[2] coreos/fedora-coreos-tracker#1653 (comment)
Still trying to figure out data corruption from coreos#3728
Some more info: it appears to be something with the ext4 filesystem when OSBuild runs the
OK, further update: I became increasingly suspicious of the corruption happening earlier, in the OSBuild steps themselves, rather than when copying the file out of the supermin VM. I decided to just add a sanity check into the `org.osbuild.qemu` stage:

```diff
diff --git a/stages/org.osbuild.qemu b/stages/org.osbuild.qemu
index 642b5146..54e707d4 100755
--- a/stages/org.osbuild.qemu
+++ b/stages/org.osbuild.qemu
@@ -219,22 +219,34 @@ def main(inputs, output, options):
     if coroutines:
         print(f"qemu-img coroutines: {coroutines}")
         extra_args += ["-m", coroutines]
 
     cmd = [
         "qemu-img", "convert",
         "-O", fmt["type"],
         *extra_args,
         source, target
     ]
 
     subprocess.run(
         cmd, check=True
     )
 
+    # Sanity check that the image is 100% identical to the source
+    cmd = [
+        "qemu-img", "compare",
+        "-f", "raw",
+        "-F", fmt["type"],
+        source, target
+    ]
+
+    subprocess.run(
+        cmd, check=True
+    )
+
     return 0
 
 
 if __name__ == '__main__':
     args = osbuild.api.arguments()
     r = main(args["inputs"], args["tree"], args["options"])
     sys.exit(r)
```

and sure enough, one of the first runs I went through complained:
Talked to @jlebon about this. We think (actually more @jlebon; I was just nodding my head to everything he said) that this might be a bug with reflinks on XFS. The cache qcow2 is an XFS filesystem, and this is a theory that could explain why we see issues only after switching off compression. This particular issue reminded @jlebon of #935, which we never fully got to the bottom of.
We think there might be some XFS reflink issues when we run the OSBuild org.osbuild.qemu stage with compression: false. See coreos#3728 (comment)
I'm having a lot of trouble investigating [1]. Let's add this here as a small sanity check for now. [1] coreos/coreos-assembler#3728
Let's see if we can reproduce coreos/coreos-assembler#3728
So I don't know if it's this, but the classic failure mode when you create a disk image and then boot from it shortly afterwards is:
If this is the scenario, then you need to modify your image creation pipeline so that it does an

Here's how virt-builder does that:
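For reference, a hedged sketch of the general flush technique (not virt-builder's actual code, which is linked above): fsync the image file itself, then fsync its parent directory so the directory entry is durable too.

```python
import os

def flush_image(path: str) -> None:
    """Flush a freshly written disk image to stable storage: fsync the
    file's data, then fsync the containing directory so the name itself
    is durable before anything boots from the image."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

The heavier hammer, which a shell-based image pipeline would typically reach for, is a plain `sync` (or `os.sync()` in Python) to flush everything system-wide.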
@rwmjones in the failure I detailed in #3728 (comment) all that is happening is:
and you can see the failure:
Would you anticipate this to be a kernel page cache issue?
No, that wouldn't be the kernel page cache issue. It's extremely suspicious, though. Usually qcow2 and qemu-img are rock-solid tools.
Yes :). I'd like to take a moment here to thank you and all the other maintainers of those tools over the years. You've truly built something that is load-bearing for half of the internet and the world's economy at this point. I currently don't think it's an issue in qemu-img, but I'll let you know if that changes. I do appreciate you weighing in here, though.
OK, I was asked by @sandeen to provide a disk image containing an XFS filesystem with the good and bad generated disk images on it. I had to make a slight modification to OSBuild to make it save off the bad disk image on failure:

```diff
diff --git a/osbuild/pipeline.py b/osbuild/pipeline.py
index af4c3944..93184bf6 100644
--- a/osbuild/pipeline.py
+++ b/osbuild/pipeline.py
@@ -358,6 +358,8 @@ class Pipeline:
             results["stages"].append(r)
 
             if not r.success:
+                print(f"XXXX failed tree object at {tree.path} committing")
+                object_store.commit(tree, stage.id)
                 cleanup(build_tree, tree)
                 results["success"] = False
                 return results
```
Then I ran this in a loop:
It took a couple of iterations, but it did fail with:
Here's a link to the disk image: cache2.qcow2.zst.
Within the disk XFS filesystem the files for input and output of
I was also asked to run
As a lazy way to test this in other environments, I opened a test PR against OSBuild that would then run in their CI. Sure enough, CI failed on two environments:
Thanks. So the disk.img does indeed have many reflinked extents, for starters (extents with flags 100000).

Not sure what to make of this yet... my first thought is that it's possibly related to preallocated/unwritten extents in the original disk.img, but that's just a guess.
Hi, can you upload the pre-convert raw image somewhere, or is there an easy way for me to get it from some pipeline (like osbuild/osbuild#1594)? (I've tried downloading the cache2.qcow2.zst that you linked, but the download is very slow for me (~30 kB/s), so it would take more than a day to download.)
Talked with @XanClic and she was able to get the file downloaded.
Some more context here about what OSBuild is doing that might be unique. OSBuild builds things in stages. We have stages that do one or two things, but the entire output from a stage gets saved and copied into the next stage for the next work item. A simplified version of what is going on here is:
So what's happening here is that a copy of the original disk image gets handed from stage to stage. This series of steps may be what is contributing to the problem here. Also note this isn't 100% reproducible: sometimes I can reproduce it every time, and other times only 10% or so.
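The stage-to-stage handoff can be approximated with a small sketch (hypothetical layout, not OSBuild's real object store; assumes GNU coreutils `cp`): a sparse raw image is produced by one stage, and the next stage receives it via `cp --reflink=auto`, which reflinks on filesystems that support it (XFS with reflink, btrfs) and silently falls back to a plain copy elsewhere.

```python
import os
import subprocess
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "disk.img")

# Stage N: create a sparse raw image with one real data block.
with open(src, "wb") as f:
    f.truncate(8 * 1024 * 1024)      # 8 MiB logical size, mostly holes
    f.seek(1024 * 1024)
    f.write(b"\xaa" * 4096)          # one data block at the 1 MiB mark

# Stage N+1: receive the input as a (possibly reflinked) copy.
dst = os.path.join(tmp, "disk-copy.img")
subprocess.run(
    ["cp", "--reflink=auto", "--sparse=auto", src, dst],
    check=True,
)

assert os.path.getsize(dst) == os.path.getsize(src)
```

On a reflink-capable filesystem the copy shares extents with the original, which is exactly the situation (reflinked extents plus later COW writes) that the corruption seems to depend on.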
More data: I patched OSBuild to not
I ran the loop 30 times and didn't see any failures.
So far, I haven’t found any reproducer on my system (and I’d like to have some simple local reproducer :)). What I’ve tried that didn’t work (50 iterations for most items, 1000 for the last):
However, I haven't scrubbed the image between runs (i.e. basically `rm cache2.qcow2 && zstd -d cache2.qcow2.zst`); that's one thing I still ought to try (but it will of course take much longer).
I've done that now for the VM case (it doesn't take that long with hot-plugging and -unplugging), and still couldn't reproduce it (50 runs).
OK, I might have a more minimal reproducer that doesn't include OSBuild. I took a look at exactly what the OSBuild stages were doing when copying things around and wrote a script to simulate it more closely. The summary is that we are now:
Using the filesystem I uploaded before, mount it and cd into the root of that filesystem and then run this script:
Nice job on the reproducer! Not sure what to make of this yet.
Thanks, that works indeed! With debugging information put into qemu-img, I can see that it believes the offset is zero, whereas it is not zero when you actually inspect it. qemu-img gets this information via SEEK_HOLE/SEEK_DATA. There seems to be some inconsistency in this hole information. Fully reading disk.img seems to update this information, so putting a Consequently, neither qcow2 nor qemu-img nor what target filesystem you use seem to be of importance, but beware that if you use
(Side note, I actually have no idea why it does report a mismatch when using qcow2. The hole information in disk.img should still be wrong, making So this says both images are identical, but if you use So replacing
(Again, note that using
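To make the failure mode concrete: what qemu-img is essentially doing is walking the file with SEEK_DATA/SEEK_HOLE and copying only the reported data extents. A small sketch of that walk (illustrative, not qemu's actual code) shows why wrong hole information silently drops data: a skipped range is simply never read.

```python
import os

def map_extents(path):
    """Return a list of (start, end) data extents as reported by the
    filesystem via SEEK_DATA/SEEK_HOLE. Any real data inside a range the
    filesystem misreports as a hole will simply never appear here."""
    size = os.path.getsize(path)
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: only holes remain up to end of file
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            extents.append((data, hole))
            offset = hole
    finally:
        os.close(fd)
    return extents
```

For a fully written file this returns a single extent covering the whole file. It is when these reported extents disagree with the actual data (as described above for disk.img) that a seek-based copy and a full read of the same file produce different results.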
A lot to unpack there. Thanks @XanClic! I'm not sure quite what to make of it all, but there is one thing you mention in there that seems interesting:
I'd argue that the target filesystem is of importance here. We switched from

I'm not saying it is 100% filesystem related, but there is at least some data pointing towards it being related.
Oh, absolutely, the (source) filesystem does matter. I just meant the target filesystem of the copy/convert operation (for qemu.qcow2), which I changed to tmpfs, and I still saw the problem. I do believe that the (seemingly) incorrect hole information for disk.img is XFS-related.
Patch from dchinner: https://lore.kernel.org/linux-xfs/20240220224928.3356-1-david@fromorbit.com/T/#u Still trying to craft a simple reproducer though. |
A data corruption problem was reported by CoreOS image builders when using reflink based disk image copies and then converting them to qcow2 images. The converted images failed the conversion verification step, and it was isolated down to the fact that qemu-img uses SEEK_HOLE/SEEK_DATA to find the data it is supposed to copy.

The reproducer allowed me to isolate the issue down to a region of the file that had overlapping data and COW fork extents, and the problem was that the COW fork extent was being reported in its entirety by xfs_seek_iomap_begin() and so skipping over the real data fork extents in that range.

This was somewhat hidden by the fact that 'xfs_bmap -vvp' reported all the extents correctly, and reading the file completely (i.e. not using seek to skip holes) would map the file correctly and all the correct data extents are read. Hence the problem is isolated to just the xfs_seek_iomap_begin() implementation.

Instrumentation with trace_printk made the problem obvious: we are passing the wrong length to xfs_trim_extent() in xfs_seek_iomap_begin(). We are passing the end_fsb, not the maximum length of the extent we want to trim the map to. Hence the COW extent map never gets trimmed to the start of the next data fork extent, and so the seek code treats the entire COW fork extent as unwritten and skips entirely over the data fork extents in that range.

Link: coreos/coreos-assembler#3728
Fixes: 60271ab ("xfs: fix SEEK_DATA for speculative COW fork preallocation")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
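The shape of the bug can be modeled without any kernel code. The toy `trim_extent` below is a hypothetical simplification (not real XFS code) of a trim helper that takes a start and a *length* for the trim window; passing the window's end offset as the length, as the buggy caller did, leaves the COW extent covering blocks that actually belong to data fork extents.

```python
def trim_extent(ext_start, ext_len, win_start, win_len):
    """Trim extent [ext_start, ext_start+ext_len) to the window
    [win_start, win_start+win_len); return (start, length)."""
    start = max(ext_start, win_start)
    end = min(ext_start + ext_len, win_start + win_len)
    return (start, max(0, end - start))

# A COW fork extent spans blocks 100..200, but the next real data fork
# extent starts at block 120, so the seek window is blocks 100..120.
offset_fsb, end_fsb = 100, 120

# Correct: pass the window *length* (end_fsb - offset_fsb).
good = trim_extent(100, 100, offset_fsb, end_fsb - offset_fsb)

# The bug: pass end_fsb itself as the length, so the window runs to
# block 220 and the COW extent is never trimmed; seek then treats
# blocks 120..200 as unwritten and skips the data fork extents there.
bad = trim_extent(100, 100, offset_fsb, end_fsb)

print(good, bad)  # (100, 20) (100, 100)
```

The untrimmed result covers the whole 100-block COW extent, which is exactly the "entire COW fork extent reported as unwritten" behavior described in the commit message.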
Landed upstream in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4b2f459d86252619448455013f581836c8b1b7da

Hasn't been backported to any stable branches yet.
We switched to ext4 to get away from using reflinks (see [1]), but the problem was actually specific to XFS. It's now been fixed upstream in [2], but we need to wait for it to get backported to stable branches. Let's switch to btrfs for now so we can still get the space savings of reflinks while avoiding the reflink bug in XFS.

[1] coreos#3728
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4b2f459d86252619448455013f581836c8b1b7da
This was backported to the Fedora 6.7 kernel series in https://gitlab.com/cki-project/kernel-ark/-/commit/c0412c5250f7bdb2ea61b61f27eada0d8a135692 and should land in F39 with
The XFS issue in the kernel is now fixed in F39: coreos#3728 (comment)
With the new kernel landed and our cache qcow2 switched back to XFS, we can now close this.
When we switched OSBuild over to not compressing, we started getting failures where images wouldn't boot, like: