Refreshing BTRFS backed instances is causing a segfault #13085

Closed
roosterfish opened this issue Mar 8, 2024 · 11 comments · Fixed by #13133

@roosterfish
Contributor

When refreshing Btrfs-backed instances, the command hangs indefinitely and never returns. Both latest/edge and 5.0/edge of LXD are affected.

The error occurs at https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/driver_btrfs_volumes.go#L882 when running the btrfs receive command.

This was first discovered in canonical/lxd-ci#101.

Steps to reproduce

julian@thinkpad:~$ lxc storage create b btrfs
Storage pool b created
julian@thinkpad:~$ lxc launch ubuntu:jammy v1 --vm -s b
Creating v1
Starting v1                               
julian@thinkpad:~$ lxc snapshot v1
julian@thinkpad:~$ lxc cp v1 v2
julian@thinkpad:~$ lxc cp v1 v2 --refresh # doesn't return
root@thinkpad:~# dmesg -w
[34150.527621] btrfs[656287]: segfault at 56 ip 000055a07a36c3b4 sp 00007ffcc2149c60 error 4 in btrfs[55a07a32a000+71000] likely on CPU 12 (core 24, socket 0)
[34150.527663] Code: c0 79 48 e8 2e eb fb ff 4c 89 f6 48 8d 3d dd 31 05 00 44 8b 20 31 c0 41 f7 dc e8 6f 33 fe ff eb 2a 41 83 cd ff 48 85 ed 74 28 <48> 8b 7d 58 e8 63 ea fb ff 48 89 ef e8 5b ea fb ff 41 83 fd ff 74
root@thinkpad:~# journalctl -u snap.lxd.daemon -f
Mär 08 18:40:17 thinkpad lxd.daemon[656287]: ERROR: clone: did not find source subvol

The same happens for containers, but only while the source container is running; if it is stopped, the error doesn't occur. For VMs it happens regardless of whether the VM is running or stopped.
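
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the names `repro_commands` and `run_repro` and the 300-second timeout are illustrative assumptions, not part of LXD. The lxc invocations themselves are the ones from the transcript, and the timeout is there because the final refresh may never return.

```python
import subprocess


def repro_commands(pool="b", image="ubuntu:jammy", src="v1", dst="v2", vm=True):
    """Build the lxc invocations from the reproduction steps, in order."""
    launch = ["lxc", "launch", image, src]
    if vm:
        launch.append("--vm")
    launch += ["-s", pool]
    return [
        ["lxc", "storage", "create", pool, "btrfs"],
        launch,
        ["lxc", "snapshot", src],
        ["lxc", "cp", src, dst],
        ["lxc", "cp", src, dst, "--refresh"],  # the step that hangs
    ]


def run_repro(timeout=300, **kwargs):
    """Hypothetical helper: run each step; a hang surfaces as subprocess.TimeoutExpired."""
    for cmd in repro_commands(**kwargs):
        subprocess.run(cmd, check=True, timeout=timeout)
```

Calling `run_repro(vm=False)` would exercise the container case; the source container has to stay running for the error to show up.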

@tomponline
Member

Strange that it's not failing the main test suite?

@tomponline added the "Bug" (Confirmed to be a bug) label on Mar 8, 2024
@tomponline added this to the lxd-6.1 milestone on Mar 8, 2024
@tomponline changed the title from "Refreshing Btrfs backed instances is causing a segfault" to "Refreshing BTRFS backed instances is causing a segfault" on Mar 11, 2024
@tomponline self-assigned this on Mar 11, 2024
@roosterfish
Contributor Author

There is a potential fix upstream which isn't yet merged but I can confirm the error cannot be reproduced when Btrfs is built from kdave/btrfs-progs#643.
I'll test with a more recent version of Btrfs too in order to narrow it down a bit more.

@tomponline
Member

I've got a fix in the works for snapshot cleanup too.

@tomponline
Member

> There is a potential fix upstream which isn't yet merged but I can confirm the error cannot be reproduced when Btrfs is built from kdave/btrfs-progs#643. I'll test with a more recent version of Btrfs too in order to narrow it down a bit more.

Do you think this ever worked and is a regression in the btrfs tooling?

@tomponline
Member

Ah, seems like it's caused by certain content inside the instance: kdave/btrfs-progs#606 (comment)

@roosterfish
Contributor Author

> Ah, seems like it's caused by certain content inside the instance: kdave/btrfs-progs#606 (comment)

Yeah, this also caught my attention. What I'm wondering is what might cause a write in our case, since we are freezing the FS.

@roosterfish
Contributor Author

So it looks like this started failing from btrfs-progs release v5.14.91 onwards. Version 5.14.2 is the last one that doesn't raise the error.
In the CHANGES file for this specific release I cannot identify anything related to this. There aren't any additional release notes for this tag on GitHub.

In the snap we currently use the btrfs-progs package installed via apt, version 5.16.2.
If we went back to version 5.14.2, we could fix the problem for new storage pools, but existing storage pools would continue to fail on the --refresh operation. The only change is that the segfault message no longer appears, but btrfs receive still returns ERROR: clone: did not find source subvol.
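
The version boundary described above could be expressed as a small check. This is a sketch, assuming version strings look like `btrfs-progs v5.16.2` or `5.14.2`; `parse_version` and `is_affected` are hypothetical names. The check has no upper bound because nothing here says which release, if any, contains the fix.

```python
import re

# Per the findings above: v5.14.91 is the first release showing the error,
# and 5.14.2 is the last one that does not.
FIRST_FAILING = (5, 14, 91)


def parse_version(text):
    """Extract a numeric version tuple from a string such as 'btrfs-progs v5.16.2'."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_affected(version_text):
    """True if the version is at or after the first release observed to fail."""
    return parse_version(version_text) >= FIRST_FAILING
```

Note that a pool created with an affected version would still fail on refresh after a downgrade, so a check like this only says something about newly created pools.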

@tomponline
Member

> So it looks like this started failing from btrfs-progs release v5.14.91 onwards. Version 5.14.2 is the last one that doesn't raise the error. In the CHANGES file for this specific release I cannot identify anything related to this. There aren't any additional release notes for this tag on GitHub.
>
> In the snap we currently use the btrfs-progs package installed via apt, version 5.16.2. If we went back to version 5.14.2, we could fix the problem for new storage pools, but existing storage pools would continue to fail on the --refresh operation. The only change is that the segfault message no longer appears, but btrfs receive still returns ERROR: clone: did not find source subvol.

Any ideas why our main test suite doesn't experience this? Is it because it's using a busybox image rather than ubuntu?

@tomponline
Member

Hopefully it'll be fixed when we switch to core24

@roosterfish
Contributor Author

> > So it looks like this started failing from btrfs-progs release v5.14.91 onwards. Version 5.14.2 is the last one that doesn't raise the error. In the CHANGES file for this specific release I cannot identify anything related to this. There aren't any additional release notes for this tag on GitHub.
> > In the snap we currently use the btrfs-progs package installed via apt, version 5.16.2. If we went back to version 5.14.2, we could fix the problem for new storage pools, but existing storage pools would continue to fail on the --refresh operation. The only change is that the segfault message no longer appears, but btrfs receive still returns ERROR: clone: did not find source subvol.
>
> Any ideas why our main test suite doesn't experience this? Is it because it's using a busybox image rather than ubuntu?

Interesting idea. I have just tried it several times and it consistently doesn't happen with the busybox image:

julian@thinkpad:~/dev/lxd/test/deps$ lxc launch 7c18be75dd3c c1 -s b # 7c18be75dd3c is busybox
Creating c1
Starting c1                                
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2
julian@thinkpad:~/dev/lxd/test/deps$ lxc snapshot c1
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c1 c2 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc launch ubuntu:jammy c3 -s b
Creating c3
Starting c3
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c3 c4
julian@thinkpad:~/dev/lxd/test/deps$ lxc snapshot c3
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c3 c4 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c3 c4 --refresh
Error: Refresh instance: Failed BTRFS receive: signal: segmentation fault (core dumped) (ERROR: clone: did not find source subvol)
julian@thinkpad:~/dev/lxd/test/deps$ 
julian@thinkpad:~/dev/lxd/test/deps$ 
julian@thinkpad:~/dev/lxd/test/deps$ lxc launch 7c18be75dd3c c5 -s b
Creating c5
Starting c5
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c5 c6
julian@thinkpad:~/dev/lxd/test/deps$ lxc snapshot c5
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c5 c6 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c5 c6 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c5 c6 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ lxc cp c5 c6 --refresh
julian@thinkpad:~/dev/lxd/test/deps$ 
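
The two failure modes seen so far (a segfault together with the clone error on newer btrfs-progs, and the clone error alone after going back to 5.14.2 on an existing pool) could be told apart from the error text. A sketch with a hypothetical `classify_receive_error` helper; the matched substrings are taken from the messages quoted in this issue.

```python
def classify_receive_error(message):
    """Return 'segfault', 'clone-error', or 'other' for a btrfs receive failure message."""
    if "segmentation fault" in message:
        return "segfault"
    if "did not find source subvol" in message:
        return "clone-error"
    return "other"
```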

@tomponline
Member

tomponline commented Mar 12, 2024

Right so it appears to be due to the contents of the ubuntu image then (as well as the btrfs bug itself).

roosterfish added a commit to roosterfish/lxd-ci that referenced this issue Mar 13, 2024
This is related to the bug canonical/lxd#13085.

Signed-off-by: Julian Pelizäus <julian.pelizaeus@canonical.com>