
btrfs receive occasionally fails to find parent subvolume when data flows in the reverse direction #606

Closed
ettavolt opened this issue Mar 22, 2023 · 12 comments

@ettavolt (Contributor)

I'm trying to carry a user profile between two laptops via backups on an external disk (so that I have sync and backups at the same time).

For that I'm using the following structure:

profile (btrfs subvolume):
    /home/ask-rbr
btrfs ro snapshots:
    /home/.ask-rbr-snapshots/@yyyy-MM-dd[Thh[:mm[:ss]]]
external drive, btrfs ro snapshots:
    /run/rbrssd/ask-arch-profile/@yyyy-MM-dd[Thh[:mm[:ss]]]

Both systems are running Arch with:
linux 6.2.7.arch1-1
btrfs-progs 6.2.1-1
(one of the systems might be lagging by a week).

This is an extension of an earlier backup-only setup, in which send | receive went only in the profile → external drive direction.

The problem happens when I try to send a recent backup to the laptop to recreate the profile (making it a rw snapshot of the most recent ro snapshot).
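
For concreteness, here is a minimal sketch of the two directions in Python (illustrative only, not my exact script; the reverse-direction call matches the send and receive commands shown below, the -p parents are examples):

#!/usr/bin/env python3
# Illustrative sketch of the two sync directions; the paths are the
# ones from the layout above.
import subprocess

def send_incremental(parent, snapshot, dest):
    # Equivalent of: btrfs send -p <parent> <snapshot> | btrfs receive <dest>
    send = subprocess.Popen(["btrfs", "send", "-p", parent, snapshot],
                            stdout=subprocess.PIPE)
    subprocess.run(["btrfs", "receive", dest], stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("btrfs send failed")

# Backup direction (worked for years): laptop -> external drive.
send_incremental("/home/.ask-rbr-snapshots/@2023-03-16T11",
                 "/home/.ask-rbr-snapshots/@2023-03-21",
                 "/run/rbrssd/ask-arch-profile")

# Reverse direction (the failing one): external drive -> the other laptop.
send_incremental("/run/rbrssd/ask-arch-profile/@2023-03-16T11",
                 "/run/rbrssd/ask-arch-profile/@2023-03-21",
                 "/home/.ask-rbr-snapshots")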

External drive:

ask-arch-profile/@2023-03-16T11
 Name: @2023-03-16T11
 UUID: 88e0644b-6542-384f-9498-980582b29226
 Parent UUID: -
 Received UUID: cab9c40b-98dd-ca49-98d4-6271e8cf4fac
 Creation time: 2023-03-21 17:42:34 -0400
 Subvolume ID: 291
 Generation: 212
 Gen at creation: 209
 Parent ID: 5
 Top level ID: 5
 Flags: readonly
 Send transid: 42
 Send time: 2023-03-21 17:42:34 -0400
 Receive transid: 210
 Receive time: 2023-03-21 17:43:05 -0400
 Snapshot(s):
 ask-arch-profile/@2023-03-21

ask-arch-profile/@2023-03-21
	Name: 			@2023-03-21
	UUID: 			918fa20d-d6a5-724e-8f09-5b4aeeb22844
	Parent UUID: 		88e0644b-6542-384f-9498-980582b29226
	Received UUID: 		4bb66e19-4e51-9e49-ac0f-1574419be949
	Creation time: 		2023-03-21 17:43:24 -0400
	Subvolume ID: 		292
	Generation: 		215
	Gen at creation: 	212
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			readonly
	Send transid: 		1553
	Send time: 		2023-03-21 17:43:24 -0400
	Receive transid: 	213
	Receive time: 		2023-03-21 17:43:25 -0400
	Snapshot(s):

Hexdump of the first 128 bytes of btrfs send -p ask-arch-profile/@2023-03-16T11 ask-arch-profile/@2023-03-21:

00000000 62 74 72 66 73 2d 73 74 72 65 61 6d 00 01 00 00 |btrfs-stream....|
00000010 00 4f 00 00 00 02 00 17 b5 d6 ec 0f 00 0b 00 40 |.O.............@|
00000020 32 30 32 33 2d 30 33 2d 32 31 01 00 10 00 4b b6 |2023-03-21....K.|
00000030 6e 19 4e 51 9e 49 ac 0f 15 74 41 9b e9 49 02 00 |n.NQ.I...tA..I..|
00000040 08 00 d7 00 00 00 00 00 00 00 14 00 10 00 ca b9 |................|
        ↓↓↓↓↓↓↓↓↓↓↓Received UUID of the -p snapshot↑↑↑↑↑
00000050 c4 0b 98 dd ca 49 98 d4 62 71 e8 cf 4f ac 15 00 |.....I..bq..O...|
00000060 08 00 d4 00 00 00 00 00 00 00 34 00 00 00 14 00 |..........4.....|
00000070 5c 74 75 5b 0f 00 00 00 0b 00 0c 00 96 42 cf 63 |\tu[.........B.c|
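
To make the dump easier to read, here is a minimal decoder sketch (Python; command and attribute IDs taken from the kernel's fs/btrfs/send.h; an illustration, not part of btrfs-progs). Fed the 128 bytes above, it prints the SNAPSHOT command with PATH=@2023-03-21, CTRANSID=215, CLONE_UUID=cab9c40b… (the Received UUID of the -p snapshot, as marked) and CLONE_CTRANSID=212, matching the receive output further down:

#!/usr/bin/env python3
# Minimal v1 send-stream decoder sketch; IDs are from the kernel's
# fs/btrfs/send.h. Reads the stream from stdin.
import struct, sys

CMDS = {1: "SUBVOL", 2: "SNAPSHOT", 16: "CLONE", 20: "UTIMES", 21: "END"}
ATTRS = {1: "UUID", 2: "CTRANSID", 15: "PATH",
         20: "CLONE_UUID", 21: "CLONE_CTRANSID"}

def decode(data, max_cmds=1):
    assert data[:13] == b"btrfs-stream\0", "not a send stream"
    print("stream version", struct.unpack_from("<I", data, 13)[0])
    off = 17                                   # first command header
    for _ in range(max_cmds):
        length, cmd, _crc = struct.unpack_from("<IHI", data, off)
        print("cmd", CMDS.get(cmd, cmd), "with", length, "bytes of attributes")
        off += 10
        end = off + length
        while off + 4 <= min(end, len(data)):  # walk the TLV attributes
            atype, alen = struct.unpack_from("<HH", data, off)
            val = data[off + 4:off + 4 + alen]
            if atype == 15:                    # PATH is a string
                shown = val.decode()
            elif atype in (2, 21):             # *CTRANSID: little-endian u64
                shown = int.from_bytes(val, "little")
            else:
                shown = val.hex()
            print("  ", ATTRS.get(atype, atype), "=", shown)
            off += 4 + alen
        off = end

decode(sys.stdin.buffer.read())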

Laptop backups:

home/.ask-rbr-snapshots/@2023-03-16T11
 Name: @2023-03-16T11
 UUID: cab9c40b-98dd-ca49-98d4-6271e8cf4fac
 Parent UUID: 748d38ea-bd22-e446-8800-8be6fea2a203
 Received UUID: -
 Creation time: 2023-03-17 15:46:46 -0400
 Subvolume ID: 292
 Generation: 215166
 Gen at creation: 211513
 Parent ID: 5
 Top level ID: 5
 Flags: readonly
 Send transid: 0
 Send time: 2023-03-17 15:46:46 -0400
 Receive transid: 0
 Receive time: -
 Snapshot(s):

And the output of btrfs receive /home/.ask-rbr-snapshots:

At snapshot @2023-03-21
receiving snapshot @2023-03-21 uuid=4bb66e19-4e51-9e49-ac0f-1574419be949, ctransid=215 parent_uuid=cab9c40b-98dd-ca49-98d4-6271e8cf4fac, parent_ctransid=212
write …
# No ERROR among these write …
ERROR: clone: did not find source subvol

and 1 as the exit code.

However, the snapshot has been transferred correctly:

home/.ask-rbr-snapshots/@2023-03-21
	Name: 			@2023-03-21
	UUID: 			2abb4616-4d70-e849-b5d6-a2c44a5b035a
	Parent UUID: 		cab9c40b-98dd-ca49-98d4-6271e8cf4fac
	Received UUID: 		-
	Creation time: 		2023-03-22 15:22:01 -0400
	Subvolume ID: 		299
	Generation: 		215232
	Gen at creation: 	215231
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Send transid: 		0
	Send time: 		2023-03-22 15:22:01 -0400
	Receive transid: 	0
	Receive time: 		-
	Snapshot(s):

I think this is unrelated to #538, because the transfer works seamlessly in the other direction.

Might this happen because I had a break in day-to-day snapshots before 03-18? (Note that this snapshot has no Parent UUID.)

@kdave kdave added the bug label Mar 31, 2023

ettavolt commented Apr 2, 2023

OK, I've put together a test with all this snapshot mangling. And it worked flawlessly (see log.txt). 🙄

Here's what it retains in the end:

@ /tmp % sudo btrfs subvolume list -uqR mnt/pc1
ID 260 gen 14 top level 5 parent_uuid e50ce334-51b4-4f4e-a6f2-41b7585eb419 received_uuid -                                    uuid fa6df392-9918-be4f-b1d1-a954dcf9ed3d path profile
ID 261 gen 14 top level 5 parent_uuid fa6df392-9918-be4f-b1d1-a954dcf9ed3d received_uuid -                                    uuid 5799a60f-d89c-ea48-ab28-85ce3af2c5a9 path .snapshots/day5
@ /tmp % sudo btrfs subvolume list -uqR mnt/external
ID 256 gen 10 top level 5 parent_uuid -                                    received_uuid d307e0a9-8164-a14f-ba0f-d0c1cc69f9cd uuid fad6ef45-8625-6a4b-9c36-994e27bc8afe path .snapshots/day1
ID 257 gen 14 top level 5 parent_uuid fad6ef45-8625-6a4b-9c36-994e27bc8afe received_uuid db8a6897-b31f-6a49-8e05-eec1cd484a9f uuid 9bdcf876-8a0a-744d-a12e-c464445f4ed9 path .snapshots/day2
ID 258 gen 17 top level 5 parent_uuid 9bdcf876-8a0a-744d-a12e-c464445f4ed9 received_uuid 9c532cf9-99fe-434c-b8d0-58afe4f929f5 uuid 5c804b03-70e8-274f-ab1b-4e3d70e9a4ad path .snapshots/day3
ID 259 gen 21 top level 5 parent_uuid 5c804b03-70e8-274f-ab1b-4e3d70e9a4ad received_uuid 27747bce-064b-4c4e-beb0-2ef56b5f5262 uuid 1e498523-eb18-8a42-a141-a0a15f487500 path .snapshots/day4
ID 260 gen 25 top level 5 parent_uuid 1e498523-eb18-8a42-a141-a0a15f487500 received_uuid 5799a60f-d89c-ea48-ab28-85ce3af2c5a9 uuid 389f14ac-c19d-2c4a-af12-b7b10f61d7d6 path .snapshots/day5
ID 261 gen 26 top level 5 parent_uuid 389f14ac-c19d-2c4a-af12-b7b10f61d7d6 received_uuid bdab90d4-4cd4-d041-bf52-2400cb233c84 uuid 4e817707-7015-f242-a136-d95defa299ab path .snapshots/day6
@ /tmp % sudo btrfs subvolume list -uqR mnt/pc2     
ID 261 gen 17 top level 5 parent_uuid d2e487dc-9b08-ba43-a0a0-2ff9268d1346 received_uuid -                                    uuid 6ad226a5-9ccd-7945-9d6b-e7a209afb134 path profile
ID 262 gen 17 top level 5 parent_uuid 6ad226a5-9ccd-7945-9d6b-e7a209afb134 received_uuid -                                    uuid bdab90d4-4cd4-d041-bf52-2400cb233c84 path .snapshots/day6

I suspect that one of my snapshots is somehow bugged, and I just need to start a new lineage.

Tell me if you want me to push the test. Feel free to incorporate it yourself.


ettavolt commented Apr 22, 2023

After working flawlessly for two weeks, I've gotten into the same state again. This is the state after receive returned ERROR: clone: did not find source subvol when synchronizing to PC1 (which originated the current lineage at some point in the past):

% sudo btrfs subvolume list -uqR /          
ID 318 gen 235295 top level 5 parent_uuid c8220c06-51e5-3249-a3b0-0dc34f90e878 received_uuid -                                    uuid 82adf20b-a638-2f4c-a098-e30efa71169b path home/.ask-rbr-snapshots/@2023-04-19
ID 320 gen 235297 top level 5 parent_uuid 82adf20b-a638-2f4c-a098-e30efa71169b received_uuid -                                    uuid a1485bb9-797a-0745-add1-fa152c4009e7 path home/.ask-rbr-snapshots/@2023-04-21

The new snapshot, @2023-04-21, says it has a parent, but if I dry-run rsync to measure the difference between its state on PC1 and on the external drive, there are a lot of missing files, which I attribute to the incomplete parenting. Comparing the Maven (Java build system) cache, ~/.m2, yields no differences; however, du -s reports 13958448 for the snapshot on the external drive and 13576320 for the one on PC1 (i.e. after the incomplete restoration).

% sudo btrfs subvolume list -uqR /run/rbrssd
ID 312 gen 6529 top level 5 parent_uuid d4fa1bba-2621-ca4d-893a-0e4efcc69b2b received_uuid 14f1636a-5f1e-8f45-ad00-4d856add6975 uuid 0964fd63-a6fd-0944-a5a9-a8ebb059b6e5 path ask-arch-profile/@2023-04-17
ID 313 gen 6534 top level 5 parent_uuid 0964fd63-a6fd-0944-a5a9-a8ebb059b6e5 received_uuid 795c7e3f-cfa8-2a4c-9766-6f16061e549a uuid b8693023-d3c5-fa42-942d-3eda274af2f5 path ask-arch-profile/@2023-04-18
ID 314 gen 6539 top level 5 parent_uuid b8693023-d3c5-fa42-942d-3eda274af2f5 received_uuid 82adf20b-a638-2f4c-a098-e30efa71169b uuid 1f435645-5d5b-8048-9fe3-04fb7cdebd75 path ask-arch-profile/@2023-04-19
ID 315 gen 6544 top level 5 parent_uuid 1f435645-5d5b-8048-9fe3-04fb7cdebd75 received_uuid 866bef31-20d7-d64d-89fe-eac2b995b6e3 uuid 882e57ce-ec62-274c-bfe3-f0b311585858 path ask-arch-profile/@2023-04-20
ID 316 gen 6549 top level 5 parent_uuid 882e57ce-ec62-274c-bfe3-f0b311585858 received_uuid 746e18f7-01ec-324b-adc1-024223b230d4 uuid eaa06624-44a8-6b4b-abfd-d280a1b46227 path ask-arch-profile/@2023-04-20T16
ID 317 gen 6552 top level 5 parent_uuid eaa06624-44a8-6b4b-abfd-d280a1b46227 received_uuid ea5b21cf-9d24-974f-bdd9-60ce858eb889 uuid 1d7a8120-1f9d-3d4c-b799-0d4a6b838243 path ask-arch-profile/@2023-04-21

I think I had an out-of-sync kernel update 6.2.7 → 6.2.11 recently: around 2023-04-15 on PC1 and 2023-04-17 on PC2. Since then I had one correct "to PC1" → "to PC2" cycle around 2023-04-19.

I'll keep the snapshots around so that it is possible to debug them. Please tell me what diagnostics I can perform.

P.S. I wonder if clearing older snapshots on PC1 and PC2 might somehow influence this.

@ettavolt ettavolt changed the title btrfs receive erroneously reports fail when data flows in the opposite direction btrfs receive occasionally fails to find parent subvolume when data flows in the reverse direction Apr 22, 2023

joanbm commented Nov 20, 2023

I also ran into the same issue when syncing snapshots back and forth between two different systems using btrfs send / btrfs receive, and I can confirm that applying PR #643 fixes it.

What I don't understand, though, is why I only ran into this issue recently. I have been keeping my two systems in sync by sending snapshots back and forth for around 5 years, and it used to mostly work (though I still had to do full transfers a handful of times, which may have been due to this issue).
However, it was not until some point around 2023 (IIRC) that I started running into this issue recurrently, basically every time. I may be doing something slightly different on my side, though I'm still a bit intrigued.

@ettavolt (Contributor Author)

This happens only if one of the files is partially updated via the CoW mechanism. I don't think there is a lot of software doing that.
My errors were mostly because of Tracker databases. In its NEWS I see "Big performance improvements to inserts, updates and deletes" for 3.4.0, which does sound like CoW might have been employed. Depending on your system, you might have gotten Tracker 3.4.0+ only in April.
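
To illustrate the mechanism, a sketch of such a partial update (file names made up; on btrfs, copy_file_range() within one filesystem is typically fulfilled as an extent clone/reflink, which an incremental send can later express as a CLONE command, the one that fails here):

#!/usr/bin/env python3
# Sketch: carry one unchanged 4 KiB "page" of a database file over by
# cloning it instead of rewriting it; the destination then shares that
# extent with the source file.
import os

# Stand-in for an existing database file full of fixed-size pages.
with open("pages.old", "wb") as f:
    f.write(os.urandom(16384))

src = os.open("pages.old", os.O_RDONLY)
dst = os.open("pages.new", os.O_WRONLY | os.O_CREAT, 0o644)
os.copy_file_range(src, dst, 4096, offset_src=8192, offset_dst=8192)
os.close(src)
os.close(dst)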


joanbm commented Nov 21, 2023

Thanks, that makes sense as an explanation of why this issue can become more (or less) frequent with time and not just happen all the time.

In my case, I went back to the unpatched btrfs-progs and managed to reproduce the issue again, and the culprit (the path argument to process_clone) was ~/.cache/mesa_shader_cache/index. Looking at the code, I see that it is basically a hash table managed via posix_fallocate, mmap, etc., so it makes sense that some change to either Mesa or btrfs could have made it use that partial-update CoW mechanism.
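
A sketch of that access pattern (hypothetical sizes; a fixed-size file reserved with posix_fallocate and updated in place through mmap, so untouched pages keep extents shared with the parent snapshot, which an incremental send can express as CLONE commands):

import mmap, os

# Reserve a 16 KiB index file up front, as the cache does.
fd = os.open("index", os.O_RDWR | os.O_CREAT, 0o644)
os.posix_fallocate(fd, 0, 16384)
# Dirty exactly one 4 KiB page in place; the other pages stay untouched.
with mmap.mmap(fd, 16384) as m:
    m[4096:8192] = os.urandom(4096)
    m.flush()
os.close(fd)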

@kdave kdave added this to the v6.9 milestone May 17, 2024
kdave pushed a commit that referenced this issue May 17, 2024
…in reverse direction

process_clone() only searches the received_uuid, but the source
subvolume could exist under an earlier uuid that isn't the
received_uuid.  Mirror what process_snapshot does and search the
received_uuid first, then, if that fails, look up by the normal uuid.

Fixes: #606

Issue: #606
Pull-request: #643
Signed-off-by: Arsenii Skvortsov <ettavolt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
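
In other words, the fix adds a second lookup pass. A hypothetical Python model of the new lookup order (the real code is C, in process_clone() in btrfs-progs' cmds/receive.c):

from dataclasses import dataclass

@dataclass
class Subvol:
    uuid: str
    received_uuid: str

def find_clone_source(subvols, clone_uuid):
    # Original behaviour: match only subvolumes that were received earlier.
    for sv in subvols:
        if sv.received_uuid == clone_uuid:
            return sv
    # Fallback added by the fix, mirroring process_snapshot: the clone
    # source may be a local subvolume (e.g. the lineage that originated
    # on this machine), identified by its own uuid rather than by a
    # received_uuid.
    for sv in subvols:
        if sv.uuid == clone_uuid:
            return sv
    return None  # "ERROR: clone: did not find source subvol"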

kdave commented May 17, 2024

Now fixed in devel by PR #643. I'm not sure it's entirely correct to switch the UUIDs for matching the subvolume; there are some obscure states and combinations, and I don't remember whether there was a specific reason to do it the way it is. But you have a real-world use case, so it counts.

@kdave kdave closed this as completed May 17, 2024
@ettavolt (Contributor Author)

Somehow, if I navigate to the updated commit, 1ebdb7d, GitHub says:

This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

I've looked through the history of the devel branch and don't see it there.
Just in case, I've restored my fork and pushed this commit to the branch.

@ettavolt (Contributor Author)

@kdave, sorry, but the PR wasn't merged, and v6.9 doesn't include the fix.

@kdave kdave reopened this Jun 16, 2024
@kdave kdave modified the milestones: v6.9, v6.9.2 Jun 24, 2024

kdave commented Jun 24, 2024

Moved to 6.9.2 target.

@kdave kdave modified the milestones: v6.9.2, v6.9.3 Jun 25, 2024
@kdave kdave modified the milestones: v6.9.3, v6.10, v6.10.1 Jul 29, 2024

ettavolt commented Aug 3, 2024

I've rebased the branch on top of devel. Could PR #643 please be reopened, retargeted, and merged before work on the milestone begins?


kdave commented Aug 5, 2024

I can't reopen the pull request (the button is disabled in the UI); please open a new one.


ettavolt commented Aug 5, 2024

Done.

@kdave kdave closed this as completed in 6b5854c Aug 7, 2024
kdave pushed a commit that referenced this issue Aug 9, 2024
…in reverse direction

process_clone() only searches the received_uuid, but the source
subvolume could exist under an earlier uuid that isn't the
received_uuid.  Mirror what process_snapshot does and search the
received_uuid first, then, if that fails, look up by the normal uuid.

Fixes: #606

Issue: #606
Pull-request: #643
Pull-request: #862
Signed-off-by: Arsenii Skvortsov <ettavolt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave pushed the same commit, referencing this issue, twice more on Aug 14, 2024 (identical message to the above).