Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

Closed
rrevans opened this issue Mar 14, 2024 · 1 comment
Closed
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rrevans
Copy link
Contributor

rrevans commented Mar 14, 2024

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 39
Kernel Version 6.6.8-200.fc39.x86_64
Architecture x86_64
OpenZFS Version Build from head (8f2f6cd DEBUG)

Describe the problem you're observing

With zfs_bclone_enabled=1 and zfs_bclone_wait_dirty=1, copying a file with unallocated blocks at the end gets stuck in the kernel forever. During this time, the kernel is also forcing txg syncs in an infinite loop.

This behavior is also observed in #15933.

(The same underlying issue also means that bclone always fails for files with unallocated blocks at the end if zfs_bclone_wait_dirty=0.)

Describe how to reproduce the problem

$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled
$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty
$ cd /testpool
$ truncate --size=256M src
$ zpool sync
$ cp --reflink=always src dst

cp never finishes and is stuck in an uninterruptible state unresponsive to both SIGINT and SIGQUIT.

Setting zfs_bclone_wait_dirty=0 while cp is still running causes cp to finish immediately with error cp: failed to clone 'dst' from 'src': Resource temporarily unavailable

Include any warning/errors/backtraces from the system logs

Nothing immediately in dmesg or dbgmsg, but during the failure /proc/spl/kstat/zfs/testpool/txgs shows that zfs is generating a lot of empty txgs:

$ tail -30 /proc/spl/kstat/zfs/testpool/txgs
1571017  393315256830     C     0            0            0            0        0        701860       6090         25220        77480       
1571018  393315958690     C     0            0            0            0        0        704610       6090         25140        77340       
1571019  393316663300     C     0            0            0            0        0        695860       6800         25400        106120      
1571020  393317359160     C     0            0            0            0        0        719080       6120         25310        88020       
1571021  393318078240     C     0            0            0            0        0        710381       6100         25260        78150       
1571022  393318788621     C     0            0            0            0        0        710010       6100         25780        78240       
1571023  393319498631     C     0            0            0            0        0        693820       6130         25420        79990       
1571024  393320192451     C     0            0            0            0        0        701620       6090         24460        77400       
1571025  393320894071     C     0            0            0            0        0        701870       6100         25250        78390       
1571026  393321595941     C     0            0            0            0        0        696431       6080         25560        77490       
1571027  393322292372     C     0            0            0            0        0        699960       6070         25330        77490       
1571028  393322992332     C     0            0            0            0        0        702420       6220         25210        77350       
1571029  393323694752     C     0            0            0            0        0        695030       6160         25730        97120       
1571030  393324389782     C     0            0            0            0        0        710290       6080         25340        87720       
1571031  393325100072     C     0            0            0            0        0        709801       6070         25430        78120       
1571032  393325809873     C     0            0            0            0        0        697140       6240         24630        77370       
1571033  393326507013     C     0            0            0            0        0        696710       6080         25130        78150       
1571034  393327203723     C     0            0            0            0        0        700170       6110         24730        77050       
1571035  393327903893     C     0            0            0            0        0        703480       6140         25691        78570       
1571036  393328607373     C     0            0            0            0        0        698401       6110         25590        78960       
1571037  393329305774     C     0            0            0            0        0        699180       6090         25290        77460       
1571038  393330004954     C     0            0            0            0        0        703890       6090         25230        77220       
1571039  393330708844     C     0            0            0            0        0        694670       6080         25520        96280       
1571040  393331403514     C     0            0            0            0        0        707651       6080         25380        87750       
1571041  393332111165     C     0            0            0            0        0        711130       6250         25450        77950       
1571042  393332822295     C     0            0            0            0        0        705860       6120         24780        83430       
1571043  393333528155     C     0            0            0            0        0        745090       6140         34510        96930       
1571044  393334273245     C     0            0            0            0        0        790700       6180         32650        106500      
1571045  393335063945     S     0            0            0            0        0        784211       6140         32690        0           
1571046  393335848156     O     0            0            0            0        0        0            0            0            0           

Root cause

#15842 adds logic to wait for sync when encountering dirty blocks implemented as syncing when dmu_read_l0_bps returns EAGAIN, but the logic is broken. cc @behlendorf

Normally dmu_read_l0_bps returns EAGAIN for dirty blocks. However it also returns EAGAIN whenever db->db_blkptr == NULL. This normally occurs for newly-written blocks not-yet-allocated, but it also occurs for sparse, unallocated blocks beyond the end of a fully-synced object. (More specifically, this occurs for any of the conditions that cause dbuf_findbp to return ENOENT when holding the dbuf.)

In this situation, zfs_clone_range tries to force a sync when zfs_bclone_wait_dirty=1, but syncing does not allocate any blocks since none are actually dirty. Then the next attempt runs into the same condition and syncs again in an infinite loop. Setting zfs_bclone_wait_dirty=0 breaks the loop and returns an error to cp.

This is trivially reproducible by creating an empty sparse file, as seen by zdb:

$ zdb -vvvvvv -O testpool src
obj=128 dataset=testpool path=/src type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       128    1   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               184   bonus  System attributes
	dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	uid     1000
	gid     1000
	atime	Wed Mar 13 23:17:42 2024
	mtime	Wed Mar 13 23:17:42 2024
	ctime	Wed Mar 13 23:17:42 2024
	crtime	Wed Mar 13 23:17:42 2024
	gen	2368119
	mode	100644
	size	268435456
	parent	34
	links	1
	pflags	840800000004
	xattr	129
Indirect blocks:

Note that despite being a 256MiB file size according to ZPL metadata, the actual on-disk object is still a 1-level object with dn_maxblkid == 0 and no indirect blocks which is sufficient to trigger the db_blkptr == NULL case upon dbuf_hold.

@rrevans rrevans added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 14, 2024
amotin added a commit to amotin/zfs that referenced this issue Mar 18, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks always set destination logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
While there, stop messing with physical birth time, it was already
copied as part of the pointer and should already be zero for holes.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes: openzfs#15994
@amotin amotin mentioned this issue Mar 18, 2024
13 tasks
@amotin
Copy link
Member

amotin commented Mar 18, 2024

This seems to be completely unrelated issue from #15933. #16007 should fix this.

amotin added a commit to amotin/zfs that referenced this issue Mar 19, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
While there, stop messing with physical birth time, it was already
copied as part of the pointer and should already be zero for holes.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes: openzfs#15994
amotin added a commit to amotin/zfs that referenced this issue Mar 20, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes: openzfs#15994
amotin added a commit to amotin/zfs that referenced this issue Mar 25, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes: openzfs#15994
amotin added a commit to amotin/zfs that referenced this issue Mar 26, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes: openzfs#15994
amotin added a commit to amotin/zfs that referenced this issue Apr 17, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes openzfs#15994
Closes openzfs#16007
behlendorf pushed a commit that referenced this issue Apr 19, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15994
Closes #16007
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Sep 4, 2024
 - When reading L0 block pointers handle buffers without ones and
without dirty records as a holes.  Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers.  It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
 - When cloning blocks set destination hole logical birth time
to the current TXG.  Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes openzfs#15994
Closes openzfs#16007
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants