-
I think it should be fixable. My PR #14243 already improves the short read and prefetch cases with primarycache=metadata. Maybe similar logic could be applied to writes too. We could make the reads in dmu_tx_check_ioerr() look like a prescient prefetch, which they really are, so they could legitimately be cached, at the very least by ARC. But I think this problem should happen only once per block write in a TXG. If you do another write to the same block while the first hasn't completed, it should not cause any reads.
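To make the idea concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions: the class `ToyTxgCache` and its methods are hypothetical names invented for illustration and do not mirror OpenZFS internals. It assumes the early read is kept until its first use (as a prescient prefetch would be) and that a block already dirty in the open TXG needs no further reads.

```python
# Toy model only: names are hypothetical and do not mirror OpenZFS internals.
class ToyTxgCache:
    def __init__(self):
        self.prefetched = set()  # blocks read early and kept until first use
        self.dirty = set()       # blocks already dirtied in the open TXG
        self.disk_reads = 0

    def check_ioerr(self, blkid):
        """Early read; the result is kept as if it were a prescient prefetch."""
        if blkid in self.dirty or blkid in self.prefetched:
            return
        self.disk_reads += 1
        self.prefetched.add(blkid)

    def will_dirty(self, blkid):
        """Read-modify-write step; consumes the prefetched copy if present."""
        if blkid in self.dirty:
            return
        if blkid in self.prefetched:
            self.prefetched.discard(blkid)
        else:
            self.disk_reads += 1
        self.dirty.add(blkid)

    def write(self, blkid):
        """A partial-block write: early check, then dirty the buffer."""
        self.check_ioerr(blkid)
        self.will_dirty(blkid)

    def sync_txg(self):
        """After the TXG syncs, data is not kept (models primarycache=metadata)."""
        self.dirty.clear()


cache = ToyTxgCache()
cache.write(7)           # first partial write: one read for the RMW
cache.write(7)           # same block, same TXG: no additional read
cache.sync_txg()
cache.write(7)           # new TXG: the block has to be read again
print(cache.disk_reads)  # prints 2
```

In this model the second write inside the same TXG is free, and each new TXG costs one read per partially written block rather than two.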
-
I think the small addition I've made to #14243 should fix the issue. @Finix1979, you are welcome to test it. That PR should give you huge improvements on many fronts if you are running with primarycache=metadata.
-
I recently found that if I rewrite part of a file with a small I/O (< recordsize) on a dataset whose primarycache is set to metadata, ZFS reads the record from disk twice in the zfs_write routine. I understand that RMW is necessary, but it would be better if the record were read only once.
The reproduction is simple (ZFS 2.1.7, the newest version, on Ubuntu 22.04):
dd if=/dev/urandom of=/p1/idx/a bs=8K count=1024
zpool sync
dd if=/dev/urandom of=/p1/idx/a bs=32 count=1 conv=notrunc
1302504 1302504 dd arc_read
b'arc_read+0x1 [zfs]'
b'dbuf_read+0x103 [zfs]'
b'dmu_tx_check_ioerr+0x6e [zfs]'
b'dmu_tx_count_write+0x1af [zfs]'
b'dmu_tx_hold_write_by_dnode+0x66 [zfs]'
b'zfs_write+0x423 [zfs]'
b'zpl_iter_write+0xf3 [zfs]'
b'new_sync_write+0x111 [kernel]'
b'vfs_write+0x1d5 [kernel]'
b'ksys_write+0x67 [kernel]'
b'__x64_sys_write+0x19 [kernel]'
b'do_syscall_64+0x59 [kernel]'
b'entry_SYSCALL_64_after_hwframe+0x61 [kernel]'
1302504 1302504 dd arc_read
b'arc_read+0x1 [zfs]'
b'dbuf_read+0x103 [zfs]'
b'dmu_buf_will_dirty_impl+0xeb [zfs]'
b'dmu_buf_will_dirty+0x16 [zfs]'
b'dmu_write_uio_dnode+0x85 [zfs]'
b'dmu_write_uio_dbuf+0x4f [zfs]'
b'zfs_write+0x8be [zfs]'
b'zpl_iter_write+0xf3 [zfs]'
b'new_sync_write+0x111 [kernel]'
b'vfs_write+0x1d5 [kernel]'
b'ksys_write+0x67 [kernel]'
b'__x64_sys_write+0x19 [kernel]'
b'do_syscall_64+0x59 [kernel]'
b'entry_SYSCALL_64_after_hwframe+0x61 [kernel]'
According to the trace, the record to be written is read by both dmu_tx_check_ioerr and dmu_buf_will_dirty. Because I set primarycache to metadata, the dbuf and its ARC buffer are destroyed after dmu_tx_check_ioerr returns. That is why dmu_buf_will_dirty cannot find the dbuf and has to create a new one, which triggers the second read.
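The effect can be illustrated with a small toy model in Python. This is only a sketch of my understanding, not ZFS code: `ToyCache` and `partial_write` are hypothetical names, and the model assumes that an uncacheable data buffer is evicted as soon as its last hold is released.

```python
# Toy model only: class and function names are hypothetical, not OpenZFS code.
class ToyCache:
    def __init__(self, cache_data):
        self.cache_data = cache_data  # False models primarycache=metadata
        self.buffers = {}             # block id -> contents
        self.disk_reads = 0

    def read(self, blkid):
        """Return the block, going to 'disk' if it is not cached."""
        if blkid not in self.buffers:
            self.disk_reads += 1
            self.buffers[blkid] = f"data-{blkid}"
        return self.buffers[blkid]

    def release(self, blkid):
        """Drop the last hold; uncacheable data is evicted immediately."""
        if not self.cache_data:
            self.buffers.pop(blkid, None)


def partial_write(cache, blkid):
    # Step 1: the transaction-hold check reads the block to detect I/O errors,
    # then drops its hold on the buffer.
    cache.read(blkid)
    cache.release(blkid)
    # Step 2: dirtying the buffer needs the old contents for read-modify-write.
    cache.read(blkid)
    return cache.disk_reads


print(partial_write(ToyCache(cache_data=False), 0))  # prints 2
print(partial_write(ToyCache(cache_data=True), 0))   # prints 1
```

With data caching disabled, the buffer from the first step is gone by the time the second step needs it, so the model counts two disk reads for a single partial write.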
Is there any way to avoid this double read even when primarycache is set to metadata? Thanks.