Skip to content

Commit

Permalink
Fix read errors race after block cloning
Browse files Browse the repository at this point in the history
Investigating read errors triggering panic fixed in #16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed.  If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_CACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_CACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
  • Loading branch information
amotin committed Apr 2, 2024
1 parent 39be46f commit 12433ce
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions module/zfs/dbuf.c
Original file line number Diff line number Diff line change
Expand Up @@ -1581,6 +1581,7 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
}

if (db->db_state == DB_UNCACHED) {
uncached:
if (db->db_blkptr == NULL) {
bpp = NULL;
} else {
Expand All @@ -1593,12 +1594,18 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
ASSERT3S(db->db_state, ==, DB_NOFILL);

/*
* Block cloning: If we have a pending block clone,
* we don't want to read the underlying block, but the content
* of the block being cloned, so we have the most recent data.
* If we have a pending block clone, we don't want to read
* the underlying block, but the content of the block being
* cloned, pointed by the dirty record, so we have the most
* recent data. If there is no dirty record, then we hit a
* race in a sync process when the dirty record is already
* removed, while the dbuf is not yet destroyed. Such case
* is equivalent to uncached.
*/
dr = list_head(&db->db_dirty_records);
if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
if (dr == NULL)
goto uncached;
if (!dr->dt.dl.dr_brtwrite) {
err = EIO;
goto early_unlock;
}
Expand Down

0 comments on commit 12433ce

Please sign in to comment.