Fix read errors race after block cloning #16052
Conversation
(force-pushed from 12433ce to 4858047)
Wow, yeah. Your analysis seems right and matches the patch, so I think this is ok. I'm disgusted by the goto into a different branch arm, but it's pretty self-contained, and I can't think of a better way that wouldn't be harder to read in other ways. Really it's probably more a symptom of a design flaw than a problem all by itself.
Thanks for your service, this stuff is quite bonkers 🙇
module/zfs/dbuf.c (Outdated)

	 */
	dr = list_head(&db->db_dirty_records);
	if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
		if (dr == NULL)
Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done() after testing that db->db_state == DB_NOFILL and there is no pending BRT clone in the next txg? That seems cleaner, because then the dbuf has the same state it would have if it had been newly allocated by the reader hold.
Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.
Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done after testing that db->state == DB_NOFILL and no pending BRT clone in the next txg?
That was my original thought too, as I wrote in the description above, but if you look at dmu_buf_will_not_fill() and a few other places, there are windows where the dbuf lock is dropped between the dbuf state change and the dbuf_dirty() call. So I think there may be cases when the dirty record is not created yet, but DB_NOFILL is already set, and clearing it would be wrong.
Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.
I am not aware of any. Though I have some doubts about the correctness of the db_state checks in dbuf_write(), since IMHO it should care only about dirty records from the specific transaction being synced, not the current dbuf state a couple of transactions later. But I need to think about it more to understand it better.
I mean, there are.
That's one of the reasons for the panic in #11679 - on the write path, we set something we swear can't ever be in use by anyone else to NULL after writing, before allocating a fresh buffer, and we're sometimes wrong and lose a race.
#15538 makes this NULL dereference happen less often (I forget whether it removed it entirely or not), but it still produced incorrect behavior after that patch, just fewer outright NULL dereferences. So it's still incorrect.
Though I have some bad feelings about correctness of db_state checks in dbuf_write()
After closer look I think #16057 should be cleaner there.
I mean, there are.
@rincebrain On a quick look I am not sure how the issue and PR you mention are related to the issue of block cloning we hit, but I'll take a closer look. If you just want to bring attention to the problem, there are cleaner ways.
I wasn't trying to bring more attention to the issue, and I'm sorry if it came across that way; I was just reading the discussion and saw you say you didn't know whether there were races in dbuf_write, and I happened to know of an example where, to the best of my understanding, you go down the } else { at the bottom of dbuf_write into arc_write, which sets the abds to NULL on the way through, and then the panic I mentioned in #15538 happens if you go through arc_buf_untransform_in_place in the window between that and them being replaced afterward.
On rereading it, I can see you were trying to address races around cloning specifically, and I'm sorry for the confusion. I wasn't trying to draw more attention to that example, just that it seemed like a logical reply to the question of "are there any".
Ah, okay. Thanks for explaining.
Here's a goto-free implementation for your consideration:
	blkptr_t bp, *bpp = NULL;
	...
	if (db->db_state == DB_NOFILL) {
		dbuf_dirty_record_t *dr;

		/*
		 * Block cloning: If we have a pending block clone,
		 * we don't want to read the underlying block, but the
		 * content of the block being cloned, so we have the
		 * most recent data.
		 */
		dr = list_head(&db->db_dirty_records);
		if (dr != NULL) {
			if (!dr->dt.dl.dr_brtwrite) {
				err = EIO;
				goto early_unlock;
			}
			bp = dr->dt.dl.dr_overridden_by;
			bpp = &bp;
		}
	} else if (db->db_state != DB_UNCACHED) {
		err = EIO;
		goto early_unlock;
	}
	if (bpp == NULL && db->db_blkptr != NULL) {
		bp = *db->db_blkptr;
		bpp = &bp;
	}
@rrevans Thanks. I've updated with some more compact version.
Investigating read errors triggering the panic fixed in openzfs#16042, I've found that we have a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such a case is not an error, but is equivalent to DB_UNCACHED, since the dbuf block pointer has already been updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress that dropped the dbuf lock before creating a new dirty record, protected only by the range lock. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.
Possible candidate for zfs-2.2.4-staging?
Definitely. We need to make a pass over the recent commits to master and open PRs against 2.2.4-staging for those which should be backported.