Fix read errors race after block cloning #16052
Conversation
(force-pushed from 12433ce to 4858047)
Wow, yeah. Your analysis seems right and matches the patch, so I think this is ok. I'm disgusted by the goto into a different branch arm, but it's pretty self-contained, and I can't think of a better way that wouldn't be harder to read in other ways. Really it's probably more a symptom of a design flaw than a problem all by itself.
Thanks for your service, this stuff is quite bonkers 🙇
module/zfs/dbuf.c (Outdated)

	 */
	dr = list_head(&db->db_dirty_records);
	if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
		if (dr == NULL)
Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done() after testing that db->db_state == DB_NOFILL and there is no pending BRT clone in the next txg? That seems cleaner, because then the dbuf has the same state it would have if it had been newly allocated by the reader hold.
Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.
Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done after testing that db->state == DB_NOFILL and no pending BRT clone in the next txg?
That was my original thought too, as I wrote in the description above, but if you look at dmu_buf_will_not_fill() and a few other places, there are windows where the dbuf lock is dropped between the dbuf state change and the dbuf_dirty() call. So I think there may be cases when the dirty record is not created yet, but DB_NOFILL is already set, and clearing it would be wrong.
Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.
I am not aware of any. Though I have some doubts about the correctness of the db_state checks in dbuf_write(), since IMHO it should care only about dirty records from the specific transaction being synced, not the current dbuf state a couple of transactions later. But I need to think about it more to understand it better.
I mean, there are.
That's one of the reasons for the panic in #11679 - on the write path, we set something we swear can't ever be in use by anyone else to NULL after writing, before allocating a fresh buffer, and we're sometimes wrong and lose a race.
#15538 makes this NULL dereference happen less often (I forget whether it removed it entirely or not), but it still produced incorrect behavior after that patch, just fewer outright NULL dereferences. So it's still incorrect.
Though I have some bad feelings about correctness of db_state checks in dbuf_write()
After closer look I think #16057 should be cleaner there.
I mean, there are.
@rincebrain On a quick look I am not sure how the issue and PR you mention are related to the issue of block cloning we hit, but I'll take a closer look. If you just want to bring attention to the problem, there are cleaner ways.
I wasn't trying to bring more attention to the issue, and I'm sorry if it came across that way; I was just reading the discussion and saw you say you didn't know whether there were races in dbuf_write, and I happened to know of an example where, to the best of my understanding, you go down the } else { at the bottom of dbuf_write into arc_write, which sets the abds to NULL on the way through, and then the panic I mentioned in #15538 happens if you go through arc_buf_untransform_in_place in the window between that and them being replaced afterward.
On rereading it, I can see you were trying to address races around cloning specifically, and I'm sorry for the confusion. I wasn't trying to draw more attention to that example, just that it seemed like a logical reply to the question of "are there any".
Ah, okay. Thanks for explaining.
Here's a goto-free implementation for your consideration:
	blkptr_t bp, *bpp = NULL;
	...
	if (db->db_state == DB_NOFILL) {
		dbuf_dirty_record_t *dr;

		/*
		 * Block cloning: If we have a pending block clone,
		 * we don't want to read the underlying block, but the
		 * content of the block being cloned, so we have the
		 * most recent data.
		 */
		dr = list_head(&db->db_dirty_records);
		if (dr != NULL) {
			if (!dr->dt.dl.dr_brtwrite) {
				err = EIO;
				goto early_unlock;
			}
			bp = dr->dt.dl.dr_overridden_by;
			bpp = &bp;
		}
	} else if (db->db_state != DB_UNCACHED) {
		err = EIO;
		goto early_unlock;
	}
	if (bpp == NULL && db->db_blkptr != NULL) {
		bp = *db->db_blkptr;
		bpp = &bp;
	}
@rrevans Thanks. I've updated with some more compact version.
Investigating read errors triggering the panic fixed in openzfs#16042, I've found that we have a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such a case is not an error, but is equivalent to DB_UNCACHED, since the dbuf block pointer has already been updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress that dropped the dbuf lock before creating a new dirty record, protected only by the range lock. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.
Possible candidate for zfs-2.2.4-staging?
Definitely. We need to make a pass over the recent commits to master and open PRs against 2.2.4-staging for those which should be backported.