
Fix read errors race after block cloning #16052

Merged: 1 commit merged into openzfs:master from clone_no_dr
Apr 8, 2024

Conversation

@amotin amotin (Member) commented Apr 2, 2024

Investigating the read errors triggering the panic fixed in #16042, I've found that we have a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in the DB_NOFILL state, but already without the dirty record. Such a case is not an error, but is equivalent to DB_UNCACHED, since the dbuf block pointer has already been updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress that dropped the dbuf lock before creating a new dirty record, protected only by the range lock.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 2, 2024
@amotin amotin force-pushed the clone_no_dr branch 2 times, most recently from 12433ce to 4858047 on April 2, 2024 at 22:03
@robn robn (Member) left a comment

Wow, yeah. Your analysis seems right and matches the patch, so I think this is OK. I'm disgusted by the goto into a different branch arm, but it's pretty self-contained, and I can't think of a better way that wouldn't be harder to read in other ways. Really, it's probably more a symptom of a design flaw than a problem all by itself.

Thanks for your service, this stuff is quite bonkers 🙇

	 */
	dr = list_head(&db->db_dirty_records);
	if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
		if (dr == NULL)
Contributor

Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done after testing that db->db_state == DB_NOFILL and there is no pending BRT clone in the next txg? That seems cleaner, because then the dbuf has the same state it would have if it had been newly allocated by the reader's hold.

Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.

@amotin amotin (Member, Author) Apr 3, 2024

Is it possible to set db_state = DB_UNCACHED in dbuf_write_override_done after testing that db->state == DB_NOFILL and no pending BRT clone in the next txg?

That was my original thought too, as I wrote in the description above, but if you look at dmu_buf_will_not_fill() and a few other places, there are windows where the dbuf lock is dropped between the dbuf state change and the dbuf_dirty() call. So I think there may be cases when DB_NOFILL is no longer set by the old clone but is already set by a new one, and clearing it would be wrong.

Also, are there any lingering races/issues on the write path during this racing hold window? e.g. write (or clone, or free) happens instead of read.

I am not aware of any. Though I have some bad feelings about the correctness of the db_state checks in dbuf_write(), since IMHO it should care only about dirty records from the specific transaction being synced, not the current dbuf state a couple of transactions later. But I need to think more to understand it better.

@rincebrain rincebrain (Contributor) Apr 3, 2024

I mean, there are.

That's one of the reasons for the panic in #11679: on the write path, we set something we swear can't ever be in use by anyone else to NULL after writing, before allocating a fresh buffer, and we're sometimes wrong and lose a race.

#15538 makes this NULL dereference happen less often (I forget whether it removed it entirely or not), but it still produced incorrect behavior after that patch, just fewer outright NULL dereferences. So it's still incorrect.

Member Author

Though I have some bad feelings about correctness of db_state checks in dbuf_write()

After closer look I think #16057 should be cleaner there.

I mean, there are.

@rincebrain At a quick look I am not sure how the issue and PR you mention are related to the block cloning issue we hit, but I'll take a closer look. If you just wanted to bring attention to the problem, there are cleaner ways.

@rincebrain rincebrain (Contributor) Apr 3, 2024

I wasn't trying to bring more attention to the issue, and I'm sorry if it came across that way. I was just reading the discussion, saw you say you didn't know whether there were races in dbuf_write, and happened to know of an example: to the best of my understanding, you go down the } else { at the bottom of dbuf_write into arc_write, which sets the abds to NULL on the way through, and then the panic I mentioned in #15538 happens if you go through arc_buf_untransform_in_place in the window between that and the abds being replaced afterward.

On rereading it, I can see you were trying to address races around cloning specifically, and I'm sorry for the confusion. I wasn't trying to draw more attention to that example; it just seemed like a logical reply to the question of "are there any".

@rrevans rrevans (Contributor) Apr 4, 2024

Ah, okay. Thanks for explaining.

Here's a goto-free implementation for your consideration:

	blkptr_t bp, *bpp = NULL;

	...

	if (db->db_state == DB_NOFILL) {
		dbuf_dirty_record_t *dr;

		/*
		 * Block cloning: If we have a pending block clone,
		 * we don't want to read the underlying block, but the content
		 * of the block being cloned, so we have the most recent data.
		 */
		dr = list_head(&db->db_dirty_records);
		if (dr != NULL) {
			if (!dr->dt.dl.dr_brtwrite) {
				err = EIO;
				goto early_unlock;
			}
			bp = dr->dt.dl.dr_overridden_by;
			bpp = &bp;
		}
	} else if (db->db_state != DB_UNCACHED) {
		err = EIO;
		goto early_unlock;
	}

	if (bpp == NULL && db->db_blkptr != NULL) {
		bp = *db->db_blkptr;
		bpp = &bp;
	}

Member Author

@rrevans Thanks. I've updated with some more compact version.

Investigating read errors triggering panic fixed in openzfs#16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed.  If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_UNCACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_UNCACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 8, 2024
@behlendorf behlendorf merged commit eeca9a9 into openzfs:master Apr 8, 2024
23 of 26 checks passed
@amotin amotin deleted the clone_no_dr branch April 9, 2024 13:16
@mmatuska (Contributor)

Possible candidate for zfs-2.2.4-staging?

@behlendorf (Contributor)

Definitely. We need to make a pass over the recent commits to master and open PRs against 2.2.4-staging for those which should be backported.

bwatkinson pushed a commit to bwatkinson/zfs that referenced this pull request Apr 15, 2024
Investigating read errors triggering panic fixed in openzfs#16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed.  If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_UNCACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_UNCACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Robert Evans <evansr@google.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16052

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
amotin added a commit to amotin/zfs that referenced this pull request Apr 17, 2024
behlendorf pushed a commit that referenced this pull request Apr 19, 2024
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024