dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()

Even though the bug's writeup (Github issue openzfs#9136) is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information on filled issue with the code. From the fact that `dp_dirty_total` was 0 (which is less than `zfs_dirty_data_max`) we know that there was one thread that set it to 0 and then signaled one of the waiters of `dp_spaceavail_cv` [see `dsl_pool_dirty_delta()` which is also the only place that `dp_dirty_total` is changed]. Thus, the only logical explaination then for the bug being hit is that the waiter that just got awaken didn't go through `dsl_pool_dirty_data()`. Given that this function is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()` I can only think of two possible ways of the above scenario happening: [1] The waiter didn't call into any of the two functions - which I find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin with?). [2] The waiter did call in one of the above function but it passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (e.g both `dsl_pool_dirty_space()` and `dsl_pool_undirty_space()` return immediately when space is 0). In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever `dp_dirty_total` hits 0. That said and given that we've never hit this before, it would make sense to think more on why the above situation occured. Attempting to mimic what Prakash was doing in the issue filed, I created a dataset with `sync=always` and started doing contiguous writes in a file within that dataset. I observed with DTrace that even though we update the pool's dirty data accounting when we would dirty stuff, the accounting wouldn't be decremented incrementally as we were done with the ZIOs of those writes (the reason being that `dbuf_write_physdone()` isn't be called as we go through the override code paths, and thus `dsl_pool_undirty_space()` is never called). As a result we'd have to wait until we get to `dsl_pool_sync()` where we zero out all dirty data accounting for the pool and the current TXG's metadata. In addition, as Matt noted and I later verified, the same issue would arise when using dedup. In both cases (sync & dedup) we shouldn't have to wait until `dsl_pool_sync()` zeros out the accounting data. According to the comment in that part of the code, the reasons why we do the zeroing, have nothing to do with what we observe: ```` /* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now. */ dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); ```` Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable. Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through `dbuf_new_size()` we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less that we dirtied from the pool's accounting point of view. For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not. For changes in the MOS, we first ensure to zero out the pool's dirty data accounting in `dsl_pool_sync()` after we synced the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting wheneve we change MOS data. Another fix is that we now update the accounting explicitly for counting errors in `dbuf_write_done()`. Finally, `dbuf_new_size()` updates the accounted space of the appropriate dirty record correctly now. The problem is that we still don't know how the bug came up in the issue filled. That said the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> External-issue: DLPX-47285 Closes openzfs#9137
tonyhutter · Dec 23, 2019 · b322597 · b322597
1 parent 8bcdeb2
commit b322597
Show file tree

Hide file tree

Showing 4 changed files with 60 additions and 18 deletions.
diff --git a/module/zfs/dbuf.c b/module/zfs/dbuf.c
@@ -1826,9 +1826,11 @@ dbuf_new_size(dmu_buf_impl_t *db, int size, dmu_tx_t *tx)
 	db->db.db_size = size;
 
 	if (db->db_level == 0) {
-		ASSERT3U(db->db_last_dirty->dr_txg, ==, tx->tx_txg);
 		db->db_last_dirty->dt.dl.dr_data = buf;
 	}
+	ASSERT3U(db->db_last_dirty->dr_txg, ==, tx->tx_txg);
+	ASSERT3U(db->db_last_dirty->dr_accounted, ==, osize);
+	db->db_last_dirty->dr_accounted = size;
 	mutex_exit(&db->db_mtx);
 
 	dmu_objset_willuse_space(dn->dn_objset, size - osize, tx);
@@ -2041,7 +2043,7 @@ dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
 		    sizeof (dbuf_dirty_record_t),
 		    offsetof(dbuf_dirty_record_t, dr_dirty_node));
 	}
-	if (db->db_blkid != DMU_BONUS_BLKID && os->os_dsl_dataset != NULL)
+	if (db->db_blkid != DMU_BONUS_BLKID)
 		dr->dr_accounted = db->db.db_size;
 	dr->dr_dbuf = db;
 	dr->dr_txg = tx->tx_txg;
@@ -4210,8 +4212,7 @@ dbuf_write_physdone(zio_t *zio, arc_buf_t *buf, void *arg)
 	/*
 	 * The callback will be called io_phys_children times.  Retire one
 	 * portion of our dirty space each time we are called.  Any rounding
-	 * error will be cleaned up by dsl_pool_sync()'s call to
-	 * dsl_pool_undirty_space().
+	 * error will be cleaned up by dbuf_write_done().
 	 */
 	delta = dr->dr_accounted / zio->io_phys_children;
 	dsl_pool_undirty_space(dp, delta, zio->io_txg);
@@ -4294,13 +4295,36 @@ dbuf_write_done(zio_t *zio, arc_buf_t *buf, void *vdb)
 		mutex_destroy(&dr->dt.di.dr_mtx);
 		list_destroy(&dr->dt.di.dr_children);
 	}
-	kmem_free(dr, sizeof (dbuf_dirty_record_t));
 
 	cv_broadcast(&db->db_changed);
 	ASSERT(db->db_dirtycnt > 0);
 	db->db_dirtycnt -= 1;
 	db->db_data_pending = NULL;
 	dbuf_rele_and_unlock(db, (void *)(uintptr_t)tx->tx_txg, B_FALSE);
+
+	/*
+	 * If we didn't do a physical write in this ZIO and we
+	 * still ended up here, it means that the space of the
+	 * dbuf that we just released (and undirtied) above hasn't
+	 * been marked as undirtied in the pool's accounting.
+	 *
+	 * Thus, we undirty that space in the pool's view of the
+	 * world here. For physical writes this type of update
+	 * happens in dbuf_write_physdone().
+	 *
+	 * If we did a physical write, cleanup any rounding errors
+	 * that came up due to writing multiple copies of a block
+	 * on disk [see dbuf_write_physdone()].
+	 */
+	if (zio->io_phys_children == 0) {
+		dsl_pool_undirty_space(dmu_objset_pool(os),
+		    dr->dr_accounted, zio->io_txg);
+	} else {
+		dsl_pool_undirty_space(dmu_objset_pool(os),
+		    dr->dr_accounted % zio->io_phys_children, zio->io_txg);
+	}
+
+	kmem_free(dr, sizeof (dbuf_dirty_record_t));
 }
 
 static void

diff --git a/module/zfs/dmu.c b/module/zfs/dmu.c
@@ -1098,6 +1098,9 @@ dmu_write(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
 	dmu_buf_rele_array(dbp, numbufs, FTAG);
 }
 
+/*
+ * Note: Lustre is an external consumer of this interface.
+ */
 void
 dmu_write_by_dnode(dnode_t *dn, uint64_t offset, uint64_t size,
     const void *buf, dmu_tx_t *tx)

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
@@ -3002,9 +3002,17 @@ dmu_fsname(const char *snapname, char *buf)
 }
 
 /*
- * Call when we think we're going to write/free space in open context to track
- * the amount of dirty data in the open txg, which is also the amount
- * of memory that can not be evicted until this txg syncs.
+ * Call when we think we're going to write/free space in open context
+ * to track the amount of dirty data in the open txg, which is also the
+ * amount of memory that can not be evicted until this txg syncs.
+ *
+ * Note that there are two conditions where this can be called from
+ * syncing context:
+ *
+ * [1] When we just created the dataset, in which case we go on with
+ *     updating any accounting of dirty data as usual.
+ * [2] When we are dirtying MOS data, in which case we only update the
+ *     pool's accounting of dirty data.
  */
 void
 dmu_objset_willuse_space(objset_t *os, int64_t space, dmu_tx_t *tx)
@@ -3014,8 +3022,9 @@ dmu_objset_willuse_space(objset_t *os, int64_t space, dmu_tx_t *tx)
 
 	if (ds != NULL) {
 		dsl_dir_willuse_space(ds->ds_dir, aspace, tx);
-		dsl_pool_dirty_space(dmu_tx_pool(tx), space, tx);
 	}
+
+	dsl_pool_dirty_space(dmu_tx_pool(tx), space, tx);
 }
 
 #if defined(_KERNEL)

diff --git a/module/zfs/dsl_pool.c b/module/zfs/dsl_pool.c
@@ -659,15 +659,6 @@ dsl_pool_sync(dsl_pool_t *dp, uint64_t txg)
 	}
 	VERIFY0(zio_wait(zio));
 
-	/*
-	 * We have written all of the accounted dirty data, so our
-	 * dp_space_towrite should now be zero.  However, some seldom-used
-	 * code paths do not adhere to this (e.g. dbuf_undirty(), also
-	 * rounding error in dbuf_write_physdone).
-	 * Shore up the accounting of any dirtied space now.
-	 */
-	dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
-
 	/*
 	 * Update the long range free counter after
 	 * we're done syncing user data
@@ -762,6 +753,21 @@ dsl_pool_sync(dsl_pool_t *dp, uint64_t txg)
 		dsl_pool_sync_mos(dp, tx);
 	}
 
+	/*
+	 * We have written all of the accounted dirty data, so our
+	 * dp_space_towrite should now be zero. However, some seldom-used
+	 * code paths do not adhere to this (e.g. dbuf_undirty()). Shore up
+	 * the accounting of any dirtied space now.
+	 *
+	 * Note that, besides any dirty data from datasets, the amount of
+	 * dirty data in the MOS is also accounted by the pool. Therefore,
+	 * we want to do this cleanup after dsl_pool_sync_mos() so we don't
+	 * attempt to update the accounting for the same dirty data twice.
+	 * (i.e. at this point we only update the accounting for the space
+	 * that we know that we "leaked").
+	 */
+	dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
+
 	/*
 	 * If we modify a dataset in the same txg that we want to destroy it,
 	 * its dsl_dir's dd_dbuf will be dirty, and thus have a hold on it.