dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() #9137
Conversation
Codecov Report
|            | master | #9137  | +/-  |
|:-----------|-------:|-------:|-----:|
| Coverage   | 79.14% | 79.04% | -0.1% |
| Files      | 400    | 400    |      |
| Lines      | 121812 | 121819 | +7   |
| Hits       | 96405  | 96294  | -111 |
| Misses     | 25407  | 25525  | +118 |
Continue to review full report at Codecov.
Looks good, as long as we keep `dmu_write_by_dnode`, which is used by Lustre.
Even though the bug's writeup (GitHub issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess by
combining the information in the filed issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()`, which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explanation for
the bug being hit is that the waiter that was just awakened didn't go
through `dsl_pool_dirty_data()`. Given that this function is only
called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`, I can
only think of two possible ways for the above scenario to happen:
[1] The waiter didn't call into either of the two functions, which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call into one of the above functions but passed 0
as the space/delta to be dirtied (or undirtied), and the callee
returned immediately (e.g. both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).

In any case, and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said, given that we've never hit this before, it would make sense to
think more about why the above situation occurred.
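To make the wakeup semantics concrete, here is a minimal user-space
analogue of the `dp_spaceavail_cv` pattern using POSIX threads. The
helper names and sizes are made up for illustration and are not the
actual ZFS code; the point is that signaling wakes exactly one waiter,
so if that waiter never changes the dirty total again the remaining
waiters stay blocked, whereas broadcasting when the total reaches 0
wakes everyone and lets each waiter re-check the condition.

```c
/* cc -pthread cv_demo.c -o cv_demo */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t dirty_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  spaceavail_cv = PTHREAD_COND_INITIALIZER;
static int64_t dirty_total;             /* analogue of dp_dirty_total */
static const int64_t dirty_max = 1024;  /* analogue of zfs_dirty_data_max */

/* Analogue of dsl_pool_dirty_delta(): the only place dirty_total changes. */
static void
dirty_delta(int64_t delta)
{
	pthread_mutex_lock(&dirty_lock);
	dirty_total += delta;
	/*
	 * Waking a single waiter is only safe if that waiter is
	 * guaranteed to call dirty_delta() again and pass the wakeup
	 * on.  Broadcasting when dirty_total drops below the limit
	 * removes that assumption.
	 */
	if (dirty_total < dirty_max)
		pthread_cond_broadcast(&spaceavail_cv); /* vs. _signal() */
	pthread_mutex_unlock(&dirty_lock);
}

/* Analogue of the wait in dmu_tx_wait(): block until there is room. */
static void *
tx_wait(void *arg)
{
	pthread_mutex_lock(&dirty_lock);
	while (dirty_total >= dirty_max)
		pthread_cond_wait(&spaceavail_cv, &dirty_lock);
	pthread_mutex_unlock(&dirty_lock);
	printf("waiter %ld woke up\n", (long)(intptr_t)arg);
	return (NULL);
}

int
main(void)
{
	pthread_t waiters[2];

	dirty_total = dirty_max;	/* pool is "full": waiters must block */
	for (long i = 0; i < 2; i++)
		pthread_create(&waiters[i], NULL, tx_wait,
		    (void *)(intptr_t)i);
	sleep(1);			/* let both waiters go to sleep */
	dirty_delta(-dirty_max);	/* all dirty data "synced out" */
	for (int i = 0; i < 2; i++)
		pthread_join(waiters[i], NULL);
	return (0);
}
```

Switching the `pthread_cond_broadcast()` call to `pthread_cond_signal()`
reproduces the hang described above: one waiter wakes up and the other
sleeps forever.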
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes to a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we dirty data,
the accounting wouldn't be decremented incrementally as the ZIOs of
those writes completed (the reason being that `dbuf_write_physdone()`
isn't called as we go through the override code paths, and thus
`dsl_pool_undirty_space()` is never called). As a result we'd have to
wait until we get to `dsl_pool_sync()`, where we zero out all dirty
data accounting for the pool and the current TXG's metadata.
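The following toy model (the names and numbers are illustrative, not
taken from the ZFS code) shows the accounting behavior described above:
dirty bytes are added when a write is dirtied and are supposed to drain
back out as each write completes; when the per-write decrement is
skipped, as on the sync/dedup override paths, the counter only returns
to zero when the whole TXG is synced.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy per-txg dirty accounting; illustrative only. */
static int64_t dirty_pertxg;

static void
dirty_space(int64_t space)	/* called when data is dirtied */
{
	dirty_pertxg += space;
}

static void
undirty_space(int64_t space)	/* called as a write completes */
{
	assert(space <= dirty_pertxg);
	dirty_pertxg -= space;
}

int
main(void)
{
	/* Normal path: each completed write drains what it dirtied. */
	dirty_space(128);
	dirty_space(128);
	undirty_space(128);
	undirty_space(128);
	printf("normal path leftover: %lld\n", (long long)dirty_pertxg);

	/*
	 * Override path (sync=always, dedup): the per-write completion
	 * callback never runs, so nothing drains until the sync phase
	 * zeroes the counter wholesale.
	 */
	dirty_space(128);
	dirty_space(128);
	printf("override path leftover before sync: %lld\n",
	    (long long)dirty_pertxg);
	undirty_space(dirty_pertxg);	/* what dsl_pool_sync() effectively does */
	printf("after sync: %lld\n", (long long)dirty_pertxg);
	return (0);
}
```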
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing
have nothing to do with what we observe:
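````
/*
 * We have written all of the accounted dirty data, so our
 * dp_space_towrite should now be zero. However, some seldom-used
 * code paths do not adhere to this (e.g. dbuf_undirty(), also
 * rounding error in dbuf_write_physdone).
 * Shore up the accounting of any dirtied space now.
 */
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````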
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
the ZIOs are done we would undirty less than we dirtied from the
pool's accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first make sure to zero out the pool's
dirty data accounting in `dsl_pool_sync()` after we have synced the
MOS. Then we can go ahead and enable the update of the pool's dirty
data accounting whenever we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.

Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
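As a rough sketch of what keeping a dirty record in step with a resize
looks like (the struct and field names below are hypothetical stand-ins,
not the actual ZFS `dbuf_dirty_record_t`): when a buffer grows or
shrinks while it has a dirty record for the open TXG, the record's
accounted byte count has to move by the same delta, otherwise write
completion undirties less than was dirtied.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for a dbuf and its dirty record; illustrative only. */
typedef struct dirty_record {
	int64_t accounted;	/* bytes charged to the pool's dirty total */
} dirty_record_t;

typedef struct dbuf {
	int64_t size;
	dirty_record_t *dr;	/* dirty record for the open txg, if any */
} dbuf_t;

static int64_t pool_dirty;	/* analogue of the pool-wide dirty total */

static void
dbuf_resize(dbuf_t *db, int64_t new_size)
{
	int64_t delta = new_size - db->size;

	db->size = new_size;
	if (db->dr != NULL) {
		/* Keep the dirty record and the pool total in step. */
		db->dr->accounted += delta;
		pool_dirty += delta;
	}
}

int
main(void)
{
	dirty_record_t dr = { .accounted = 4096 };
	dbuf_t db = { .size = 4096, .dr = &dr };

	pool_dirty = 4096;
	dbuf_resize(&db, 16384);	/* grow the buffer while dirty */

	/* Completion undirties exactly what the record says was dirtied. */
	pool_dirty -= dr.accounted;
	printf("leftover dirty bytes: %lld\n", (long long)pool_dirty);
	return (0);
}
```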
The problem is that we still don't know how the bug came up in the
issue filed. That said, the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Delphix Upstream Bug: DLPX-47285
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>