Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent metaslab_sync panic due to spa_final_dirty_txg #9231

Merged
merged 2 commits into from
Aug 30, 2019

Conversation

pcd1193182
Copy link
Contributor

This change has a cherry-picked commit from #9209 as well, since it will be integrated after that commit. It also reverts 'ZTS: Temporarily disable several upgrade tests', since it fixes the issues that was intended to work around.

Motivation and Context

See #9186

Description

If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

How Has This Been Tested?

zfs-test and zloop

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@pcd1193182 pcd1193182 added the Status: Code Review Needed Ready for review and testing label Aug 29, 2019
@pcd1193182 pcd1193182 changed the title Condense Prevent metaslab_sync final_dirty_txg panic Aug 29, 2019
@pcd1193182 pcd1193182 changed the title Prevent metaslab_sync final_dirty_txg panic @pcd1193182 Prevent metaslab_sync panic due to spa_final_dirty_txg Aug 29, 2019
@ahrens ahrens changed the title @pcd1193182 Prevent metaslab_sync panic due to spa_final_dirty_txg Prevent metaslab_sync panic due to spa_final_dirty_txg Aug 29, 2019
module/zfs/metaslab.c Outdated Show resolved Hide resolved
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split the ZTS revert in to a separate commit in this PR. There's still some discussion going on in #9209 regarding the best way to handle that case. Until both cases are resolved we can't re-enable the tests, but that shouldn't prevent us from integrating this fix.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@ahrens
Copy link
Member

ahrens commented Aug 29, 2019

I think that we actually don't need #9209. This change should be sufficient to fix the bug and re-enable the tests.

@behlendorf
Copy link
Contributor

I'll make sure this gets run through the CI a few times to confirm that's the case.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 29, 2019
@codecov
Copy link

codecov bot commented Aug 29, 2019

Codecov Report

Merging #9231 into master will decrease coverage by 2.26%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9231      +/-   ##
==========================================
- Coverage   79.08%   76.82%   -2.27%     
==========================================
  Files         400      373      -27     
  Lines      122034   118922    -3112     
==========================================
- Hits        96516    91365    -5151     
- Misses      25518    27557    +2039
Flag Coverage Δ
#kernel 76.03% <100%> (-3.68%) ⬇️
#user 66.07% <50%> (-0.93%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f66ad58...2d5d5df. Read the comment docs.

@behlendorf
Copy link
Contributor

This looks good. The issue hasn't been reproduced by the CI after over 15 full ZTS runs.

@behlendorf behlendorf merged commit 475aa97 into openzfs:master Aug 30, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 24, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9185
Closes openzfs#9186
Closes openzfs#9231
Closes openzfs#9253
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 27, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9185
Closes openzfs#9186
Closes openzfs#9231
Closes openzfs#9253
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9185
Closes #9186
Closes #9231
Closes #9253
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9185
Closes openzfs#9186
Closes openzfs#9231
Closes openzfs#9253
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants