
WIP: Remove spa_namespace_lock from zpool status #16507

Closed
wants to merge 1 commit

Conversation

Contributor

@tonyhutter tonyhutter commented Sep 4, 2024

Motivation and Context

Prevent zpool status from hanging if the zfs module is holding the spa_namespace_lock.

Description

This commit removes spa_namespace_lock from the zpool status codepath. This means that zpool status will not hang if a pool fails while holding the spa_namespace_lock.

Background:

The spa_namespace_lock was originally meant to protect the spa_namespace_avl AVL tree. The spa_namespace_avl tree holds the mappings from pool names to their spa_t. So if you wanted to lookup the spa_t for the "tank" pool, you would do an AVL search for "tank" while holding spa_namespace_lock.

Over time, though, the spa_namespace_lock was repurposed to protect other critical codepaths in the spa subsystem. In many cases we don't know what the original authors meant to protect with it, or whether they needed it for read-only or read-write protection. It is simply "too big and risky to fix properly".

The workaround is to add a new lightweight version of the spa_namespace_lock called spa_namespace_lite_lock. spa_namespace_lite_lock only protects the AVL tree, and nothing else. It can be used for read-only access to the AVL tree without requiring the spa_namespace_lock. Calls to spa_lookup_lite() and spa_next_lite() only need to acquire a reader lock on spa_namespace_lite_lock; they do not need to also acquire the old spa_namespace_lock. This allows us to still run zpool status even if the zfs module has spa_namespace_lock held. Note that these AVL tree locks only protect the tree, not the actual spa_t contents.

How Has This Been Tested?

I added a new module param spa_namespace_delay_ms to introduce an artificial delay right after acquiring the spa_namespace_lock. With a one-second delay configured, zpool create, zpool destroy, and zpool import each took at least one second longer, while zpool status, zpool get, and zpool list still ran instantaneously. I added a test case for this as well.

Marking this as WIP since I want to do some more manual testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter tonyhutter added the Status: Work in Progress Not yet ready for general review label Sep 4, 2024
spa_t *
spa_lookup_lite(const char *name)
{
static spa_t search; /* spa_t is large; don't allocate on stack */
Contributor

Doesn't this burn the metaphorical house down on running multiple calls to it at once, or do I not understand how static declarations work in the kernel?

Contributor

I suppose, less obliquely.

What does a second lock buy you that just making spa_namespace_lock a rwlock doesn't? I get "it's too big to fix all at once", but if you just declare them all writers except for the codepaths you're only taking lite on here, it seems to produce the same outcomes, no?

And if not, I would predict similar confusion on people trying to consume this interface.

Contributor Author

@tonyhutter tonyhutter Sep 4, 2024

Doesn't this burn the metaphorical house down on running multiple calls to it at once

It's ok in this case since it's protected by spa_namespace_lite_lock.

declare them all writers except for the codepaths you're only taking lite on here, it seems to produce the same outcomes, no?

Unfortunately, if one of the pools gets hosed up while holding spa_namespace_lock as writer, zpool status will hang trying to acquire the reader lock.

Contributor

Isn't that still true on the lite lock, then?

Contributor Author

@tonyhutter tonyhutter Sep 4, 2024

Now that you're making me think about it, we may not actually need a separate spa_namespace_lite_avl tree. We may just need the new lock. The two separate trees make it conceptually easier to understand, but I don't know if it's functionally needed, since the trees must always be exactly the same.

Contributor Author

Isn't that still true on the lite lock, then?

Technically yes, but realistically no. The lite lock is only held as writer when adding/removing from the AVL tree (which is a very short, non-blocking operation):

       rw_enter(&spa_namespace_lite_lock, RW_WRITER);                            
       avl_remove(&spa_namespace_avl, spa);             
       rw_exit(&spa_namespace_lite_lock);
...
       rw_enter(&spa_namespace_lite_lock, RW_WRITER);                            
       avl_add(&spa_namespace_avl, spa);  
       rw_exit(&spa_namespace_lite_lock);

All the other times the lite lock is taken as reader.

Contributor

I don't think the two separate trees save you anything, since I think the only time they'd be useful is if you wanted to access the ro tree while the rw tree had a lock held, and that isn't allowed with this dance anyway.

Contributor Author

@tonyhutter tonyhutter Sep 4, 2024

I removed the lite tree and my local testing worked. The lite tree is removed in my latest push.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
@tonyhutter
Contributor Author

Testing is showing that not all users of the spa_namespace_lock are properly locking down their spa_t after spa_lookup()/spa_next(). They should be calling spa_config_enter() with the specific per-spa_t locks they care about. For example, spa_open_common() takes the namespace lock, looks up the spa_t, and then just starts operating on the spa_t willy-nilly.

The code pattern looks like this:

// protect the AVL tree... and in our imagination, also protect the spa_t we get back...
mutex_enter(&spa_namespace_lock)
spa = spa_lookup(poolname)

< possibly call spa_config_enter(spa, ...) to lock certain fields in spa, possibly not... >
< do stuff with spa>
< possibly call spa_config_exit(spa, ...), possibly not... >

mutex_exit(&spa_namespace_lock)

So we're going to need more than simply the addition of the spa_namespace_lite_lock. We're also going to need a new per-spa_t lock that serves as a placeholder for the "I took the spa_namespace_lock so I'm invincible" cases - call it spa->spa_namespace_legacy_lock. So imagine if we totally removed spa_namespace_lock and replaced the code snippet above with this:

// Protect the AVL tree from modification
rw_enter(&spa_namespace_lite_lock, RW_READER)

// lookup our spa_t
spa = spa_lookup_lite(poolname)

// protect our spa_t
rw_enter(&spa->spa_namespace_legacy_lock, RW_READER/RW_WRITER)

// our spa_t is protected, remove the lock on the AVL tree
rw_exit(&spa_namespace_lite_lock)

< do stuff with spa>

// all done with our spa_t
rw_exit(&spa->spa_namespace_legacy_lock)

I'll try to put a prototype of this together and see how it shakes out.

@tonyhutter
Contributor Author

It's going to be some time before I'll be able to get back to this. I'll close it for now so it's not a zombie PR.

@tonyhutter tonyhutter closed this Oct 10, 2024