Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use-after-free when aborting already-closed exchange #19747

Closed
bzbarsky-apple opened this issue Jun 18, 2022 · 4 comments · Fixed by #19780
Closed

Use-after-free when aborting already-closed exchange #19747

bzbarsky-apple opened this issue Jun 18, 2022 · 4 comments · Fixed by #19780

Comments

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Jun 18, 2022

Problem

I tried using the code added in #19328 in RemoveFabric, and ran into crashes in the following scenario:

  1. There is an exchange on the server that has been closed but is still waiting for an ack. It's held alive by a single reference from the retransmission table.
  2. A new command comes in that calls into the AbortExchangesForFabricExceptOne API from Add API to invalid sessions/exchanges for UpdateNOC command #19328 (in my case RemoveFabric).

AbortExchangesForFabricExceptOne ends up looping through all live exchanges, and the exchange waiting for an ack is live. It calls Abort() on the exchanges. Abort() does the following, in order:

  1. Notifiies the delegate.
  2. Flushes acks.
  3. Removes the exchange from the retrans table.
  4. Tries to cancel the response timer, which uses members of the exchange.
  5. Calls Release().

In this case, after step 3 our object is dead because the last ref got dropped, and at step 4 we crash. But even if we held an extra ref on the stack to keep it alive through step 4, the unbalanced Release() in step 5 would mean we end up refcounting a dead object at some point.

Proposed Solution

I think we need to do something smarter than calling Abort on all existing exchanges for the fabric, in terms of managing the exchange refcounts. In particular, I would propose we do the following:

  1. For exchanges that are not on the one session we are keeping around, do nothing, because marking those sessions for removal will deal with the exchanges on them correctly already.
  2. For exchanges that are on the session we are keeping around, ideally call SessionReleased on their SessionHolder if we can manage it, to mimic what would happen if the session were actually getting released. If we can't manage that, call OnSessionReleased on the exchange itself, which I think might be good enough: it might retain the reference to the session if it does not die (because the app is holding on to it), but future OnSessionReleased calls on it would be no-ops because it won't be in the retrans table anymore. In particular, OnSessionReleased already handles the various refcounting scenarios correctly.

Another possibility is that instead of having an API on the exchange manager here we have an API on secure session which is basically "notify all holders except this given one that you are going away", plus an API on exchange that:

  1. Nukes all sessions except the exchange's session for the exchange's fabric, using the session manager API from Add API to invalid sessions/exchanges for UpdateNOC command #19328.
  2. Calls the session API to notify all holders except the exchange's holder.
  3. Flags the exchange itself to mark the session for removal when the exchange is destroyed.

@kghost @mrjerryjohns I actually kind of like that last idea... It funnels as much as possible through existing, tested, code paths in terms of the exchange-termination behavior.

@kghost
Copy link
Contributor

kghost commented Jun 18, 2022

Abort and Close need to be reentrant

@bzbarsky-apple
Copy link
Contributor Author

The can't be "reentrant" because they mess with refcounts.

@kghost
Copy link
Contributor

kghost commented Jun 18, 2022

The can't be "reentrant" because they mess with refcounts.

Let me rephrase my word. We need a SafetyAbort function, which can be called on an exchange disregarding what state the exchange is. I think currently ExchangeContext::OnSessionReleased implementation is just what we need for SafetyAbort.

So my solution is:

  1. Rename ExchangeContext::OnSessionReleased to SafetyAbort
  2. In ExchangeManager::AbortExchangesForFabricExceptOne use SafetyAbort instead of Abort

After this change, I think people will always prefer SafetyAbort rather than Abort, and it becomes our reentrant version Abort

@bzbarsky-apple
Copy link
Contributor Author

bzbarsky-apple commented Jun 18, 2022

That seems viable (with a better name for SafetyAbort; I'll think about the naming a bit; maybe this API should just be called Abort and the current Abort renamed to something else, for example). I think exposing such a function on an exchange is generally a good idea.

As for whether we should use it from ExchangeManager::AbortExchangesForFabricExceptOne, but I am still a little worried about the complexity of having different logic flows between the "normal" way sessions go away and the setup we have now with explicit Abort/SafetyAbort calls. In particular, I'd rather not introduce an extra state where we have exchanges still around because someone is holding refs to them but with a pointer to a session that is in a state that should generally not allow exchanges... If we go down the normal session removal codepath via the holders, the holders will get cleared out in the process.

Three last thoughts for the moment:

  1. For the bits I proposed above, if we want to do them, this bit:

    we have an API on secure session which is basically "notify all holders except this given one that you are going away"

    can in fact be just MarkInactive which would be changed to claim the session is going away to all holders except a single one passed in.... Unless we have use cases for marking a session inactive but having various session holders that are not exchanges hold on to it, and possibly allowing multiple exchanges to keep using it? Which I don't think we do.

  2. The more I think about it, the more I like the idea of making the "remove the session destruction" bits on exchange private to the exchange and only reachable via a high-level API that makes sure we are in fact in the right state (in particular that the session has been marked inactive)...

  3. In ~ExchangeContext, when we mark the session for removal we should probably move it out of the holder into a stack ref first, so the call will not call back into the exchange in the middle of its destructor. That's sort of independent of what else we do here.

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 20, 2022
Our "close all exchanges except this special one for this fabric" API
messes up exchange refcounting, leading to use-after-free.

The fix is to reuse, as much as possible, the normal "session is going
away" flow to notify exchanges, and other session consumers, that the
sessions are in fact going away.

Fixes project-chip#19747
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 21, 2022
Our "close all exchanges except this special one for this fabric" API
messes up exchange refcounting, leading to use-after-free.

The fix is to reuse, as much as possible, the normal "session is going
away" flow to notify exchanges, and other session consumers, that the
sessions are in fact going away.

Fixes project-chip#19747
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 21, 2022
Our "close all exchanges except this special one for this fabric" API
messes up exchange refcounting, leading to use-after-free.

The fix is to reuse, as much as possible, the normal "session is going
away" flow to notify exchanges, and other session consumers, that the
sessions are in fact going away.

Fixes project-chip#19747
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 21, 2022
Our "close all exchanges except this special one for this fabric" API
messes up exchange refcounting, leading to use-after-free.

The fix is to reuse, as much as possible, the normal "session is going
away" flow to notify exchanges, and other session consumers, that the
sessions are in fact going away.

Fixes project-chip#19747
woody-apple pushed a commit that referenced this issue Jun 22, 2022
* Fix crashes when closing all exchanges for fabric.

Our "close all exchanges except this special one for this fabric" API
messes up exchange refcounting, leading to use-after-free.

The fix is to reuse, as much as possible, the normal "session is going
away" flow to notify exchanges, and other session consumers, that the
sessions are in fact going away.

Fixes #19747

* Address review comments.

* Updates to fix fallout from #19502.

We need to allow messages on inactive sessions to reach the exchange manager,
because such sessions need to be able to deliver an MRP ack to an exchange
waiting for one.

We also don't want to crash on an attempt to transition from Inactive to Defunct
state; the transition should just be ignored.  This way if we start trying to
transitionin to Defunct on MRP delivery failures we will not start crashing if
such a failure happens on an Inactive session.

* Address review comment

* Address review comments

* Address Jerry's review comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants