Fix use of timers when managing capability background work loops #5523
Conversation
Force-pushed 471d5c0 to c9b934b
Codecov Report

```
@@            Coverage Diff            @@
##           master    #5523     +/-   ##
=========================================
+ Coverage   61.78%    61.8%    +0.02%
=========================================
  Files         344      344
  Lines       28685    28692        +7
  Branches     3263     3261        -2
=========================================
+ Hits        17722    17733       +11
+ Misses       9802     9800        -2
+ Partials     1161     1159        -2
```
Force-pushed ded34f0 to c6accbe
A possible "optimization" to these changes - currently capability background work is scheduled by the capability scheduling its …
I suggest passing the timer interval as a parameter of scheduleBackgroundWork and removing the CapDesc parameter from postWork.

Also, there are too many off-topic changes - I recommend not doing using namespace std at all.
Re optimization - I think it sounds ok in general. A bit disappointing is that the API of libp2p gets less flexible/less general - e.g. capabilities can't have more than one timer if they were to need it. The simple way to do this optimization would be to require them to always provide a function for background work - i.e. no option to not have one... But I guess it's ok to get rid of this flexibility, as it actually is not needed at the moment, and we don't plan any new capabilities soon.
Removing the std:: prefix added a lot of churn; I've undone those changes (in non-test files). Sorry about that!
Force-pushed 28a7766 to 18b5273
cc @gumb0
Force-pushed 18b5273 to 47aa94a
Force-pushed 47aa94a to e4eb475
Force-pushed 49290af to b15b6bd
Squashed some commits
Make static constants constexpr and move them to an anonymous namespace. Update license messages and add a few comments.
The problem was with how the host garbage-collected expired timers (m_networkTimers) in Host::run - it checked each timer's expiration time and deleted "expired" timers, but there's a race condition: a timer's expiration time can have been reached while its handler is still sitting in the work queue, not yet executed. In that case the timer is deleted and the handler is run with error code operation_aborted. To fix this, I create a dedicated steady timer (chosen over a deadline timer for its compatibility with std::chrono) for each capability in the host's m_capabilities map when registering a new capability, and use that timer to schedule the capability's work loop. I also replaced the Host::scheduleExecution function with Host::scheduleCapabilityBackgroundWork, which capabilities use to schedule their background work loop, and added Host::postCapabilityWork so capabilities can execute work on the network thread. Timers are now cancelled on capability shutdown.
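For illustration, a minimal standalone sketch (generic boost::asio code, not aleth's) of the behaviour underlying this fix: a handler whose wait is cancelled receives operation_aborted, so a dedicated per-capability timer lets the handler itself distinguish a real expiry from a cancellation, instead of the host guessing from a shared timer list.

```cpp
// Generic boost::asio sketch - names and the 100ms interval are illustrative.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

namespace ba = boost::asio;

int main()
{
    ba::io_service ios;
    ba::steady_timer timer{ios, std::chrono::milliseconds(100)};

    timer.async_wait([](boost::system::error_code const& _ec) {
        if (_ec == ba::error::operation_aborted)
            std::cout << "wait aborted - skip this iteration\n";
        else
            std::cout << "run one background work iteration\n";
    });

    // Cancelling a *pending* wait aborts it; the handler still runs, but with
    // operation_aborted, so no capability work is executed spuriously.
    timer.cancel();
    ios.run();
}
```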
Change capability background work intervals from constexpr in an unnamed namespace to static constexpr class members, since this makes more conceptual sense given that the values are exposed via a function (backgroundWorkInterval()). Also update some license messages, remove unnecessary class names when creating CapDesc instances, change the Host run timer from a deadline timer to a steady timer so we can use std::chrono, and add some comments.
Cancel capability timers on the network thread by setting their expiration time to c_steadyClockMin, to avoid a race condition where cancellation fails because one attempts to cancel timers which have already expired. This also enables the removal of CapabilityFace::onStopping and the capability member m_backgroundWorkEnabled.
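A hedged sketch of that cancellation trick - c_steadyClockMin is the sentinel named in this commit, while the zero-delay timer and the handler body are assumptions. cancel() only aborts waits that are still pending, so a handler that has already expired or been queued must detect shutdown some other way: here, by comparing the timer's expiry against the sentinel.

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <memory>

namespace ba = boost::asio;

// Sentinel expiry used to signal shutdown (name taken from the PR).
constexpr std::chrono::steady_clock::time_point c_steadyClockMin =
    std::chrono::steady_clock::time_point::min();

int main()
{
    ba::io_service ios;
    auto timer = std::make_shared<ba::steady_timer>(ios);

    timer->expires_from_now(std::chrono::milliseconds(0));
    timer->async_wait([timer](boost::system::error_code const&) {
        // Check the sentinel before doing any work, regardless of error code.
        if (timer->expires_at() == c_steadyClockMin)
        {
            std::cout << "shutdown requested - skip the work\n";
            return;
        }
        std::cout << "do background work\n";
    });

    // "Cancel" by rewinding the expiry: even a handler that can no longer be
    // aborted will observe the sentinel and bail out.
    timer->expires_at(c_steadyClockMin);
    ios.run();
}
```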
The host now completely manages capability work loops via the new Host::runCapabilityWorkLoop() function, which schedules itself via a capability's timer. This enabled the removal of Host::scheduleCapabilityBackgroundWork and CapabilityFace::onStarting. I've also addressed some PR feedback, namely removing the redundant "static" for constexpr in an unnamed namespace (since such constants have static storage duration anyway), naming static constants with a c_ prefix, and making the Host::postCapabilityWork() function more general (removing the unnecessary CapDesc parameter and renaming the function to simply Host::postWork()).
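A self-contained sketch of that self-scheduling shape (generic asio; runWorkLoop, the 10ms interval, and the iteration counter standing in for the real run-until-shutdown loop are all assumptions, not aleth's names):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <memory>

namespace ba = boost::asio;

// One iteration of capability work, then re-arm the timer for the next one.
void runWorkLoop(std::shared_ptr<ba::steady_timer> _timer, int _remaining)
{
    if (_remaining == 0)
        return;  // in the real code the loop ends via timer cancellation
    _timer->expires_from_now(std::chrono::milliseconds(10));
    _timer->async_wait([_timer, _remaining](boost::system::error_code const& _ec) {
        if (_ec)
            return;  // cancelled - stop rescheduling
        std::cout << "capability background work iteration\n";
        runWorkLoop(_timer, _remaining - 1);
    });
}

int main()
{
    ba::io_service ios;
    runWorkLoop(std::make_shared<ba::steady_timer>(ios), 3);
    ios.run();
}
```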
Force-pushed b15b6bd to 5174079
Rebased
Remove unnecessary << operator overloading for CapDesc and update Host::startCapabilities to schedule each capability (via Host::scheduleCapabilityWorkLoop) rather than post the work to the network thread (since the ~1s we save on network start via the post isn't noticeable in practice). Also revert the "using namespace std" changes, since it turns out that we don't want to import the std namespace by default (despite many files doing this). Also update the license message in the CapabilityHost files and remove unnecessary headers from EthereumCapability / WarpCapability.
Force-pushed 5174079 to 8b11d40
This test is failing due to a UAF: …

I've investigated, and the problem is that there is still a capability handler in the run queue when the Host is destroyed, which results in the Capability(Face) instance staying alive (since the handler captures a …).

Crash stack: …

I think we can address this as follows: …
A use-after-free can occur if capability handlers are still in the boost io_service handler queue when the Host is destroyed: capability handlers capture a shared_ptr to the capability, and destroying the capability results in Host::forEachPeer() being called, which iterates over sessions that have already been destroyed (since the Host's session map - m_sessions - is defined after the io_service in Host.h). I think the root of the issue is that the capability handlers capture a shared_ptr to the capability. This doesn't make sense: the host owns the capability lifetime and manages the resources the capabilities require (e.g. network sockets/sessions), so a shared_ptr in the handler shouldn't keep capabilities alive longer than the host. As such, I've changed the shared_ptr reference to a raw pointer. This is safe, with no possibility of UAF, because we check the capability timer's expiration time before calling into the capability's background work function (the timer is captured as a shared_ptr, so it's guaranteed to be alive), and the capability timers are cancelled on host shutdown, so we avoid calling into a capability after it has been destroyed.

Another issue is that the timer cancellations aren't processed on shutdown if there aren't any peers or incoming connections, because the capabilities are cancelled (via lambdas posted to the network thread) after the io_service has stopped, and Aleth only polls the io_service to clear out peers/connections. To address this, I've added an io_service poll call after stopCapabilities() in Host::doneWorking() to ensure that the cancellations are processed.
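A sketch of the capture pattern this describes, using assumed stand-in types rather than aleth's: the handler keeps the timer alive via a shared_ptr but refers to the capability only through a raw pointer, so a queued handler can never extend the capability's lifetime past its owner; the sentinel-expiry check from the earlier commit is what makes dereferencing that pointer safe.

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <memory>

namespace ba = boost::asio;

constexpr std::chrono::steady_clock::time_point c_steadyClockMin =
    std::chrono::steady_clock::time_point::min();

struct Capability  // simplified stand-in for CapabilityFace
{
    void doBackgroundWork() { std::cout << "work\n"; }
};

int main()
{
    ba::io_service ios;
    auto timer = std::make_shared<ba::steady_timer>(ios);
    auto cap = std::make_unique<Capability>();  // the owner controls lifetime

    timer->expires_from_now(std::chrono::milliseconds(0));
    // The shared_ptr capture keeps the timer alive; the raw pointer
    // deliberately does not keep the capability alive.
    timer->async_wait([timer, rawCap = cap.get()](boost::system::error_code const&) {
        if (timer->expires_at() == c_steadyClockMin)
            return;  // owner shut down; rawCap may dangle, so don't touch it
        rawCap->doBackgroundWork();
    });

    ios.run();
}
```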
I didn't quite get why this is needed. What happens if the cancellations aren't processed and sit in the handler queue? If we don't poll io_service at this point anymore, it shouldn't affect anything?
libp2p/Host.h (Outdated)

```diff
@@ -307,8 +307,7 @@ class Host: public Worker
     void startCapabilities();

     /// Schedule's a capability's work loop on the network thread
-    void scheduleCapabilityWorkLoop(
-        std::shared_ptr<CapabilityFace> _cap, std::shared_ptr<ba::steady_timer> _timer);
+    void scheduleCapabilityWorkLoop(CapabilityFace* _cap, std::shared_ptr<ba::steady_timer> _timer);
```
Better to pass it by reference than by raw pointer.
@gumb0: We don't need to cancel the capability background work loops to address the UAF as long as we capture a raw capability pointer rather than a smart pointer in the background work loop handler. I don't know if shutting down the host without cancelling the work loops would cause other issues though (I don't know enough about syncing and how the capabilities are involved). What are your thoughts - would this be safe to do?
But cancelling the work loops won't affect anything unless …

This cancelling now happens inside …
But the io_service can still be polled after we stop the capabilities if we are disconnecting peers or pending handshakes (Lines 220 to 251 in f427584).
Also capture the passed capability by reference in the capability timer lambda.
Right - in case it is polled, timers will be cancelled anyway by these …
I'll merge it now; please add a changelog item and remove …
@gumb0: Ah, I see what you're saying - we only need to cancel the timers if we are going to schedule more capability work, and more capability work is only scheduled if poll is called while disconnecting pending handshakes / peers. If we post the timer cancellations before we disconnect peers / handshakes, the polling will cancel the timers, so there's no need to explicitly call poll after calling stopCapabilities. My confusion was because I wasn't sure if we needed to cancel the timers at all. I'll remove the poll call and update the changelog in a new PR.
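A minimal demonstration of the ordering point (generic asio code, nothing aleth-specific): work posted to an io_service before a poll() is drained by that poll(), which is why cancellations posted ahead of the disconnect loop need no extra poll call.

```cpp
#include <boost/asio.hpp>
#include <iostream>

int main()
{
    boost::asio::io_service ios;
    // Stands in for the timer cancellations posted by stopCapabilities().
    ios.post([] { std::cout << "timer cancellation runs here\n"; });
    // Stands in for the polling done while disconnecting peers / handshakes;
    // it also drains the cancellation posted above.
    ios.poll();
}
```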
Add message for removal of ioservice polling and a message for the changes which fixed the capability timer race condition (#5523)
Reference removal of ioservice polling and a message for the changes which fixed the capability timer race condition (#5523)
Fix #5508
Fix a timer race condition which can result in the capability background work loops being prematurely terminated. Also do some light code cleanup in related areas and update license messages.
I've fixed the race condition by giving each capability its own dedicated steady_timer in the host (stored in the new CapabilityRuntime struct kept in the m_capabilities map). I've removed Host::m_networkTimers and the associated Host::scheduleExecution, since there was no need for general-purpose work scheduling via the host - we only ended up using it to schedule the capability background work loops or to run capability work on the network thread. Capabilities now call the new Host::scheduleCapabilityBackgroundWork function to schedule their next work loop iteration and can post work to the network thread via Host::postCapabilityWork. Both of these functions require a CapDesc to be supplied and verify that the CapDesc is in the registered capabilities map.

Capability shutdown is now done via timer cancellation, which means that there's no longer a need for CapabilityFace::onStopping or the m_backgroundWorkEnabled member in each capability implementation - as such, I've removed them.

I've verified syncing still works on Windows and Linux (Ubuntu).
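To make the bookkeeping concrete, a hedged sketch of its shape - CapDesc, CapabilityRuntime, and m_capabilities are names from the description above, while the field layout and the (name, version) definition of CapDesc are assumptions:

```cpp
#include <boost/asio.hpp>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace ba = boost::asio;

using CapDesc = std::pair<std::string, unsigned>;  // (name, version) - assumed

struct CapabilityFace  // simplified stand-in for the capability interface
{
    virtual ~CapabilityFace() = default;
    virtual void doBackgroundWork() = 0;
};

// Each registered capability carries its own dedicated work-loop timer, so
// there is no shared timer list for the host to garbage-collect.
struct CapabilityRuntime
{
    std::shared_ptr<CapabilityFace> capability;
    std::shared_ptr<ba::steady_timer> backgroundWorkTimer;
};

struct EthCapability : CapabilityFace  // illustrative only
{
    void doBackgroundWork() override {}
};

int main()
{
    ba::io_service ios;
    std::map<CapDesc, CapabilityRuntime> m_capabilities;
    m_capabilities[{"eth", 63u}] = CapabilityRuntime{
        std::make_shared<EthCapability>(), std::make_shared<ba::steady_timer>(ios)};
}
```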