This repository has been archived by the owner on Oct 28, 2021. It is now read-only.

Fix use of timers when managing capability background work loops #5523

Merged
merged 9 commits into from
Mar 26, 2019

Conversation

halfalicious
Contributor

@halfalicious halfalicious commented Mar 16, 2019

Fix #5508

Fix a timer race condition which can result in the capability background work loops being prematurely terminated. Also do some light code cleanup in related areas and update license messages.

I've fixed the race condition by giving each capability its own dedicated steady_timer in the host (stored in the new CapabilityRuntime struct kept in the m_capabilities map). I've removed Host::m_networkTimers and the associated Host::scheduleExecution, since there was no need for general-purpose work scheduling via the host - we only ever used it to schedule the capability background work loops or to run capability work on the network thread. Capabilities now call the new Host::scheduleCapabilityBackgroundWork function to schedule their next work loop iteration and can post work to the network thread via Host::postCapabilityWork. Both of these functions require a CapDesc to be supplied and verify that the CapDesc is in the registered capabilities map.

Capability shutdown is now done via timer cancellation, which means that there's no longer a need for CapabilityFace::onStopping or the m_backgroundWorkEnabled member in each capability implementation - as such, I've removed them.
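
For illustration, here is a minimal sketch of the approach described above, assuming Boost.Asio. CapabilityRuntime, scheduleCapabilityBackgroundWork, postCapabilityWork, and m_capabilities are the names from this PR; everything else (the CapDesc alias, the io_context wiring, and the exact signatures) is simplified for illustration and is not the actual aleth implementation:

// Sketch only - per-capability timer owned by the Host (simplified, assumed API).
#include <boost/asio.hpp>
#include <chrono>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

using CapDesc = std::pair<std::string, unsigned>;  // (name, version)

struct CapabilityRuntime
{
    // The real struct also holds the capability instance; omitted here.
    std::shared_ptr<boost::asio::steady_timer> timer;
};

class Host
{
public:
    explicit Host(boost::asio::io_context& _io) : m_io(_io) {}

    // Registering a capability creates its dedicated timer.
    void registerCapability(CapDesc const& _cap)
    {
        m_capabilities[_cap] =
            CapabilityRuntime{std::make_shared<boost::asio::steady_timer>(m_io)};
    }

    // Schedule the next iteration of a capability's background work loop.
    void scheduleCapabilityBackgroundWork(CapDesc const& _cap,
        std::chrono::milliseconds _interval, std::function<void()> _work)
    {
        auto it = m_capabilities.find(_cap);
        if (it == m_capabilities.end())
            return;  // only registered capabilities may schedule work

        auto timer = it->second.timer;
        timer->expires_after(_interval);
        timer->async_wait([work = std::move(_work)](boost::system::error_code _ec) {
            if (!_ec)  // a cancelled timer completes with operation_aborted
                work();
        });
    }

    // Run arbitrary capability work on the network (io) thread.
    void postCapabilityWork(CapDesc const& _cap, std::function<void()> _work)
    {
        if (m_capabilities.count(_cap))
            boost::asio::post(m_io, std::move(_work));
    }

    // Shutdown: cancelling the timers ends the background work loops.
    void stopCapabilities()
    {
        for (auto& cap : m_capabilities)
            cap.second.timer->cancel();
    }

private:
    boost::asio::io_context& m_io;
    std::map<CapDesc, CapabilityRuntime> m_capabilities;
};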

I've verified syncing still works on Windows and Linux (Ubuntu).

@halfalicious halfalicious force-pushed the capability-timers branch 3 times, most recently from 471d5c0 to c9b934b Compare March 16, 2019 20:49
@codecov-io

codecov-io commented Mar 16, 2019

Codecov Report

Merging #5523 into master will increase coverage by 0.02%.
The diff coverage is 57.53%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #5523      +/-   ##
=========================================
+ Coverage   61.78%   61.8%   +0.02%     
=========================================
  Files         344     344              
  Lines       28685   28692       +7     
  Branches     3263    3261       -2     
=========================================
+ Hits        17722   17733      +11     
+ Misses       9802    9800       -2     
+ Partials     1161    1159       -2

@halfalicious halfalicious force-pushed the capability-timers branch 5 times, most recently from ded34f0 to c6accbe Compare March 18, 2019 00:10
@halfalicious halfalicious changed the title [WIP] Fix use of timers when managing capability background work loops Fix use of timers when managing capability background work loops Mar 18, 2019
@halfalicious halfalicious requested review from gumb0 and chfast and removed request for gumb0 March 18, 2019 05:40
@halfalicious
Contributor Author

A possible "optimization" to these changes - currently capability background work is scheduled by the capability scheduling its doBackgroundWork function via Host::scheduleCapabilityBackgroundWork()...so the capability will call into the host who will schedule the capability function via the timer. I'm thinking this isn't needed and the entire thing can be driven by the host which results in a slightly more straightforward execution flow - the host can directly schedule capability work via the timer (probably by capturing the capability instance in a lambda passed to async_wait and calling doBackgroundWork). This would also enable me to remove the onStarting function from each capability. Thoughts?
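
For concreteness, a rough sketch of what that host-driven flow could look like (the function name runCapabilityWorkLoop and the exact signatures are assumptions for illustration, not the merged code):

// Sketch of the proposed host-driven flow: the host reschedules the loop on
// the capability's timer after each iteration, so onStarting() is no longer needed.
void Host::runCapabilityWorkLoop(
    std::shared_ptr<CapabilityFace> _cap, std::shared_ptr<ba::steady_timer> _timer)
{
    _timer->expires_after(_cap->backgroundWorkInterval());
    _timer->async_wait([this, _cap, _timer](boost::system::error_code _ec) {
        if (_ec)
            return;  // timer was cancelled (e.g. host shutdown) - stop the loop
        _cap->doBackgroundWork();
        runCapabilityWorkLoop(_cap, _timer);  // schedule the next iteration
    });
}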

Member

@gumb0 gumb0 left a comment


I suggest passing the timer interval as a parameter of scheduleBackgroundWork and removing the CapDesc parameter from postWork.

Also, there are too many off-topic changes - I recommend not doing using namespace std at all.

Review threads (resolved):
libp2p/CapabilityHost.h
libp2p/Capability.h
libp2p/Host.cpp
libp2p/CapabilityHost.h
libp2p/Host.h
libethereum/WarpCapability.cpp
libethereum/WarpCapability.h
libethereum/EthereumCapability.h
libethereum/EthereumCapability.cpp
@gumb0
Member

gumb0 commented Mar 18, 2019

Re optimization - I think it sounds ok in general.

It's a bit disappointing that the libp2p API now gets less flexible/less general - e.g. capabilities can't have more than one timer, should they need it. The simple way to do this optimization would be to require them to always provide a function for background work - i.e. no option to not have one...

But I guess it's ok to get rid of this flexibility, as it actually is not needed at the moment, and we don't plan any new capabilities soon.

@halfalicious
Contributor Author

halfalicious commented Mar 19, 2019

I suggest passing the timer interval as a parameter of scheduleBackgroundWork and removing the CapDesc parameter from postWork.

Also, there are too many off-topic changes - I recommend not doing using namespace std at all.

Removing the std:: prefix added a lot of churn, so I've undone those changes (in non-test files). Sorry about that!

@halfalicious
Contributor Author

cc @gumb0

Review threads (resolved):
libp2p/Common.h
libp2p/Host.cpp
libethereum/EthereumCapability.cpp
@halfalicious halfalicious force-pushed the capability-timers branch 2 times, most recently from 49290af to b15b6bd Compare March 21, 2019 04:15
@halfalicious
Contributor Author

Squashed some commits

Make static constants constexpr and move to anonymous namespace. Update license messages and add a few comments.
The problem was with how the host garbage-collected expired timers (m_networkTimers) in Host::run - it checked each timer's expiration time and deleted "expired" timers. However, there's a race condition: a timer's expiration time can have been met while its handler is still sitting in the work queue, not yet executed; if the timer is deleted at that point, the handler is run with error code operation_aborted.

To fix this, I create a dedicated steady timer (I chose a steady timer over a deadline timer due to its compatibility with std::chrono) for each capability in the host's m_capabilities map when registering a new capability and use that timer to schedule the capability's work loop. I also replaced Host::scheduleExecution with Host::scheduleCapabilityBackgroundWork, which capabilities use to schedule their background work loops, and added Host::postCapabilityWork so capabilities can execute work on the network thread. Capability timers are cancelled on shutdown.
Change capability background work intervals from constexpr in the unnamed namespace to static constexpr class members, since this makes more conceptual sense given that the values are exposed via a function (backgroundWorkInterval()). Also update some license messages, remove unnecessary class names when creating CapDesc instances, change the Host run timer from a deadline timer to a steady timer so we can use std::chrono, and add some comments.
Cancel capability timers on the network thread by setting the expiration time to c_steadyClockMin, to avoid the race condition where cancellation fails because one attempts to cancel timers which have already expired. This also enables the removal of CapabilityFace::onStopping and the capability member m_backgroundWorkEnabled.
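
A sketch of that cancellation pattern, assuming Boost.Asio (the member names and the exact definition of c_steadyClockMin are illustrative):

// Illustrative only. Instead of cancel() - which cannot abort a handler that
// has already been queued for an expired timer - move the expiry to a sentinel
// value; the work-loop handler can then detect shutdown by checking for it.
const auto c_steadyClockMin = std::chrono::steady_clock::time_point::min();

void Host::stopCapabilities()
{
    // Posted to the network thread so the expiry change doesn't race the handler.
    boost::asio::post(m_ioService, [this] {
        for (auto& cap : m_capabilities)
            cap.second.timer->expires_at(c_steadyClockMin);
    });
}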
The host now completely manages capability work loops via the new Host::runCapabilityWorkLoop() function, which schedules itself via a capability's timer. This enabled the removal of Host::scheduleCapabilityBackgroundWork and CapabilityFace::onStarting.

I've also addressed some PR feedback: removed the redundant "static" for constexpr in the unnamed namespace (since such constants have static storage duration anyway), named static constants with a c_ prefix, and made the Host::postCapabilityWork() function more general (removed the unnecessary CapDesc parameter and renamed the function to simply Host::postWork()).
@halfalicious
Contributor Author

Rebased

Remove unnecessary << operator overloading for CapDesc and update Host::startCapabilities to schedule each capability (via Host::scheduleCapabilityWorkLoop) rather than post the work to the network thread (since the ~1s we save on network start via the post isn't noticeable in practice). Also revert the "using namespace std" changes, since it turns out that we don't want to import the std namespace by default (despite many files doing this).

Also update license message in CapabilityHost files and remove unnecessary headers from EthereumCapability / WarpCapability
@halfalicious
Contributor Author

halfalicious commented Mar 23, 2019

This test is failing due to a UAF:

531 - JsonRpcSuite/jsonrpc_isListening (Failed)

I've investigated and the problem is that there is still a capability handler in the run queue when the Host is destroyed, which keeps the Capability(Face) instance alive (since the handler captures a shared_ptr to the capability) until the io service is destroyed. Destroying the Capability releases the associated BlockChainSync (held via shared_ptr), and the BlockChainSync dtor executes what looks like sync cleanup code, which calls into CapabilityHost::foreachPeer, which calls Host::forEachPeer, which iterates over sessions - but the sessions have already been destroyed (they are declared after the io service in Host.h). Note that CapabilityHost has a reference to the Host, not a shared_ptr, which is why the Host can be destroyed in the first place.
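
As a tiny illustration of the destruction-order hazard (hypothetical code, not aleth's): C++ destroys data members in reverse declaration order, so anything the io service tears down in its destructor must not touch members declared after it.

#include <boost/asio.hpp>
#include <map>
#include <memory>

// Hypothetical, simplified Host-like type - only the member ordering matters.
struct HostLike
{
    boost::asio::io_context ioService;           // declared first  -> destroyed LAST
    std::map<int, std::weak_ptr<int>> sessions;  // declared second -> destroyed FIRST
};
// ~HostLike() destroys `sessions`, then `ioService`. If ~io_context still has a
// queued handler to destroy, and destroying that handler ends up iterating
// `sessions` (as the captured capability's teardown does here via forEachPeer),
// it touches an already-destroyed member - a use-after-free.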

Crash stack:

00000030`975fb5b0 00007fff`e2727e70 : 00000000`00000000 00000030`975ff5c0 00005693`00005693 00007ff8`0a1645c7 : vrfcore!VerifierStopMessageEx+0x827
00000030`975fb900 00007fff`c95c6439 : 00000030`975fbec0 00007ff6`3752f243 00000030`975fc3b0 000001dd`90b12fa0 : vrfcore!VfCoreRedirectedStopMessage+0x90
00000030`975fb990 00007ff8`0a23cf56 : 00000030`975fc3b0 00007fff`c91b27f0 00000030`975fc050 00007fff`c91b27d0 : verifier!VerifierStopMessage+0xb9
00000030`975fba40 00007fff`c91925e5 : 00000030`975fc220 00007ff8`0a16477c 00007ff6`36927e7c 00000030`00000000 : ntdll!RtlApplicationVerifierStop+0x96
00000030`975fbab0 00007fff`c919322e : 00000030`975fc3b0 000001dd`8090aff0 00000000`00000000 00007ff8`0a1645c7 : vfbasics!VerifierStopMessage+0x245
00000030`975fbb10 00007fff`c91928da : 00000030`975fbc30 00000000`00000000 00007ff6`36928a9c 00000030`00000000 : vfbasics!AVrfpCheckFirstChanceException+0x136
00000030`975fbba0 00007ff8`0a1c6b06 : 00007fff`c91928c0 00000030`975fc390 00000030`975fc640 00000030`975fc3a0 : vfbasics!AVrfpVectoredExceptionHandler+0x1a
00000030`975fbbf0 00007ff8`0a164849 : 00000030`975fc3b0 00000030`975fbec0 00000000`00000000 00007ff8`0a1645c7 : ntdll!RtlpCallVectoredHandlers+0x106
00000030`975fbc90 00007ff8`0a2033fe : 00000000`00000000 00000000`00000000 00007ff8`063eb570 00000000`00000000 : ntdll!RtlDispatchException+0x69
00000030`975fbec0 00007ff6`3752f243 : 000001dd`90b12fa0 00007ff6`369df1eb 000001dd`909c2e88 ffffffff`fffffffe : ntdll!KiUserExceptionDispatch+0x2e
00000030`975fc660 00007ff6`3752ef54 : 000001dd`909c2e50 00000030`975fc6f8 ffffffff`fffffffe 00000030`975fc930 : testeth!std::list<std::pair<dev::FixedHash<64> const ,std::weak_ptr<dev::p2p::SessionFace> >,std::allocator<std::pair<dev::FixedHash<64> const ,std::weak_ptr<dev::p2p::SessionFace> > > >::begin+0x43
00000030`975fc6a0 00007ff6`374f5861 : 000001dd`909c2e48 00000030`975fc6f8 00000030`975fc740 00000030`975fc868 : testeth!std::_Hash<std::_Umap_traits<dev::FixedHash<64>,std::weak_ptr<dev::p2p::SessionFace>,std::_Uhash_compare<dev::FixedHash<64>,std::hash<dev::FixedHash<64> >,std::equal_to<dev::FixedHash<64> > >,std::allocator<std::pair<dev::FixedHash<64> const ,std::weak_ptr<dev::p2p::SessionFace> > >,0> >::begin+0x24
00000030`975fc6d0 00007ff6`3754415e : 000001dd`909c2a70 00000030`975fc968 00000030`975fc868 00007ff6`37db3e58 : testeth!dev::p2p::Host::forEachPeer+0x81
00000030`975fc830 00007ff6`3719ae95 : 000001dd`90b1eff0 00000030`975fc968 00000030`975fc928 ffffffff`fffffffe : testeth!dev::p2p::`anonymous namespace'::CapabilityHost::foreachPeer+0x5e
00000030`975fc8c0 00007ff6`3719abbd : 000001dd`fec0bd40 000001dd`fec0bd60 000001dd`fec0bd60 00007ff6`369df1eb : testeth!dev::eth::BlockChainSync::continueSync+0xd5
00000030`975fc9a0 00007ff6`37197bdd : 000001dd`fec0bd40 000001dd`fec0bd60 000001dd`fec0bd60 000001dd`fab70000 : testeth!dev::eth::BlockChainSync::onPeerAborting+0x3d
00000030`975fc9e0 00007ff6`37197a72 : 000001dd`fec0bd40 000001dd`fec0bd60 00000000`00000018 000001dd`fec57df0 : testeth!dev::eth::BlockChainSync::abortSync+0x3d
00000030`975fca20 00007ff6`37161b17 : 000001dd`fec0bd40 00000000`00000000 00000030`975fd7d8 00000030`975ff5c0 : testeth!dev::eth::BlockChainSync::~BlockChainSync+0x42
00000030`975fca60 00007ff6`37162fa8 : 000001dd`fec0bd40 000001dd`00000001 000001dd`fec59fe0 00007ff6`3715eb01 : testeth!dev::eth::BlockChainSync::`scalar deleting destructor'+0x17
00000030`975fca90 00007ff6`369c4592 : 000001dd`feb37fe0 00007ff6`3716142e 000001dd`fec59fe0 00000000`00000001 : testeth!std::_Ref_count<dev::eth::BlockChainSync>::_Destroy+0x38
00000030`975fcae0 00007ff6`37162bc3 : 000001dd`feb37fe0 000001dd`fec59fe0 000001dd`fec59fe0 00000000`00000003 : testeth!std::_Ref_count_base::_Decref+0x32
00000030`975fcb10 00007ff6`3715ee8c : 000001dd`80000ba8 000001dd`fec59fe0 000001dd`80000bd0 00000000`00000000 : testeth!std::_Ptr_base<dev::eth::BlockChainSync>::_Decref+0x23
00000030`975fcb40 00007ff6`3715f0fb : 000001dd`80000ba8 000001dd`fab70000 000001dd`fab70000 00000000`00004a40 : testeth!std::shared_ptr<dev::eth::BlockChainSync>::~shared_ptr<dev::eth::BlockChainSync>+0x1c
00000030`975fcb80 00007ff6`37161b67 : 000001dd`80000ae0 00007fff`c9196516 00000000`00000071 00007ff8`0a17267d : testeth!dev::eth::EthereumCapability::~EthereumCapability+0x7b
00000030`975fcbb0 00007ff6`3713e3a3 : 000001dd`80000ae0 00007fff`00000000 00000000`00000000 00000000`00000000 : testeth!dev::eth::EthereumCapability::`scalar deleting destructor'+0x17
00000030`975fcbe0 00007ff6`369c4592 : 000001dd`80000ad0 00000030`00000000 000001dd`92f3ff80 00007ff6`36e66101 : testeth!std::_Ref_count_obj<dev::eth::EthereumCapability>::_Destroy+0x33
00000030`975fcc20 00007ff6`36e94923 : 000001dd`80000ad0 00007ff6`00000000 000001dd`92f3ff80 00007ff6`3751666f : testeth!std::_Ref_count_base::_Decref+0x32
00000030`975fcc50 00007ff6`36e922ac : 00000030`975fcd88 00007ff6`36e6b69f 00000000`00000000 000001dd`92f3ff80 : testeth!std::_Ptr_base<dev::p2p::CapabilityFace>::_Decref+0x23
00000030`975fcc80 00007ff6`37518aea : 00000030`975fcd88 00000000`00000070 00000030`975fcd70 00000000`00000000 : testeth!std::shared_ptr<dev::p2p::CapabilityFace>::~shared_ptr<dev::p2p::CapabilityFace>+0x1c
00000030`975fccc0 00007ff6`375192e6 : 00000030`975fcd70 00000000`00000070 00000030`975fcd70 00000030`975fd7d8 : testeth!<lambda_a3ed90a066e17ec32ad1f9c898b2ce87>::~<lambda_a3ed90a066e17ec32ad1f9c898b2ce87>+0x1a
00000030`975fccf0 00007ff6`37532664 : 00000030`975fcd70 000001dd`92f3ffc8 000001dd`92f3ffb8 000001dd`83eb7fc0 : testeth!boost::asio::detail::binder1<<lambda_a3ed90a066e17ec32ad1f9c898b2ce87>,boost::system::error_code>::~binder1<<lambda_a3ed90a066e17ec32ad1f9c898b2ce87>,boost::system::error_code>+0x16
00000030`975fcd20 00007ff6`368e74d8 : 00000000`00000000 000001dd`92f3ff80 00000030`975fcde0 00000000`00000000 : testeth!boost::asio::detail::wait_handler<<lambda_a3ed90a066e17ec32ad1f9c898b2ce87> >::do_complete+0xd4
00000030`975fcdc0 00007ff6`36e7c4d0 : 000001dd`92f3ff80 000001dd`fb81cff0 00000030`975ff5c0 ffffffff`fffffffe : testeth!boost::asio::detail::win_iocp_operation::destroy+0x28
00000030`975fce00 00007ff6`36e59368 : 000001dd`fb81cf60 00007ff6`3752bb56 000001dd`fb820fc8 00007ff6`38546201 : testeth!boost::asio::detail::win_iocp_io_service::shutdown_service+0x120
00000030`975fce80 00007ff6`36e5f237 : 000001dd`fb9a9fc0 00007ff6`36e57196 000001dd`909c2c00 000001dd`909c2c00 : testeth!boost::asio::detail::service_registry::~service_registry+0x38
00000030`975fced0 00007ff6`36e58f78 : 000001dd`fb9a9fc0 00007ff6`00000001 000001dd`909c2d48 00007ff6`36e57214 : testeth!boost::asio::detail::service_registry::`scalar deleting destructor'+0x17
00000030`975fcf00 00007ff6`374f271e : 000001dd`909c2be8 00007fff`00000048 ffffffff`fffffffe 00007ff6`36abb670 : testeth!boost::asio::io_service::~io_service+0x38
00000030`975fcf50 00007ff6`373a6cb4 : 000001dd`909c2a70 00007ff8`0630c7eb 000001dd`fab70000 00007ff6`36ab8d01 : testeth!dev::p2p::Host::~Host+0x19e
00000030`975fcf90 00007ff6`373a8167 : 000001dd`909c2a40 00000000`00000000 00000000`00000348 00007ff6`36d890e6 : testeth!dev::WebThreeDirect::~WebThreeDirect+0x44
00000030`975fcfd0 00007ff6`36d88fde : 000001dd`909c2a40 00007ff6`00000001 00000030`975fd038 00007ff6`36d89326 : testeth!dev::WebThreeDirect::`scalar deleting destructor'+0x17
00000030`975fd000 00007ff6`36d88851 : 00000030`975fd7e0 000001dd`909c2a40 00000030`975fd7e8 00000000`00000030 : testeth!std::default_delete<dev::WebThreeDirect>::operator()+0x3e
00000030`975fd050 00007ff6`36f15d9f : 00000030`975fd7e0 00000030`975fd0c0 ffffffff`fffffffe 00000030`975fd0d0 : testeth!std::unique_ptr<dev::WebThreeDirect,std::default_delete<dev::WebThreeDirect> >::~unique_ptr<dev::WebThreeDirect,std::default_delete<dev::WebThreeDirect> >+0x41
00000030`975fd090 00007ff6`36f16303 : 00000030`975fd7e0 000001dd`840b3ff2 00007ff6`37e223a0 00007ff6`37e223ed : testeth!`anonymous namespace'::JsonRpcFixture::~JsonRpcFixture+0xbf
00000030`975fd0c0 00007ff6`36ebbeee : 00000030`975fd7e0 00000030`975fd278 00000000`000000d3 00000030`975fd268 : testeth!JsonRpcSuite::jsonrpc_isListening::~jsonrpc_isListening+0x13
00000030`975fd0f0 00007ff6`369de50a : 000001dd`fab70000 00007ff8`0a2cd564 000001dd`fab70000 00000000`00000000 : testeth!JsonRpcSuite::jsonrpc_isListening_invoker+0x51e
00000030`975fdb30 00007ff6`36bc7042 : 000001dd`84be4fe0 00000030`975fdc10 00000000`0000001d 00007ff8`0a277056 : testeth!boost::detail::function::void_function_invoker0<void (__cdecl*)(void),void>::invoke+0x1a
00000030`975fdb70 00007ff6`36bc7ca6 : 000001dd`84be4fd8 00007ff8`0a25d987 0000001d`00000000 00000030`975ffad0 : testeth!boost::function0<void>::operator()+0x72
00000030`975fdbe0 00007ff6`36c05303 : 00000030`975fe278 00007ff8`0a160000 0000ddc4`001ed000 000001dd`f9370000 : testeth!boost::detail::forward::operator()+0x16
00000030`975fdc10 00007ff6`36bc6fa2 : 00000030`975fe278 00007ff8`0a2cd564 00000030`975fe260 00007ff8`0a25dd5d : testeth!boost::detail::function::function_obj_invoker0<boost::detail::forward,int>::invoke+0x33
00000030`975fdc50 00007ff6`36b96c33 : 00000030`975fe270 00007ff8`06323889 00000000`00000000 00000000`00000000 : testeth!boost::function0<int>::operator()+0x72
00000030`975fdcc0 00007ff6`36b3c91d : 00007ff6`385c9c88 00000030`975fe270 00000030`975ff4c8 00000000`00000000 : testeth!boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>,boost::function<int __cdecl(void)> >+0x53
00000030`975fdd00 00007ff6`36b3c73d : 00007ff6`385c9c78 00000030`975fe270 00000000`00000004 00007ff6`36bc7ad2 : testeth!boost::execution_monitor::catch_signals+0xbd
00000030`975fddb0 00007ff6`36b3c835 : 00007ff6`385c9c78 00000030`975fe270 00007ff6`00000000 00007ff6`36ba05fc : testeth!boost::execution_monitor::execute+0x9d
00000030`975fe240 00007ff6`36b40134 : 00007ff6`385c9c78 000001dd`84be4fd8 00007ff6`385c9a80 00007ff6`36bd42b6 : testeth!boost::execution_monitor::vexecute+0x55
00000030`975fe2b0 00007ff6`36bfe0a8 : 00007ff6`385c9c78 000001dd`84be4fd8 000001dd`00000000 00007ff6`368e8500 : testeth!boost::unit_test::unit_test_monitor_t::execute_and_translate+0x174
00000030`975fe3b0 00007ff6`36bfdbf0 : 00007ff6`385c83f0 00000030`000101aa 00000000`00000000 00000000`00000000 : testeth!boost::unit_test::framework::state::execute_test_tree+0x1368
00000030`975fe860 00007ff6`36bfdbf0 : 00007ff6`385c83f0 00000030`00000038 00000000`00000000 00000000`00000000 : testeth!boost::unit_test::framework::state::execute_test_tree+0xeb0
00000030`975fed10 00007ff6`36b360a5 : 00007ff6`385c83f0 00007ff6`00000001 000001dd`00000000 00000000`00000000 : testeth!boost::unit_test::framework::state::execute_test_tree+0xeb0
00000030`975ff1c0 00007ff6`36b3ac9f : 00000030`00000001 00007ff6`385c5c01 000001dd`837f8f80 00000000`00000000 : testeth!boost::unit_test::framework::run+0xa95
00000030`975ff630 00007ff6`36b4c6b3 : 00007ff6`36b53bc0 00000030`00000006 000001dd`837f8f80 00007fff`c9194ae2 : testeth!boost::unit_test::unit_test_main+0x42f
00000030`975ff8b0 00007ff6`37a27171 : 00007ff6`00000006 000001dd`837f8f80 00000000`00000000 00000000`00000000 : testeth!main+0x4e3
00000030`975ffa60 00007ff8`093e81f4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : testeth!__scrt_common_main_seh+0x11d
00000030`975ffaa0 00007ff8`0a1ca251 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : KERNEL32!BaseThreadInitThunk+0x14
00000030`975ffad0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

I think we can address this as follows:

  • Don't capture a shared_ptr to the capability in the capability handlers. I don't think this makes sense to do anyway, since the host manages the capability lifetime and the capabilities have a lot of dependencies on host functionality (e.g. network communication), so it doesn't make sense for capabilities to have different lifetimes than the host. Additionally, it's safe to capture a raw pointer instead because we cancel the capability timers on shutdown and check the timer status in the capability handler (the timer is captured via shared_ptr) before calling into the capability background work function, so if the capability has been destroyed we exit the handler early.
  • Poll the IO service once after we've posted the capability timer cancellations. I think this should be enough to process the cancellations and the cancelled capability handlers, which would stop their background work loops. We need to do this because we cancel capabilities after we've stopped the io service, and we don't do any polling unless there are peers or incoming connections, which means that the cancellations would otherwise never be processed and would just sit in the handler queue.

A use-after-free can occur if capability handlers are still in the boost ioservice handler queue when the Host is destroyed: the handlers capture a shared_ptr to the capability, and destroying the capability results in Host::forEachPeer() being called, which iterates over sessions that have already been destroyed (since the Host's session map - m_sessions - is declared after the io service in Host.h).

I think the root of the issue is that the capability handlers capture a shared_ptr to the capability. This doesn't make sense, since the host owns the capability lifetime and manages the resources the capabilities require (e.g. network sockets/sessions), so the handler shouldn't keep capabilities alive longer than the host. As such, I've changed the shared_ptr to a raw pointer. This is safe, with no possibility of a UAF, because we check the capability timer's expiration time before calling into the capability's background work function (the timer is captured via shared_ptr, so it's guaranteed to be alive) and the capability timers are cancelled on host shutdown, so we avoid calling into a capability after it has been destroyed.
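
Under those assumptions, the work-loop handler could look roughly like this (raw capability pointer, shared_ptr timer, c_steadyClockMin sentinel; the signature is illustrative rather than the exact merged code):

// Illustrative only. The shared_ptr keeps the timer alive for the handler; the
// capability is a raw pointer because the Host owns its lifetime.
void Host::scheduleCapabilityWorkLoop(
    CapabilityFace* _cap, std::shared_ptr<ba::steady_timer> _timer)
{
    _timer->expires_after(_cap->backgroundWorkInterval());
    _timer->async_wait([this, _cap, _timer](boost::system::error_code) {
        // The sentinel expiry set during shutdown means "stop": don't touch the
        // capability, since it may already have been destroyed.
        if (_timer->expiry() == c_steadyClockMin)
            return;
        _cap->doBackgroundWork();
        scheduleCapabilityWorkLoop(_cap, _timer);  // next iteration
    });
}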

Another issue is that the timer cancellations aren't processed on shutdown if there aren't any peers or incoming connections, because the capabilities are cancelled (via lambdas posted to the network thread) after the io service has stopped, and Aleth only polls the io service to clear out peers / connections. To address this, I've added an io service poll call after stopCapabilities() in Host::doneWorking() to ensure that the cancellations are processed.
@halfalicious
Contributor Author

cc @gumb0 / @chfast

@gumb0
Member

gumb0 commented Mar 25, 2019

We need to do this because we cancel capabilities after we've stopped the io service and we don't do any polling unless there are peers or incoming connections, which means that the cancellations wouldn't be processed and just sit in the handler queue.

I didn't quite get why this is needed. What happens if the cancellations aren't processed and sit in the handler queue? If we don't poll io_service at this point anymore, it shouldn't affect anything?

libp2p/Host.h
@@ -307,8 +307,7 @@ class Host: public Worker
     void startCapabilities();

     /// Schedule's a capability's work loop on the network thread
-    void scheduleCapabilityWorkLoop(
-        std::shared_ptr<CapabilityFace> _cap, std::shared_ptr<ba::steady_timer> _timer);
+    void scheduleCapabilityWorkLoop(CapabilityFace* _cap, std::shared_ptr<ba::steady_timer> _timer);
Member

Better to pass it by reference than by raw pointer.

@halfalicious
Contributor Author

halfalicious commented Mar 25, 2019

We need to do this because we cancel capabilities after we've stopped the io service and we don't do any polling unless there are peers or incoming connections, which means that the cancellations wouldn't be processed and just sit in the handler queue.

I didn't quite get why this is needed. What happens if the cancellations aren't processed and sit in the handler queue? If we don't poll io_service at this point anymore, it shouldn't affect anything?

@gumb0 : We don’t need to cancel the capability background work loops to address the UAF as long as we capture a raw capability pointer rather than a smart pointer in the background work loop handler. I don’t know if shutting down the host without canceling the work loops would cause other issues though (I don’t know enough about syncing and how the capabilities are involved). What are your thoughts, would this be safe to do?

@gumb0
Member

gumb0 commented Mar 25, 2019

But cancelling the work loops won't affect anything unless m_ioService.poll() or m_ioService.run() is called after that (i.e. unless the work loops themselves are called). So it seems that that change won't affect anything.

I don’t know if shutting down the host without canceling the work loops would cause other issues though (I don’t know enough about syncing and how the capabilities are involved).

This cancelling now happens inside Host; capabilities shouldn't be concerned with this "cancelling" event.
(Previously there was the onStopping callback, but you removed it because it's not really useful.)

@halfalicious
Contributor Author

But cancelling the work loops won't affect anything unless m_ioService.poll() or m_ioService.run() is called after that (i.e. unless the work loops themselves are called). So it seems that that change won't affect anything.

But the IO service can still be polled after we stop the capabilities if we are disconnecting peers or pending handshakes:

aleth/libp2p/Host.cpp

Lines 220 to 251 in f427584

// disconnect pending handshake, before peers, as a handshake may create a peer
for (unsigned n = 0;; n = 0)
{
    DEV_GUARDED(x_connecting)
    for (auto const& i: m_connecting)
        if (auto h = i.lock())
        {
            h->cancel();
            n++;
        }
    if (!n)
        break;
    m_ioService.poll();
}
// disconnect peers
for (unsigned n = 0;; n = 0)
{
    DEV_RECURSIVE_GUARDED(x_sessions)
    for (auto i: m_sessions)
        if (auto p = i.second.lock())
            if (p->isConnected())
            {
                p->disconnect(ClientQuit);
                n++;
            }
    if (!n)
        break;
    // poll so that peers send out disconnect packets
    m_ioService.poll();
}

Also capture the passed capability by reference in the capability timer lambda.
@gumb0
Member

gumb0 commented Mar 26, 2019

But the IO service can still be polled after we stop the capabilities if we are disconnecting peers or pending handshakes

Right, in case it is polled - timers will be cancelled anyway by these poll() calls. In case it is not polled - it doesn't matter.

@gumb0
Member

gumb0 commented Mar 26, 2019

I'll merge it now; please add a changelog item and remove the m_ioService.poll() call in a separate PR if you agree with my reasoning.

@gumb0 gumb0 merged commit 3c6c4e5 into master Mar 26, 2019
@gumb0 gumb0 deleted the capability-timers branch March 26, 2019 10:31
@halfalicious
Contributor Author

halfalicious commented Mar 26, 2019

I'll merge it now; please add a changelog item and remove the m_ioService.poll() call in a separate PR if you agree with my reasoning.

@gumb0 : Ah, I see what you’re saying - we only need to cancel the timers if we are going to schedule more capability work, and more capability work is only scheduled if poll is called while disconnecting pending handshakes / peers. If we post the timer cancellations before we disconnect peers / handshakes, the polling will cancel the timers. So there’s no need to explicitly call poll after calling stopCapabilities.

My confusion was because I wasn't sure if we needed to cancel the timers at all.

I’ll remove the poll call and update the change log in a new PR.

halfalicious added a commit that referenced this pull request Mar 27, 2019
Add message for removal of ioservice polling and a message for the changes which fixed the capability timer race condition (#5523)
chfast pushed a commit that referenced this pull request Mar 27, 2019
Reference removal of ioservice polling and a message for the changes which fixed the capability timer race condition (#5523)
Development

Successfully merging this pull request may close these issues.

Ethereum capability work loop can be cancelled prematurely
3 participants