callback support for checktime timer expiry #7803

spoonincode · 2019-08-26T15:05:32Z

Change Description

When a transaction's checktimer timer expires, call a user specified function. This can allow potential future wasm runtimes to be notified of expiry in a very low overhead manner (they don't need to poll expiry).

This PR tries to KISS and just adds support for a single callback by storing two void*s in the platform_timer. Since the callback could be called within a signal handling context I didn't want to complicate manners with std::functions or anything that required resizing to support more than one.

Usage could be something like this:

super_fast_wasm_runtime::execute_action(apply_context& context, ...) {
   context.trx_context.transaction_timer.set_expiry_callback([](void* user) {
      //this code could be called from signal handling context!
      super_fast_wasm_runtime* self = (super_fast_wasm_runtime*)user;
      self->stop_wasm_execution();
   }, this);
   context.trx_context.checktime(); //catch any expiration that might have occurred before setting up callback

   //when wasm code is done, disable callback
   auto reset_expiry_cb = fc::make_scoped_exit([&tt=context.trx_context.transaction_timer](){tt.set_expiry_callback(nullptr, nullptr);});

   run_wasm();
}

You will notice what technically amounts to a very slight race condition here. If the timer expires at the very same time that the callback is attempting to be disabled (by calling set_expiry_callback(nullptr, nullptr)) it's conceivable that the callback will still be called even though it has been disabled at that point.

It's not clear what the best way is to solve that problem. The expiry code being run from a signal handling context ties my hands on many typical solutions. Any thoughts on this are welcome.

However, this issue isn't a problem for potential new wasm runtime given the current state of single threaded execution -- slight late execution of the callback even after disabling it doesn't cause a problem (as long it isn't so late that it is called after the next transaction's wasm starts executing, which would be absurd in practice). In fact, the only reason potential new wasm runtime needs to disable the callback at all is just to make platform_timer's assert() happy that no one is trying to pile up more than one callback.

Consensus Changes

Consensus Changes

API Changes

API Changes

Documentation Additions

Documentation Additions

b1bart · 2019-08-27T11:47:19Z

libraries/chain/include/eosio/chain/platform_timer.hpp

+      result in an exception. Setting to nullptr is allowed as a "no callback" hint */
+   void set_expiry_callback(void(*func)(void*), void* user) {
+      EOS_ASSERT(!(func && _expiration_callback), misc_exception, "Setting a platform_timer callback when one already exists");
+      _expiration_callback = func;


I'm worried that there is more than just the one "Race" that you mentioned in the PR here. Given FuncT1 DataT1 FuncT2 DataT2 (where one pair may be nullptr) Depending on how the compiler orders these instructions on both set_expiry_callback and call_expiration_callback you can at least get any of the following cases:

if (FuncT1) FuncT1(DataT1);
if (FuncT1) FuncT2(DataT1)
if (FuncT1) FuncT1(DataT2)
if (FuncT1) FuncT2(DataT2)
if (FuncT2) FuncT2(DataT1)
if (FuncT2) FuncT2(DataT2)

I think I understand in that in the current use case where T2 is practically going to be nullptr that boils down to:
if (FuncT1) FuncT1(DataT1);
if (FuncT1) nullptr(DataT1)
if (FuncT1) FuncT1(nullptr)
if (FuncT1) nullptr(nullptr)
if (nullptr)...
if (nullptr)...

so, 2 out of 6 of those are crashes, 1 out of 6 calls FuncT1 in an unexpected way. 1 out of 6 calls FuncT1 "spuriously" (the noted race in the PR) and the remaining 2 operate as expected.

We can improve this dramatically if we can guarantee that the pair is atomically set and queried. Does the signal handling context allow us to use atomic primitives like CAS?

Or can we double buffer the function+data pair with some sort of data barrier and then "swap" the active buffer with a single index update?

spoonincode · 2019-08-27T15:23:33Z

Good points. We could probably mitigate much of the reordering by simply asking the compiler not to. But, since we're dealing with two pointers here there will still be the possibility of mixing something up.

CAS should be safe here but I'm curious what construct you have in mind. Anything that spins, like a mutex, could result in a deadlock if the signal is delivered to the same thread that is trying to modify the pointers. This is potentially solvable by delivering the timer signal to a dedicated thread. But doing that is an abstraction nightmare -- in POSIX threads inherit the signal mask of the creating thread. This means we'd have to mask off the timer signal early in main() before net threads, chain threads, etc get created.

Another possibility here is to do an atomic read/write of 16 bytes probably via a CAS of 16 bytes. It's not clear to be how easily this is done in gcc/clang. For example, on x86_64 this is the CMPXCHG16B instruction. But on ARM8??

I tried compiling the following

   std::cout << std::atomic<__int128>::is_always_lock_free << std::endl;

This returns 1 and indeed if the -mcx16 option is used CMPXCHG16B is used on x86_64 when loading/storing to an atomic this way. However on ARM8 this is returning 0 for me and I'm not sure if the instruction set supports 128-bit atomics.

b1bart · 2019-08-27T15:36:09Z

yeah, I'm not sure what the CAS construct would be that is safe. I think the safe "baseline" implementation may be double buffering as it should be pretty easy (maybe slow) to guarantee atomic updates of the pair and to put a memory barrier in on platforms that need it (so that the updates to the func and data precede the update to the index).

If we want to get fancy on platforms that have (pointer-bits * 2) bit atomic writes then, I'm okay with it. Having a dirt simple fallback is worthwhile.

… callback pointers

spoonincode · 2019-08-27T19:46:15Z

I took a stab at "double buffering"; which can be simplified because the code never lets you overwrite the callback with a different callback. This code still probably gets an F in a concurrency textbook but I believe it no longer has the defect where only 1 of the 2 variables may be set or weirdness due to reordering.

I had to use __sync constructs because std::atomic<> was always resolving to a function call for me 😠

b1bart · 2019-08-28T11:28:56Z

libraries/chain/include/eosio/chain/platform_timer.hpp

@@ -32,11 +33,17 @@ struct platform_timer {
   fc::fwd<impl,fwd_size> my;

   void call_expiration_callback() {
-      if(_expiration_callback)
-         _expiration_callback(_expiration_callback_data);
+      if(__atomic_load_n(&_callback_enabled, __ATOMIC_CONSUME)) {


I doubt we'd ever see this as problematic given our use case but, as these are fences, not locks there is no guarantee that between line 36 and 37 and 38, we don't see different states of both _expiration_callback and _expiration_callback_data. Consider that if a thread "pauses" after each line and another calls the appropriate permutations of set_expiry_callback we could get the same set of problem cases as before and still have _callback_enabled reset by line 39.

I think something like a 4 stage atomic enumerated int (unset, writing, set, reading) with:

set_expiry_callback spinlock until it gets a definitive "set" and fails or can CAS an "unset" for a "writing" and upates the values with a memory fence atomic write to "set" at the end.

a specialization clear_expiry_callback spinlock until it gets a definitive "unset" and fails or can CAS a "set" for an "unset"

and a call_expiry_callback that doesnt spinlock it opportunistically tries once to CAS a "set" for a "reading" and then pull out the values, atomically write back to "set" with a fence and call the callback. Any other result from the CAS is treated as "unset".

reduces our possible damage to

the race between "releasing" the spinlock in call_expiry_callback and the consistent but perhaps stale callback being called. We could alleviate this by calling the callback holding the spin lock but that may very well deadlock.

the race between two call_expiry_callback invocations that cause one to erroneously fail as it treats a conflicting "reading" value as an "unset" and therefore does nothing.

there is no guarantee that between line 36 and 37 and 38 ..

Yeah it's a variation of an ABA problem; but it would require someone to do set_callback(nullptr), set_callback(0x...) back to back right? Something that is pretty far fetched for our use case (it would effectively be an action ending and starting without anything in between).

I have no qualms about hardening this further; as the complexity increases it starts to smell more like something that should be improved with a 128-bit CAS for at least x86_64 though

something that should be improved with a 128-bit CAS for at least x86_64 though

As long as we have a base implementation that is safe, I support platforms dabbling with faster versions (which should also be safe) where available. In this case we just need 128bit atomic load/store right? It would effectively become last-write-wins but with the concession that we are only trying to guarantee consistency of callback/data pointers thats "safe"

Right, just need an atomic 128bit load/store to meet the same safety criteria it seems. Which surprisingly (to me) on x86_64 only means CMPXCHG16B -- from what I can from docs even wide SSE/AVX load/stores are not guaranteed to be atomic.

anyways, it's worth benchmarking the impact of the base impl before optimizing this immediately.

once again, I've failed at PR review, you were right to need a 128 bit CAS as the result of the comparison allows you to maintain the expectation that you cannot overwrite an existing callback (that throws)

b1bart · 2019-08-28T11:29:45Z

libraries/chain/include/eosio/chain/platform_timer.hpp

@@ -17,11 +17,12 @@ struct platform_timer {

   /* Sets a callback for when timer expires. Be aware this could might fire from a signal handling context and/or
      on any particular thread. Only a single callback can be registered at once; trying to register more will
-      result in an exception. Setting to nullptr is allowed as a "no callback" hint */
+      result in an exception. Setting to nullptr disables any current set callback */
   void set_expiry_callback(void(*func)(void*), void* user) {


to remain consistent with the name of the member variable and the call_expiration_callback method this should probably be renamed to set_expiration_callback

heifner · 2019-08-28T14:05:26Z

libraries/chain/include/eosio/chain/platform_timer.hpp

+   bool _callback_enabled = false;
+   static_assert(__atomic_always_lock_free(sizeof(_callback_enabled), 0), "eosio requires an always lock free bool type");
+   void(*_expiration_callback)(void*);
+   void* _expiration_callback_data;
 };


I don't see a "carries a dependency" relationship here between _callback_enabled and _expiration_callback so __ATOMIC_CONSUME will not work correctly. Did you mean __ATOMIC_ACQUIRE ? With atomic acquire this should work just fine and the second __atomic_load_n is not needed.

This https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html indicates __ATOMIC_CONSUME is implemented as __ATOMIC_ACQUIRE anyway.

https://en.cppreference.com/w/cpp/atomic/memory_order

The specification of release-consume ordering is being revised, and the use of memory_order_consume is temporarily discouraged. (since C++17)

b1bart

I think I sent you down a rabbit hole and overcomplicated things with the 4-state suggestion. I apologize. It seems like a simple 2-state lock would be simpler to reason about and likely safer.

I like that the "read" in call_expiration_callback is optimistic and non-blocking/spinning.

the "write" in set_expiration_callback should probably just spin until it can CAS the state from "unlocked" to "locked", do all of the checks and possibly manipulation, store the state back to "unlocked" and then throw any errors it would have thrown in the middle. We get no benefit from the read/write paradigm I suggested and its muddied the waters significantly.

I plead guilty to the crimes of over-engineering and solutioning in PRs ☹️

b1bart · 2019-08-29T11:23:16Z

libraries/chain/include/eosio/chain/platform_timer.hpp

+      callback_state current_state;
+      while((current_state = _callback_state.load(std::memory_order_acquire)) == READING) {} //wait for call_expiration_callback() to be done with it
+      if(func)
+         EOS_ASSERT(current_state == UNSET, misc_exception, "Setting a platform_timer callback when one already exists");


I think this assert will sometimes fire and sometimes not, such that you cannot make assumptions in the remaining code that _callback_state is eitherSET or func is nullptr as the _callback_state can change between this line and later lines where it is modified.

b1bart · 2019-08-29T11:26:09Z

libraries/chain/include/eosio/chain/platform_timer.hpp

+         return;
+
+      //prepare for loading in new values by moving from UNSET->WRITING
+      while(!atomic_compare_exchange_strong(&_callback_state, &current_state, WRITING))


I'm having trouble working out the control flow in here. There are ideally two expectations here: we are "clearing" the callback and the _callback_state is SET or we are "setting" the callback and _callback_state is UNSET`

After the first iteration of the loop, we change our expected _callback_state to UNSET even if we are going to clear. The code after is safe but wont this spinlock spin until something else clears the callback if it misses its first time through the loop (say if a callback fired at a bad time)?

spoonincode · 2019-08-29T12:58:03Z

simpler to reason about and likely safer

At this point it's probably easier & safer to just use the threaded approach. Since we could safely use a real lock it would allow solving all problems (even prevent late firing of the callback).

spoonincode · 2019-08-29T18:48:15Z

Unfortunately using a dedicated thread for signal handling causes high variation in timer quality. Probably due to a dormant thread needing to be completely woke up and scheduled when the timer expires. No go here.

callback support for the checktime timer

357e30c

b1bart suggested changes Aug 27, 2019

View reviewed changes

double check if a timer callback is active after double buffering the…

c072a01

… callback pointers

b1bart suggested changes Aug 28, 2019

View reviewed changes

heifner reviewed Aug 28, 2019

View reviewed changes

refactor platform timer callback "thread" safety further

f959c5a

b1bart suggested changes Aug 29, 2019

View reviewed changes

further simplification to timer callbacks

c79c769

b1bart approved these changes Aug 30, 2019

View reviewed changes

spoonincode merged commit 53e19b9 into develop Aug 30, 2019

spoonincode deleted the checktime_timer_callback branch August 30, 2019 14:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

callback support for checktime timer expiry #7803

callback support for checktime timer expiry #7803

spoonincode commented Aug 26, 2019

b1bart Aug 27, 2019 •

edited

Loading

spoonincode commented Aug 27, 2019

b1bart commented Aug 27, 2019

spoonincode commented Aug 27, 2019

b1bart Aug 28, 2019

spoonincode Aug 28, 2019

b1bart Aug 28, 2019

spoonincode Aug 28, 2019

b1bart Aug 29, 2019

b1bart Aug 28, 2019

heifner Aug 28, 2019 •

edited

Loading

heifner Aug 28, 2019

b1bart left a comment

b1bart Aug 29, 2019

b1bart Aug 29, 2019

spoonincode commented Aug 29, 2019

spoonincode commented Aug 29, 2019

callback support for checktime timer expiry #7803

callback support for checktime timer expiry #7803

Conversation

spoonincode commented Aug 26, 2019

Change Description

Consensus Changes

API Changes

Documentation Additions

b1bart Aug 27, 2019 • edited Loading

Choose a reason for hiding this comment

spoonincode commented Aug 27, 2019

b1bart commented Aug 27, 2019

spoonincode commented Aug 27, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

heifner Aug 28, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

b1bart left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spoonincode commented Aug 29, 2019

spoonincode commented Aug 29, 2019

b1bart Aug 27, 2019 •

edited

Loading

heifner Aug 28, 2019 •

edited

Loading