Restart long-running transaction with OC #1027

heifner · 2024-11-13T13:55:29Z

Restart long-running transaction with EOS VM OC when its compile completes.

Pause the billing timer when calling the code cache's get_descriptor_for_code, since it can do extra processing unrelated to the specific action that is running.
Kick off new OC compiles when OC compiles complete instead of waiting for a new call to get_descriptor_for_code.
Update transaction_context _deadline to match the transaction timer. This is a bit of cleanup, and it also makes sure resuming the timer works if we decide to put a limit on apply_block in the future.
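
The billing-timer pause described above can be pictured with a small sketch. This is not the Spring `platform_timer` or `transaction_context` code; `billing_timer` and its methods are hypothetical names, and time points are passed in explicitly only so the behaviour is easy to follow.

```cpp
#include <cassert>
#include <chrono>

using usec = std::chrono::microseconds;

// Hypothetical billing timer: time spent between pause() and resume()
// is excluded from the billed total.
class billing_timer {
public:
   void start(usec now) { _start = now; _paused_total = usec{0}; _paused = false; }

   // Stop the clock, e.g. just before a code cache lookup.
   void pause(usec now) {
      if (!_paused) { _pause_start = now; _paused = true; }
   }

   // Restart the clock once the unrelated work is done.
   void resume(usec now) {
      if (_paused) { _paused_total += now - _pause_start; _paused = false; }
   }

   // Time charged to the action so far.
   usec billed(usec now) const {
      const usec end = _paused ? _pause_start : now;
      assert(end >= _start);
      return end - _start - _paused_total;
   }

private:
   usec _start{0}, _pause_start{0}, _paused_total{0};
   bool _paused = false;
};
```

With this shape, work done inside the paused window (such as processing compile results that belong to other contracts) never counts against the running action.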

Resolves #986

- Add callback for when async compile completes
- Use to kill action if it is still running in non-oc
- Kick off another queued compile when compile finishes
- Add executing_action_id and queued_time to oc compile ipc_protocol
- Pause billing timer when looking up and processing oc code
- Move wasm_interface apply function to wasm_interface_impl

libraries/chain/include/eosio/chain/exceptions.hpp

libraries/chain/include/eosio/chain/transaction_context.hpp

libraries/chain/include/eosio/chain/wasm_interface.hpp

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

libraries/chain/transaction_context.cpp

spoonincode · 2024-11-20T20:21:36Z

libraries/chain/include/eosio/chain/webassembly/eos-vm-oc/code_cache.hpp


      io_context _ctx;
-      local::datagram_protocol::socket _compile_monitor_write_socket{_ctx};
+      local::datagram_protocol::socket _compile_monitor_write_socket{_ctx}; // protected by _mtx for async


I'm not sure this really needs to be guarded by a mutex. What is the thinking around that? Regardless, it doesn't seem like it would significantly remove many of the lock()s to remove this constraint.

It is called from both the main application thread and the async thread.
For example:
main thread:
https://github.com/AntelopeIO/spring/blob/GH-1039-interrupt-block-validation/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp#L224-L224
async thread:
https://github.com/AntelopeIO/spring/blob/GH-1039-interrupt-block-validation/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp#L109-L109

I guess in this case it's technically true because native_handle() isn't defined as being thread safe (though clearly in the implementations it is, as it's just an int or HANDLE). Otherwise afaict everything else seems safe. But it doesn't seem like it really changes much to just put it behind the mutex too.

spoonincode · 2024-11-20T20:25:01Z

libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp

+
+      _compile_complete_func(_ctx, msg.executing_action_id, msg.queued_time);
+
+      process_queued_compiles();


I understand this was changed so that the compile queue can be worked to zero while an action is running. I haven't been able to think of any negative side effects to this during nominal operation. But for what the PR is trying to protect against, it does have a weakness: as cache memory fills up, there will be no attempt to free space until the next action. That doesn't seem fatal, though, since even a toofull message will interrupt execution, which will bring the loop around a second time to evict some of the cache before trying to recompile the current code.
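
To make the flow in this thread concrete, here is a minimal single-threaded sketch of "notify, then start the next queued compile". It is not the Spring code cache: `compile_queue`, `request`, and `compile_finished` are hypothetical names, only one compile is assumed to be in flight at a time, and cache eviction (the weakness raised above) is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the code cache's compile bookkeeping.
class compile_queue {
public:
   using complete_fn = std::function<void(const std::string& code_id)>;

   explicit compile_queue(complete_fn on_complete) : _on_complete(std::move(on_complete)) {}

   // Queue a compile; it starts right away if nothing is in flight.
   void request(std::string code_id) {
      _queued.push_back(std::move(code_id));
      start_next_if_idle();
   }

   // Called when the compiler reports a result for the in-flight compile.
   void compile_finished() {
      assert(_in_flight.has_value());
      const std::string done = std::move(*_in_flight);
      _in_flight.reset();
      _on_complete(done);    // lets the caller interrupt a still-running action
      start_next_if_idle();  // drain the queue without waiting for the next lookup
   }

   bool idle() const { return !_in_flight.has_value(); }
   std::size_t queued() const { return _queued.size(); }

private:
   void start_next_if_idle() {
      if (_in_flight || _queued.empty()) return;
      _in_flight = std::move(_queued.front());
      _queued.pop_front();
   }

   complete_fn                _on_complete;
   std::deque<std::string>    _queued;
   std::optional<std::string> _in_flight;
};
```

The point of calling `start_next_if_idle()` from the completion path is that the queue keeps draining while a long action is still executing.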

spoonincode · 2024-11-20T20:26:31Z

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

+#ifdef EOSIO_EOS_VM_OC_RUNTIME_ENABLED
+      // called from async thread
+      void async_compile_complete(boost::asio::io_context& ctx, uint64_t exec_action_id, fc::time_point queued_time) {
+         if (exec_action_id == executing_action_id) { // is action still executing?


This doesn't seem sufficient to guard against what the PR is aiming to prevent, if I'm understanding it right. It only guards against the first action that uses a code hash. For example, the first action could simply be a quick noop action, and the second a long action. This long action wouldn't be interrupted when the compilation completes, because the completion would carry the "action_id" of the first quick action.

It really seems like we need to compare against the running code hash. I know that's undesirable since we can't have a lock-free 32-byte atomic, but so many other locks have been sprayed all around the code that maybe another one is not a big deal.

Alternatively, to maintain a lock-free atomic, we could consider checking only the first (or last) 16 bytes of the code hash.

Nice catch! I'll give it some thought.
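
The "compare only part of the hash" alternative could look roughly like the sketch below. It is an illustration, not what was merged: it uses the first 8 bytes rather than the 16 suggested above, because a 64-bit atomic is lock-free on common 64-bit targets without extra libraries. `code_hash`, `hash_prefix`, and `executing_code_tracker` are hypothetical names.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>

using code_hash = std::array<unsigned char, 32>; // stand-in for a sha256 digest

// Take the first 8 bytes of the hash as a compact identifier.
inline std::uint64_t hash_prefix(const code_hash& h) {
   std::uint64_t v = 0;
   std::memcpy(&v, h.data(), sizeof(v));
   return v;
}

struct executing_code_tracker {
   std::atomic<std::uint64_t> prefix{0};

   // Main thread: record which code is about to run.
   void set_executing(const code_hash& h) {
      prefix.store(hash_prefix(h), std::memory_order_release);
   }

   // Async thread: is the code whose compile just finished still running?
   bool is_executing(const code_hash& h) const {
      return prefix.load(std::memory_order_acquire) == hash_prefix(h);
   }
};

// Assumption: a 64-bit target; this fails to compile where the assumption does not hold.
static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
              "expected a lock-free 64-bit atomic");
```

Unlike a per-action counter, the comparison still matches when a later action runs the same contract code, which is the case the review comment describes. A prefix match can in principle collide, so a false interrupt is possible but harmless in this scheme.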

ericpassmore · 2024-11-22T23:30:18Z

Note:start
category: System Stability
component: Internal
summary: Always interrupt and restart long-running transactions once OC compile completes.
Note:end

spoonincode · 2024-12-02T18:53:08Z

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

@@ -305,8 +306,9 @@ struct eosvmoc_tier {
      platform_timer& main_thread_timer;
      const wasm_interface::vm_type wasm_runtime_time;
      const wasm_interface::vm_oc_enable eosvmoc_tierup;
-      std::atomic<uint64_t> executing_action_id{1}; // monotonic increasing for each action apply
+      large_atomic<digest_type> executing_code_hash{};


What about sha256 prevents it from being used as an atomic<T>?

undefined reference to __atomic_store

That just means we're not linking to libatomic (which is kind of surprising given the maturity of spring that it was never needed previously). I thought maybe something else about sha256 was non-trivial and preventing its use with std::atomic.

Not sure if there is anything else that prevents it. Should we start linking with libatomic?
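
For context on this exchange: `std::atomic<T>` accepts any trivially copyable `T`, but a 32-byte object is generally not lock-free, so GCC and Clang emit calls such as `__atomic_store` that are resolved by libatomic. The diff does not show how `large_atomic` is implemented; the sketch below is one hypothetical way to get the same `load`/`store` interface with a mutex and no libatomic dependency. `guarded_value` is an invented name.

```cpp
#include <array>
#include <mutex>

// Hypothetical mutex-backed replacement for std::atomic<T> on large T.
template <typename T>
class guarded_value {
public:
   guarded_value() = default;
   explicit guarded_value(const T& v) : _value(v) {}

   // Copy the value out under the lock.
   T load() const {
      std::lock_guard<std::mutex> g(_mtx);
      return _value;
   }

   // Replace the value under the lock.
   void store(const T& v) {
      std::lock_guard<std::mutex> g(_mtx);
      _value = v;
   }

private:
   mutable std::mutex _mtx;
   T                  _value{};
};
```

The trade-off is a short critical section on each access in exchange for not depending on how the toolchain implements oversized atomics.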

spoonincode · 2024-12-02T19:27:19Z

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

+            auto elapsed = fc::time_point::now() - queued_time;
+            ilog("EOS VM OC tier up for ${id} compile complete ${t}ms", ("id", code_id)("t", elapsed.count()/1000));
+            auto expire_in = std::max(fc::microseconds(0), fc::milliseconds(500) - elapsed);
+            std::shared_ptr<boost::asio::steady_timer> timer = std::make_shared<boost::asio::steady_timer>(ctx);


Remind me why we need to do this timer stuff instead of just immediately interrupting?

No reason. I would actually be in favor of immediately interrupting. The issue called for only interrupting if running longer than 500ms.

I guess I see my confusion -- "running longer than 500ms" : is that really what is going on here? queued_time is when the compilation was started, not how long the transaction or action has been running.

True. Seems to me we could just always interrupt and make this a bit simpler.

I think maybe it's a good thing we don't always interrupt, otherwise maybe in a worst case pathological situation the block could take 2x the time to apply (it would seem very very hard to coax this to occur but maybe I haven't really thought it through enough)
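
The delay computed in the diff (`std::max(fc::microseconds(0), fc::milliseconds(500) - elapsed)`) can be restated with `std::chrono` to make the behaviour explicit. This is only a restatement under assumptions: `interrupt_delay` is a hypothetical helper, and, as noted in the thread, `elapsed` is measured from when the compile was queued, not from when the action started.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// How long to wait before interrupting, given the time elapsed since the
// compile was queued. Never negative: a compile that took longer than the
// threshold leads to an immediate interrupt.
inline std::chrono::microseconds interrupt_delay(
      std::chrono::microseconds elapsed_since_queued,
      std::chrono::microseconds threshold = std::chrono::milliseconds(500)) {
   assert(elapsed_since_queued >= std::chrono::microseconds(0));
   return std::max(std::chrono::microseconds(0), threshold - elapsed_since_queued);
}
```

For example, a compile that finishes 200 ms after being queued yields a 300 ms delay, while one that takes 700 ms yields a delay of zero.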

spoonincode · 2024-12-02T19:43:29Z

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

+         } catch (const interrupt_exception& e) {
+            if (allow_oc_interrupt && eos_vm_oc_compile_interrupt) {
+               ++eos_vm_oc_compile_interrupt_count;
+               wlog("EOS VM OC compile complete interrupt of ${r} <= ${a}::${act} code ${h}, interrupt #${c}",


Doesn't seem like a wlog? Nothing actionable by the user.

spoonincode · 2024-12-02T20:01:20Z

libraries/chain/include/eosio/chain/wasm_interface_private.hpp

+         executing_code_hash.store(code_hash);
+         try {
+            get_instantiated_module(code_hash, vm_type, vm_version, context.trx_context)->apply(context);
+         } catch (const interrupt_exception& e) {


Is it always true that manually triggering the timer will result in an interrupt_exception under all conditions? It feels like this can't be true with the current implementation of checktime().

Maybe I'm not understanding what you are getting at.
The main_thread_timer.expire_now() will result in an interrupt_exception if applying a block and the timer has not already been triggered. It should not be possible for the timer to have already fired when applying a block.
The code does need to be updated so that executing_code_hash is only set if allow_oc_interrupt is true; looks like I missed that when moving to using code hash.

I think maybe the last sentence resolves the path I was thinking through.
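
The catch-and-restart shape discussed in this thread can be sketched as follows. It is a simplified model, not the Spring implementation: `interrupt_error` stands in for `interrupt_exception`, `apply_with_restart` and `max_restarts` are hypothetical, and the real code restarts the whole transaction so that its side effects are undone.

```cpp
#include <functional>
#include <stdexcept>

// Stand-in for the chain's interrupt exception type.
struct interrupt_error : std::runtime_error {
   using std::runtime_error::runtime_error;
};

// Run `apply`, restarting it if it is interrupted and interrupts are allowed.
// Returns the number of restarts that were needed. The attempt index is passed
// to `apply` so the caller can pick a different runtime on the retry.
inline int apply_with_restart(const std::function<void(int attempt)>& apply,
                              bool allow_oc_interrupt,
                              int max_restarts = 1) {
   int restarts = 0;
   for (;;) {
      try {
         apply(restarts);
         return restarts;
      } catch (const interrupt_error&) {
         // Only swallow the interrupt when it is expected; otherwise propagate.
         if (!allow_oc_interrupt || restarts >= max_restarts)
            throw;
         ++restarts;
      }
   }
}
```

When `allow_oc_interrupt` is false the exception propagates unchanged, which mirrors the point above that the interrupt path should only be armed when it is permitted.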

linh2931 · 2024-12-03T23:32:24Z

libraries/chain/include/eosio/chain/exceptions.hpp

@@ -390,6 +390,8 @@ namespace eosio { namespace chain {
                                    3080010, "Read-only transaction eos-vm-oc compile permanent failure" )
      FC_DECLARE_DERIVED_EXCEPTION( interrupt_exception, resource_exhausted_exception,
                                    3080011, "Transaction interrupted by signal" )
+      FC_DECLARE_DERIVED_EXCEPTION( interrupt_oc_exception, resource_exhausted_exception,


Not a big deal, but interrupt_exception and interrupt_oc_exception are too similar. Maybe interrupt_by_signal and interrupt_by_oc_compile_exception?

No need to change.

heifner added 19 commits November 11, 2024 09:56

GH-985 Move push_trx to test_utils

29ac711

GH-985 Add ability to retroactively pause timer when resuming billing…

01c0ebd

… timer. Cleanup transaction_timer so that its deadline matches _deadline of transaction_context

GH-985 Add wasm interrupt test by oc compile

aabcfb4

GH-985 3 seconds not long enough on ci/cd

d19d718

GH-985 Remove accidental add of ${i}

b73fbc5

GH-985 Pass io_context to compile callback so it can use it to create…

49198a0

… a timer

GH-985 Add interrupt_oc_exception

2875ed9

GH-985 Restart transaction instead of action since transaction side e…

2b56de1

…ffects have to be undone.

GH-985 compiler needs explicit construction of microseconds since con…

39af386

…structor is explicit

GH-985 Update test now that id is only updated once per action call

6a126e7

GH-985 Increase to 10s for ci/cd

78761e6

GH-985 Try increase to 30s for ci/cd

c05bc48

GH-985 Remove unneeded chain::wasm_interface::test_disable_tierup.

76e2cc7

GH-985 Reset transaction_context on oc interrupt

d77773e

GH-985 Try a full minute

16c13e4

GH-985 Correctly monitor outstanding compiles

d8017a7

GH-985 Fix FC_ASSERT on exit when socket closed

5b2c116

GH-986 Simplify if as attempt_tierup should always be true if eos_vm_…

6ed7b18

…oc_compile_interrupt is true

heifner requested review from linh2931 and spoonincode November 13, 2024 13:56

heifner added the OCI Work exclusive to OCI team label Nov 13, 2024

linh2931 reviewed Nov 14, 2024

GH-986 cleanup

19f4d6c

linh2931 approved these changes Nov 15, 2024

spoonincode self-assigned this Nov 18, 2024

Merge branch 'main' into GH-986-retry-with-oc

8a772db

spoonincode reviewed Nov 20, 2024

spoonincode requested changes Nov 20, 2024

GH-986 Remove unneeded vm_version and just used code hash for unique …

be4ee80

…identifier.

heifner added 2 commits November 21, 2024 16:08

GH-986 Use code hash instead of executing action id to avoid issue wi…

966273e

…th action id not matching currently running contract code

Merge branch 'main' into GH-986-retry-with-oc

84f9886

Merge branch 'main' into GH-986-retry-with-oc

c122efa

spoonincode removed their assignment Dec 2, 2024

spoonincode reviewed Dec 2, 2024

heifner added 3 commits December 2, 2024 14:44

GH-986 Use dlog instead of wlog for interrupt message

cdbc761

GH-986 Fix logic for interrupt only when applying a block

c23f435

Merge branch 'GH-986-retry-with-oc' of https://github.com/AntelopeIO/…

651cf17

…spring into GH-986-retry-with-oc

spoonincode approved these changes Dec 3, 2024

linh2931 approved these changes Dec 3, 2024

Merge branch 'main' into GH-986-retry-with-oc

c482721

heifner merged commit 5be1d8b into main Dec 4, 2024
36 checks passed

heifner deleted the GH-986-retry-with-oc branch December 4, 2024 22:20
