-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart long-running transaction with OC #1027
Conversation
- Add callback for when async compile completes - Use to kill action if it is still running in non-oc - Kick off another queued compile when compile finishes - Add executing_action_id and queued_time to oc compile ipc_protocol - Pause billing timer when looking up and processing oc code - Move wasm_interface apply function to wasm_interface_impl
… timer. Cleanup transaction_timer so that its deadline matches _deadline of transaction_context
…ffects have to be undone.
…structor is explicit
…oc_compile_interrupt is true
|
||
io_context _ctx; | ||
local::datagram_protocol::socket _compile_monitor_write_socket{_ctx}; | ||
local::datagram_protocol::socket _compile_monitor_write_socket{_ctx}; // protected by _mtx for async |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this really needs to be guarded by a mutex. What is the thinking around that? Regardless, it doesn't seem like it would significantly remove many of the lock()
s to remove this constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called from both the main application thread and the async thread.
For example:
main thread:
https://github.com/AntelopeIO/spring/blob/GH-1039-interrupt-block-validation/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp#L224-L224
async thread:
https://github.com/AntelopeIO/spring/blob/GH-1039-interrupt-block-validation/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp#L109-L109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in this case it's technically true because native_handle()
isn't defined as being thread safe (though clearly in the implementations they are, as it's just an int
or HANDLE
). Otherwise afaict everything else seems safe. But it doesn't seem like it really changes much to just put it behind the mutex too
|
||
_compile_complete_func(_ctx, msg.executing_action_id, msg.queued_time); | ||
|
||
process_queued_compiles(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this was changed so that the compile queue can be worked to zero while an action is running. I haven't been able to think of any negative side effects to this during nominal operation. But for what the PR is tying to protect against, it does have a weakness where as cache memory fills up there will be no attempt to free space until the next action. But, that doesn't seem fatal since even a toofull
message will interrupt execution which will bring the loop around a second time to evict some of the cache and before trying to recompile the current code a second time.
#ifdef EOSIO_EOS_VM_OC_RUNTIME_ENABLED | ||
// called from async thread | ||
void async_compile_complete(boost::asio::io_context& ctx, uint64_t exec_action_id, fc::time_point queued_time) { | ||
if (exec_action_id == executing_action_id) { // is action still executing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem sufficient to guard against what the PR is aiming to if I'm understanding it right. It only guards against the first action that uses a code hash. For example, the first action could simply be a quick noop action, and then the second be a long action. This long action wouldn't be interrupted when the compilation completes because the completion would have the "action_id" of the first quick action.
It really seems like need to compare against running code hash. I know that's undesirable since can't have a lock free 32 byte atomic, but so many other locks have been sprayed all around the code maybe it's not a big deal for another.
Alternatively, to maintain a lock free atomic, could consider checking only on the first (or last) 16 bytes of the code hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! I'll give it some thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…th action id not matching currently running contract code
Note:start |
@@ -305,8 +306,9 @@ struct eosvmoc_tier { | |||
platform_timer& main_thread_timer; | |||
const wasm_interface::vm_type wasm_runtime_time; | |||
const wasm_interface::vm_oc_enable eosvmoc_tierup; | |||
std::atomic<uint64_t> executing_action_id{1}; // monotonic increasing for each action apply | |||
large_atomic<digest_type> executing_code_hash{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about sha256
prevents it being used as a atomic<T>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
undefined reference to __atomic_store
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That just means we're not linking to libatomic
(which is kind of surprising given the maturity of spring that it was never needed previously), I thought maybe something else about sha256 was non trivial and preventing usage of it with std::atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if there is anything else that prevents it. Should we start linking with libatomic
?
auto elapsed = fc::time_point::now() - queued_time; | ||
ilog("EOS VM OC tier up for ${id} compile complete ${t}ms", ("id", code_id)("t", elapsed.count()/1000)); | ||
auto expire_in = std::max(fc::microseconds(0), fc::milliseconds(500) - elapsed); | ||
std::shared_ptr<boost::asio::steady_timer> timer = std::make_shared<boost::asio::steady_timer>(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind me why we need to do this timer stuff instead of just immediately interrupting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. I would actually be in favor of immediately interrupting. The issue called for only interrupting if running longer than 500ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I see my confusion -- "running longer than 500ms" : is that really what is going on here? queued_time
is when the compilation was started, not how long the transaction or action has been running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Seems to me we could just always interrupt and make this a bit simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe it's a good thing we don't always interrupt, otherwise maybe in a worst case pathological situation the block could take 2x the time to apply (it would seem very very hard to coax this to occur but maybe I haven't really thought it through enough)
} catch (const interrupt_exception& e) { | ||
if (allow_oc_interrupt && eos_vm_oc_compile_interrupt) { | ||
++eos_vm_oc_compile_interrupt_count; | ||
wlog("EOS VM OC compile complete interrupt of ${r} <= ${a}::${act} code ${h}, interrupt #${c}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem like a wlog
? Nothing actionable by the user.
executing_code_hash.store(code_hash); | ||
try { | ||
get_instantiated_module(code_hash, vm_type, vm_version, context.trx_context)->apply(context); | ||
} catch (const interrupt_exception& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it always true that manually triggering the timer will result in an interrupt_exception
under all conditions? It feels like this can't be true with the current implementation of checktime()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm not understanding what you are getting at.
The main_thread_timer.expire_now()
will result in a interrupt_exception
if applying a block and the timer has not already been triggered. It should not be possible for the timer to have already fired when applying a block.
The code does need to be updated so that executing_code_hash
is only set if allow_oc_interrupt
is true
; looks like I missed that when moving to using code hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe the last sentence resolves the path I was thinking through.
@@ -390,6 +390,8 @@ namespace eosio { namespace chain { | |||
3080010, "Read-only transaction eos-vm-oc compile permanent failure" ) | |||
FC_DECLARE_DERIVED_EXCEPTION( interrupt_exception, resource_exhausted_exception, | |||
3080011, "Transaction interrupted by signal" ) | |||
FC_DECLARE_DERIVED_EXCEPTION( interrupt_oc_exception, resource_exhausted_exception, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but interrupt_exception
and interrupt_oc_exception
are too similar. Maybe interrupt_by_signal
and interrupt_by_oc_compile_exception
?
No need to change.
Restart long-running transaction with EOS VM OC when its compile completes.
get_descriptor_for_code
since it can do extra processing not related to specific action running.get_descriptor_for_code
transaction_context
_deadline
to match the transaction timer. This is a bit of cleanup and also makes sure resuming the timer works if we decide to put a limit onapply_block
in the future.Resolves #986