[5.0] always set OC's sandbox socket to blocking mode fixing random OC errors seen in CI & maybe elsewhere #2163
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
It has been relatively rare, but sometimes we've seen an unexplained
failed to compile wasm
error in EOS VM OC unittests. We've also seen at least one report from a user with a similar error. Up until now we've chalked this up to OC's subjective limits being inappropriately enabled during testing, as subjective limits could certainly explain this particular failure case where OC's sandbox fails to produce a response. These subjective limits are now disabled onrelease/5.0
andmain
so we can no longer attribute failures to that.After upgrading the CI image to the latest Linux kernel -- 6.7.1 at this time -- we've seen an explosion in the number of
failed to compile wasm
errors. I can also reproduce this issue locally with 6.7.x as well, as long as tests are run with a high-j
value; i.e. high system load.1The problem is occurring in OC's compile monitor component in its sandbox. When a compile is started an asio local datagram socket is created from a socketpair,
leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Lines 93 to 98 in e5d45ce
the compile monitor will then
async_wait()
on it,leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Lines 114 to 116 in e5d45ce
and then once the socket is indicated as ready, call
read_message_with_fds()
leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Line 125 in e5d45ce
which ultimately calls
recvmsg
,leap/libraries/chain/webassembly/runtimes/eos-vm-oc/ipc_helpers.cpp
Lines 36 to 41 in e5d45ce
read_message_with_fds()
is expectingrecvmsg()
to be a blocking call: no where does OC anywhere set up a socket to be non-blocking. However, asio'sasync_wait()
will always set the underlying file descriptor to be non-blocking. It doesn't seem like that would be a problem though: we only callrecvmsg()
after the kernel (through the asio abstraction) has indicatedEPOLLIN
so the "naive" (as defined shortly below) expectation is that any call torecvmsg()
will succeed with a proper message or indication of a connection failure.But on 6.7.1 sometimes the
async_wait()
indicates readiness but therecvmsg()
will return with anEAGAIN
. The current code considers that a fatal communication error with the compilation process and bails, thus giving us thefailed to compile wasm
error. And this is where some spec lawyering comes in to play. It's not clear to me what exactly is guaranteed via this interface: whenselect
returns indicating a socket is readable, is it a guarantee that a non blocking read must return a message on the next call? afaict no: it's only a guarantee that a blocking read will not block. It seems reasonable to expect a non blocking read would always return the message but I'm not convinced from the specification I've read that's guaranteed.Of course, asio isn't using
select
, it's using the linux specificepoll
... in edge triggered mode too. This further obscures what the rules ought to be, to me.It's also possible we've stumbled on a defect introduced in Linux. Unfortunately without either a small test case or bisecting where this issue started to become rampant, neither of which is a small effort, it's hard to approach the developers with it.
This PR now asks the asio socket to have its underlying file descriptor set to blocking mode after the
async_wait()
and before therecvmsg()
. This effectively removes the possibility ofEAGAIN
creeping up on us here. I also entertainedread_message_with_fds()
looping onEAGAIN
, as it already does forEINTR
. That also does work in testing. It's not immediately clear which way is better.One negative aspect of this change is that a
read_message_with_fds(asiosocket.native_handle())
will subvert the code that sets blocking mode. At the moment the only usages ofread_message_with_fds(int)
pass file descriptors that were never "asio-ified"Footnotes
I don't mean to make 6.7.x appear to be culprit even though I explicitly mention it by value multiple times in this PR. The CI image hasn't been touched in nearly a year (so a handful of kernel versions ago), and I don't run the tests with a high
-j
value locally regularly enough to know if this failure has only recently started occurring with such frequency. But it does seem very clear that within the last year some kernel behavior has shifted. ↩