
connection: fix implementation #7760

Closed

Conversation

perfect-daemon

No description provided.

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Please provide more information: is this for a performance increase, code clarity, or something else? This is going to be a big review for everyone else.

@perfect-daemon
Author

In the worst case it's just an approximation of a fix for all the added tests and some other known problems.

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Can you at least tell me (even privately) what bugs this is fixing? Otherwise this is possibly a re-write for clarity's sake, bordering on a re-write to meet your favored style of callbacks.

Granted, I prefer asio coroutines or this style over the current style, but I'm not sure it's worth a giant re-write. The review and testing for this is going to take a while because it impacts the p2p code too.

@perfect-daemon
Author

"Can you at least tell me (even privately) what bugs this is fixing?" at least those 3 that are triggered by added tests

@perfect-daemon
Author

Did you try to run the added tests from the first commit?

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Did you try to run the added tests from the first commit?

Yes, I can read the tests, but was a re-write the best way to fix these issues? Can the code changes be done piecemeal so that more reviewers can reasonably comment on this code over time? You've provided little auxiliary justification for these changes. The PR is basically asserting that your rewrite of the TCP server code, all at once, is worthwhile.

I'm going to assume by your sparse comments that you insist that an all-at-once re-write is necessary - which is likely to delay merging due to fewer reviewers. My TCP shutdown code is still waiting to be merged, and it's much smaller in scope.

@perfect-daemon
Author

It fixes everything that is possible without changing the public interface of connection.
There are many other problems that will require even bigger changes.
It doesn't even fix all problems within connection, since the rest of the code relies on some internal ugliness of connection.

@perfect-daemon
Author

I'm touching more at once since there are interconnected problems that require such a big scope.

@perfect-daemon
Author

perfect-daemon commented Jun 22, 2021

I'm going to assume by your sparse comments that you insist that an all-at-once re-write is necessary - which is likely to delay merging due to fewer reviewers. My TCP shutdown code is still waiting to be merged, and it's much smaller in scope.

That ssl shutdown patch doesn't fix the segfault within the ssl handshake or the other problems located within connection.

Contributor

@vtnerd left a comment

That ssl shutdown patch doesn't fix the segfault within the ssl handshake or the other problems located within connection.

It was supposed to address one issue, not claim to fix an unspecified number of things in a rewrite. FWIW, I think this is going to take multiple reviews to hammer out because it's likely just swapping one set of issues for another.

As an example, every outgoing write now appears to have a randomized wait_for call, which should have a measurable performance hit. And the following wait_for call is still a DoS as the thread is blocked (but not busy spin looping at least). So it's worse performance for ... ?

This was a light skim through, not really a review.


bool wait_read;
bool handle_read;
bool cancel_read;
Contributor

All of these cancel_ booleans are unnecessary; the standard technique is checking ec == boost::asio::error::operation_aborted (nearly every example uses this technique). This method is preferable because it catches all cancellations, even ones added in the future (which are currently only caught if the flag is set to true).
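For reference, a minimal sketch (not taken from this PR) of the error-code pattern being suggested, where the completion handler inspects the error_code instead of a cancel_ flag:

    #include <boost/asio.hpp>
    #include <iostream>

    int main()
    {
      boost::asio::io_context io;
      boost::asio::steady_timer timer(io, std::chrono::seconds(10));

      timer.async_wait([](const boost::system::error_code& ec) {
        if (ec == boost::asio::error::operation_aborted)
          return;                        // cancelled via cancel()/expires_after() - nothing to do
        if (ec)
          return;                        // some other failure
        std::cout << "timeout fired\n";  // genuine expiry
      });

      timer.cancel();  // the pending handler now runs with operation_aborted
      io.run();
    }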

Contributor

Repeating this again since it got lost in the back-and-forth, but the cancel_ booleans are not necessary. Requesting their removal is not "an optimization request" - if you are going to re-write the entirety of the connection code then don't leave new things for others to remove and deal with.

Collaborator

Each boolean ensures that no function is called more times than necessary (each action on the socket executes one at a time, as needed). The (seemingly bloated) boolean defense provides explicit synchronization, offering complete control over the execution flow without relying on boost internals, which, as evidenced by the seemingly segfault-prone handshake, can lead to screw-ups. Error codes are still checked where needed AFAICT (for example, here).

Perhaps the booleans go against the standard way of using boost ASIO, and perhaps this approach could be considered less future-proof as is, but given the issues this PR solves as constructed (and that the PR author hopes to solve and optimize further by building from this PR), I'd say it's a clear improvement over the current code that is worth shipping as is. He has also provided a defense for the decision to include them, and the bar is high enough for me to warrant inclusion given the scope of the PR's aims.

if (state.data.write.wait_consume ||
state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT
)
state.condition.wait_for(state.lock, random_delay(),
Contributor

This wait_for is going to need some justification, in the comments, and with some performance testing. Or just any explanation at all. Because this is slowing down all outgoing writes by an arbitrary amount of time.

Contributor

Wait, this was inside an if statement? Ugh, sorry, can we fix the style of this at least?

Contributor

OK, I see why this was duplicated, but there has to be a simpler, more concise technique. And seriously, {} this!

Author

Updated

state.data.write.wait_consume = true;
while (!message.empty()) {
if (state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT) {
state.condition.wait_for(state.lock, random_delay(),
Contributor

This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).

Author

Both connection and sender should be asynchronous to remove this problem.
It's partially solved now, as much as possible without changing the rest of the code.

Author

This queue will be removed from connection later, once everything else is asynchronous.

Contributor

You can't remove a queue, or at least you shouldn't, because posting to a strand is still a queue.

AFAIK there are 3 choices: (1) close connection (drop data), (2) block a thread to wait, or (3) unbounded queue. I think this needs to transition to (1) if an entire re-write is being done.
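As a hedged sketch of option (1), independent of this PR's actual interface (all names and the limit below are illustrative), the send path can refuse new data instead of blocking the calling thread:

    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <string>

    struct send_queue
    {
      static constexpr std::size_t MAX_QUEUED_MESSAGES = 100;

      // Returns false when the reader is too far behind; the caller should then
      // drop the data and/or close the connection rather than block.
      bool push(std::string message)
      {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.size() >= MAX_QUEUED_MESSAGES)
          return false;                        // drop instead of blocking
        queue_.push_back(std::move(message));
        return true;
      }

      std::mutex mutex_;
      std::deque<std::string> queue_;
    };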

Author

I can, I will, and it's needed. But it isn't easy.

Contributor

Asynchronous generator? You mean asynchronously generating bytes piecemeal? That's just a more complicated outgoing buffer. This has to be a drop or a block.

Author

What are the sources of data for the outgoing queue?

  1. Synchronous forwarding of notifications with txs/blocks, which can be made asynchronous with fetching from local blockchain storage.
  2. Requests/responses required by explicit synchronization, which can be generated asynchronously as soon as the outgoing/incoming fixed-size buffer is free.

In general, the daemon sends and receives only verified data that is stored/loaded to/from the blockchain db.
This is the only available unbounded buffer to store data.

Contributor

I already know exactly what you are proposing, and I am pointing out that some p2p messages require a response, and a queue of asynchronous events is still a queue. The commentary by the ZeroMQ developers on Pub/Sub seems appropriate here - if the read side is behind, eventually you have to drop data and/or the connection.

All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy-waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.

Author

This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).

I didn't understand that you were talking about the action after a failed conditional sleep.
I've added the missing terminate(); return false.

Collaborator

All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy-waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.

AFAICT this approach is a marked improvement over the current code:

  1. Reads/writes are no longer waiting on handle_recv or callbacks to complete (which execute in the newly introduced 2nd strand_), therefore they can process quicker and clear from the strand quicker.
  2. It uses a conditional wait that can unblock as soon as the queue is below the max.

Without a demonstrable DoS on this new code as is, which is evidently harder to DoS, I think the code is mergeable as is, and worth continuing to improve, which the PR author intends to do building from the structure of this PR's code.

Additionally, there is a risk in dropping the connection here, which is that it can feasibly harm honest usage of the daemon. It's going to take more work to dig deeper and make this entire section airtight to hit the optimal sweet spot imo. For now, this code is an improvement moving in a better direction.

state.timers.throttle.out.wait_expire = false;
if (state.timers.throttle.out.cancel_expire) {
state.timers.throttle.out.cancel_expire = false;
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

if (state.socket.cancel_write) {
state.socket.cancel_write = false;
state.data.write.queue.clear();
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

state.socket.wait_shutdown = false;
if (state.socket.cancel_shutdown) {
state.socket.cancel_shutdown = false;
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

template<typename T>
void connection<T>::start_read()
{
if (state.timers.throttle.in.wait_expire || state.socket.wait_read ||
Contributor

wait_read is only necessary because the callbacks invoke start_read ... again. This is unnecessary if the callback were a local struct that contains only the inner looping read calls. I think this is preferable because it separates the start and looping code, whereas this mixes the two. It also allows for fewer state machines.

Author

It will be needed later when everything is asynchronous, since the reader may not consume every received chunk immediately and it's possible that there are no active callbacks at all.

Contributor

The flag is only needed because start_read is both starting and continuing the reads. It is possible to separate the two, removing the need for the flag. After every async read (and subsequent processing), the continue function is called.
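A hedged, self-contained sketch of that separation (class and member names are illustrative, not the PR's code): start() is called once, and only the completion handler re-arms the next read, so no wait_read flag is needed.

    #include <boost/asio.hpp>
    #include <array>
    #include <cstddef>
    #include <memory>

    class reader : public std::enable_shared_from_this<reader>
    {
    public:
      explicit reader(boost::asio::ip::tcp::socket socket) : socket_(std::move(socket)) {}

      void start() { continue_read(); }       // called exactly once by the owner

    private:
      void continue_read()                    // re-armed only from its own completion handler
      {
        auto self = shared_from_this();       // keep the object alive across the async wait
        socket_.async_read_some(
          boost::asio::buffer(buffer_),
          [this, self](const boost::system::error_code& ec, std::size_t bytes)
          {
            if (ec == boost::asio::error::operation_aborted)
              return;                         // cancelled: the loop simply stops
            if (ec)
              return;                         // real error: let the owner tear down
            (void)bytes;                      // hand the received chunk to the protocol handler here
            continue_read();                  // arm the next read
          });
      }

      boost::asio::ip::tcp::socket socket_;
      std::array<char, 4096> buffer_;
    };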

Author

I didn't try to optimize the number of states / transitions.
It can be done later, unless it's needed for a significant performance increase or is related to the correctness of the code.

Contributor

This is just my personal opinion obviously, but -

if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states / transitions possible. Otherwise, what is the purpose of the re-write? I'm being critical because if this were done piecemeal, the code could've looked similar in the end to this PR, but the unnecessary code/states could have been removed more easily along the way during review. Removing the unnecessary bits makes it easier to maintain and debug.

As an example, the cancel_ state flags have to be removed, unless there is some justification why the standard way of checking for cancellation is inappropriate (none was provided above in the other comment thread). And this wait_read (that started this thread) doesn't appear to have any justification (yet), given how ASIO callbacks are typically done.

Author

... Removing the unnecessary bits makes it easier to maintain and debug.

It depends.

Author

if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states / transitions possible. Otherwise, what is the purpose of the re-write?

The main goal is to fix known problems and prepare the code for design changes, which can't be done in the case of greedy early optimization of every component.

Contributor

These explicit flags are needed to not depend on internal behaviour of boost asio, which doesn't allow cancelling an already dispatched callback.

I've thought about this case before, and typically it's not worth stressing about. In the read/write callbacks, either reset the timer immediately or call cancel immediately. The timer will be in the strand queue with the appropriate error_code.

On the off chance that the timer was queued just after the other event but before the cancel could occur ... the timer was so close to executing that it's not worth the extra state variable and logic. Let ASIO handle all of the state tracking because it's better "defensive coding" - the error_code cancel check handles every future update to how the timer is used.

It depends.

Just chop out these unnecessary states. There's a giant if ladder that can be chopped down, for instance.

The main goal is to fix known problems and prepare the code for design changes, which can't be done in the case of greedy early optimization of every component.

I dislike your usage of the word optimization in this context because it makes it sound like I am asking for a performance comparison. This is about making the code more concise and legible - particularly the start_read function.

Author

"concise - short and clear, expressing what needs to be said without unnecessary words"
"optimization - the process of making something as good or effective as possible"
Non-programming language is ambiguous enough to allow misinterpretation of anything so If current code is correct then any further changes can be treated as optimization.

This comment was marked as duplicate.

@perfect-daemon force-pushed the pr006_master branch 11 times, most recently from 07c5db3 to 184dedb on June 26, 2021 12:06
@@ -86,6 +90,181 @@ namespace net_utils
public i_service_endpoint,
public connection_basic
{
private:
using string_t = std::string;
Contributor

This "alias everything" style breaks from existing coding conventions, and it would preferable to revert to something closer to the existing style.

Contributor

+1 on this. Things like std::string or std::mutex shouldn't be renamed at all to avoid confusion. Long epee and boost names can probably be renamed.


bool wait_read;
bool handle_read;
bool cancel_read;
Contributor

Repeating this again since it got lost in the back-and-forth, but the cancel_ booleans are not necessary. Requesting their removal is not "an optimization request" - if you are going to re-write the entirety of the connection code then don't leave new things for others to remove and deal with.

const std::string& host,
std::chrono::milliseconds timeout) const
{
configure(socket, type, host);
Contributor

There's no reason to include these SSL changes with the connection re-write. Include these changes in a new PR with an explanation as to why they're being changed.

And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.

Collaborator

It solves the segfault in the handshake test case, and it does it in a manner that's really similar to the rest of the PR. I'd argue it's actually significantly more readable/easier to grok than the rest of the PR since it's fairly contained and isolated. It helped me as I got started reviewing the rest of the PR because of this.

I agree still that this can probably be a separate PR with just this commit or perhaps even fewer changes, but I'm not hugely opposed to it being here as is/don't consider it a PR blocker.


And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.

The primary fix here ensures that upon successful completion of the handshake, the cancel_timer boolean prevents the deadline timer from executing an erroneous close sequence in the timeout handler, which also demonstrates the implausibility of the test's segfault.

Perhaps there is a cleaner more standard solution to this with fewer changes as you mention, but it's a pretty clean section of code that seems to do the job as is I would say.

template<typename T>
void connection<T>::start_read()
{
if (state.timers.throttle.in.wait_expire || state.socket.wait_read ||

This comment was marked as duplicate.

Collaborator

@j-berman left a comment

Working through tests 2 and 3 still. Question on test 1...

};
strand.post(
[this, self, on_shutdown]{
connection_basic::socket_.async_shutdown(
Collaborator

@j-berman May 26, 2022

I notice that test 1 doesn't get in here; instead deleting a connection via the daemon invokes socket_.next_layer().shutdown inside on_terminating. Would this possibly leave the session vulnerable to a truncation attack?

I'm still building up my knowledge of the ssl protocol in general, but figure you'd have the experience to know definitively if this was an oversight and if this would be the case. Why call shutdown on the next layer in on_terminating and not on the higher-level socket like you do here? What makes that case different from this one? Hard to find clear answers on this on the internet (and in the code).


timers_t timers;
protocol_t protocol;
stat_t stat;
data_t data;
Contributor

All these class members should have an m_ prefix in their names - take src/cryptonote_basic/connection_context.h as an example of the naming convention. This is to avoid confusion with local variables in the class methods.

bool local{};
string_t host{};
state_t state{};
handler_t handler;
Contributor

Same, m_ prefix here and in other classes (structs are fine).

{
static lock_t hosts_mutex;
lock_guard_t guard(hosts_mutex);
static std::map<string_t, unsigned int> hosts;
Contributor

hosts_mutex and hosts should be moved to the connection class as static members to avoid generating initialization code inside the function. Also, this function doesn't remove an entry from hosts when val is 0, so hosts can grow indefinitely and consume all memory if an attacker exploits this.
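A hedged sketch of the erase-on-zero part of that suggestion (the update line is illustrative, not the PR's exact code):

    unsigned int count = (hosts[host] += delta);
    if (count == 0)
        hosts.erase(host);   // drop dead entries so the map cannot grow without bound
    return count;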

Contributor

@SChernykh May 30, 2022

Also, if delta is 0 (which is the common use of this function), it should have code like this

if (!delta) {
    auto it = hosts.find(host);
    return (it != hosts.end()) ? it->second : 0;
}

to avoid using operator[] which modifies the map and is slower.

Collaborator

Agree, this seems like a good suggestion. The code seems to already have this issue though; the PR author just moved the code so it showed up in the diff, FWIW. I'd say it's worth a separate PR.

return GET_IO_SERVICE(socket());
if (state.timers.general.wait_expire)
return;
state.timers.general.wait_expire = true;
Contributor

This sequence is not thread-safe. Should it be protected by state.lock? Maybe making wait_expire atomic and

    if (state.timers.general.wait_expire.exchange(true))
        return;

will be enough?

_dbg2("[" << print_connection_context_short(context) << "] fired_callback");
m_protocol_handler.handle_qued_callback();
CATCH_ENTRY_L0("connection<t_protocol_handler>::call_back_starter()", void());
if (not state.timers.general.wait_expire)
Contributor

This is very rarely used in C++ and it's definitely not used in other parts of the Monero codebase. Please change it to if (!state.timers.general.wait_expire).

""
);
auto self = connection<T>::shared_from_this();
if (not state.ssl.forced and not state.ssl.detected) {
Contributor

Same here, use normal C++ operator syntax (!, &&).

boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred)));
//_info("[sock " << socket().native_handle() << "]Async read requested.");
state.socket.wait_handshake = true;
Contributor

This is separated from the if (state.socket.wait_handshake) check and again not thread-safe. It should be done as an atomic operation and in one place, either here or at the beginning of the function.

Collaborator

AFAICT the lock should be held by the caller of the function here and in the other places.

{
if (state.timers.throttle.out.wait_expire || state.socket.wait_write ||
state.data.write.queue.empty() ||
(state.ssl.enabled && not state.ssl.handshaked)
Contributor

Here you even mix && with not. Please use normal C++ syntax.

if (state.socket.wait_shutdown)
return;
auto self = connection<T>::shared_from_this();
state.socket.wait_shutdown = true;
Contributor

Again, the same non-thread-safe pattern.

shutdown();
return false;
engine rng(seed);
return std::chrono::milliseconds(
Contributor

@SChernykh May 30, 2022

This is just plain stupid nonsense. The RNG must be initialized once and then used, not seeded from std::random_device every time (~2500 bytes!!!). At this stage, you could've just used std::random_device directly. This implementation will deplete the system's entropy very quickly.

Contributor

Yes, seeding the full state (2500 bytes) from std::random_device is totally ok and even good if you do it once. But this code must be rewritten.
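A hedged sketch of the once-per-thread seeding being asked for (the delay bound is illustrative):

    #include <chrono>
    #include <random>

    std::chrono::milliseconds random_delay()
    {
      // Seeded from std::random_device exactly once per thread, then reused.
      thread_local std::mt19937 rng{std::random_device{}()};
      std::uniform_int_distribution<int> dist(0, 100);
      return std::chrono::milliseconds(dist(rng));
    }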

Contributor

This implementation will deplete system's entropy very quickly.

That's indeed bad for a crypto project like Monero.

Contributor

It turns out this code has been there for a long time. Still, since this PR is almost a full rewrite, it's better to fix it like in #8365.

Contributor

I agree. Large rewrites aren't tolerated here from anybody, and for valid reasons.

Plus we can't use double standards.

@mj-xmr
Contributor

mj-xmr commented May 30, 2022

Please fix the merge conflicts first and allow the new code to pass the unchanged, previously existing tests.

There are possible exceptions to this rule of not touching the tests, but only if a given failing test is proven to be overfitted to the previous solution. But it's the contributor who needs to prove this fact.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

Can you at least tell me (even privately) what bugs this is fixing? Otherwise this is possibly a re-write for clarity's sake, bordering on a re-write to meet your favored style of callbacks.

Granted, I prefer asio coroutines or this style over the current style, but I'm not sure it's worth a giant re-write. The review and testing for this is going to take a while because it impacts the p2p code too.

-- vtnerd

"Can you at least tell me (even privately) what bugs this is fixing?" at least those 3 that are triggered by added tests

-- perfect-daemon

@perfect-daemon
I'd like to learn more about Monero's vulnerability process.

  • Regarding the zero-day vulnerabilities that you helped uncover for bad actors, especially through your unit tests, and have kept open to the public for a year already: were they addressed through Monero's Vulnerability Response Process?
  • Can you explain why the branch is not merged yet after almost a whole year? Have you done everything you could in order to satisfy your reviewers' requests?
  • If you haven't done enough, do you need help, maybe?

@selsta
Collaborator

selsta commented Jun 2, 2022

@mj-xmr You claimed you are involved with reviewing this patch but so far I only see you trolling. Not helpful.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

@selsta My review is so shallow because I'm stuffed with so much other work (*). But if you'd like me to help, I'd gladly create my own version based on the unit tests written by @perfect-daemon. This would mean that I have to switch my priorities and focus on the discovered vulnerabilities. No problem.

I'm also already trying to support @SChernykh in his attempt to do the same, like here: #8365. The fact that it turned out to be a dead end only speaks for him, that he can think self-critically, and not that I'm not helping.

(*) Currently I'm preparing a build-time testing framework for you, so that you can more objectively verify the compilation speedups that I do.

@selsta
Collaborator

selsta commented Jun 2, 2022

@mj-xmr No, I'd prefer if you don't pollute this PR (and my email) with troll comments.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

I'll try anyway.

Contributor

@jeffro256 left a comment

review round one (mostly net_ssl)

@@ -344,6 +351,7 @@ TEST(test_epee_connection, test_lifetime)
}
ASSERT_TRUE(shared_state->get_connections_count() == 1);
shared_state->del_out_connections(1);
while (shared_state->sock_count);
Contributor

This infinite loop is really not great. Give it a max timeout (using e.g. usleep) and then fail the test if it takes too long.
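Something along these lines, as a hedged sketch (the constants and sleep interval are illustrative):

    constexpr int MAX_WAIT_MS = 5000;
    int waited_ms = 0;
    while (shared_state->sock_count) {
      ASSERT_LT(waited_ms, MAX_WAIT_MS) << "sockets did not close in time";
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      waited_ms += 10;
    }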

@@ -454,7 +462,11 @@ TEST(test_epee_connection, test_lifetime)
}
for (;workers.size(); workers.pop_back())
workers.back().join();

while (server.get_connections_count()) {
Contributor

Same here

@@ -278,6 +280,11 @@ TEST(test_epee_connection, test_lifetime)
ASSERT_TRUE(shared_state->get_connections_count() == 0);
constexpr auto DELAY = 30;
constexpr auto TIMEOUT = 1;
while (server.get_connections_count()) {
Contributor

Another (potentially) infinite loop in the test suite.

auto start_handshake = [&]{
using ec_t = boost::system::error_code;
using timer_t = boost::asio::steady_timer;
using strand_t = boost::asio::io_service::strand;
Contributor

Again with the type aliasing. I think at a certain point it starts to really hinder readability. Also, what other type would we use for strand_t? It's such a specialized type that it wouldn't make sense to use anything else, and anything else is not going to have the same interface anyway.

void connection<T>::on_interrupted()
{
assert(state.status == status_t::INTERRUPTED);
if (state.timers.general.wait_expire)
Contributor

switch statement would be better here

void connection<T>::on_terminating()
{
assert(state.status == status_t::TERMINATING);
if (state.timers.general.wait_expire)
Contributor

switch statement would be better here

socket.async_handshake(
type,
boost::asio::buffer(buffer),
strand.wrap(on_handshake)
Contributor

io_service::strand::wrap is no longer available in newer versions of Boost (>= 1.66).

Collaborator

I have run this code with boost versions >= 1.66, any idea?


Contributor

I don't know @selsta, you might have been running a compatibility library? Or the versioning system is off?

Collaborator

@selsta Jun 2, 2022

Could it be that you looked in the wrong file?

https://github.com/boostorg/asio/blob/develop/include/boost/asio/io_context_strand.hpp has wrap and io_service_strand.hpp just includes io_context_strand.hpp.

I just know that it compiles fine with the latest boost version without any compatibility library or whatever so there's something else going on.

Contributor

@jeffro256 Jun 2, 2022

Figured it out: it's deprecated in favor of bind_executor.

Collaborator

@selsta Jun 2, 2022

We still support 1.59, and bind_executor isn't supported in that version, so we would have to add an #if.
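One possible shape for that #if, as a hedged sketch (the helper name is illustrative, this isn't code from the PR, and it assumes C++14 plus the still-enabled io_service alias):

    #include <boost/version.hpp>
    #include <boost/asio.hpp>
    #include <utility>

    template <typename Handler>
    auto wrap_in_strand(boost::asio::io_service::strand& strand, Handler&& handler)
    {
    #if BOOST_VERSION >= 106600
      return boost::asio::bind_executor(strand, std::forward<Handler>(handler));
    #else
      return strand.wrap(std::forward<Handler>(handler));
    #endif
    }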

@@ -472,12 +474,10 @@ bool ssl_options_t::has_fingerprint(boost::asio::ssl::verify_context &ctx) const
return false;
}

bool ssl_options_t::handshake(
void ssl_options_t::configure(
Contributor

It's good that this is renamed to configure.

timer_t deadline(io_context, timeout);

struct state_t {
lock_t lock;
Contributor

Using a lock in asynchronous code is awkward. Asynchronous code touching the same object should be running from the same strand and shouldn't be executing concurrently, eliminating the need for locks.

Collaborator

Did you check the strand_deadlock test case? I'm quite sure some of these design decisions were intentional to solve this issue, together with setting -DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION.

But I also don't want to give false info, so maybe ignore this if it doesn't make sense.

Contributor

@jeffro256 Jun 2, 2022

Yes, I'm sure the author put the locks in there for a reason, due to the current design of handshake, but since this is a complete rewrite, it doesn't make sense to have code running asynchronously from a single executor AND have status booleans AND have it be manually, externally locked. Even if it functions correctly, this will likely gum up the io_context of the socket waiting for locks, and it makes it a ton harder to track what's going on.

Collaborator

In this case it ensures synchronization between the handshake, which executes inside the strand, and the deadline timer, which handles a timeout sequence outside the strand in the event of a slow handshake. I'm having trouble seeing how it would be better constructed without a lock.

In other cases it ensures synchronization across the 2 separate strands.

state.condition.wait_for(
state.lock,
std::chrono::milliseconds(30),
[&]{
Contributor

Using a stop_waiting predicate inside a while loop is redundant

Contributor

@jeffro256 left a comment

Round 2

socket.handshake(boost::asio::ssl::stream_base::client, ec);
EXPECT_EQ(ec.value(), 0);
std::this_thread::sleep_for(std::chrono::milliseconds(100));
while (server.get_config_shared()->get_connections_count() < 1);
Contributor

potentially infinite while loop


This test expects this condition to be met, yet should instead error if this condition isn't, as this condition not being met is a failure of the code being run. I'd also appreciate a sleep statement being added to reduce CPU utilization while this condition is checked for. Same for the other infinite loop noted below this (and likely any others).

std::this_thread::sleep_for(std::chrono::milliseconds(100));
while (server.get_config_shared()->get_connections_count() < 1);
server.get_config_shared()->del_in_connections(1);
while (server.get_config_shared()->get_connections_count() > 0);
Contributor

potentially infinite while loop

using network_throttle_t = epee::net_utils::network_throttle;
using network_throttle_manager_t = epee::net_utils::network_throttle_manager;

unsigned int host_count(int delta = 0);
Contributor

Even if it is just a single line, please document these private methods.

@@ -86,6 +90,181 @@ namespace net_utils
public i_service_endpoint,
public connection_basic
{
private:
using string_t = std::string;
using handler_t = t_protocol_handler;
Contributor

This alias can be confusing because of the use of the term "handler" with regard to asio.

@@ -1006,6 +1006,7 @@ if(STATIC)
set(Boost_USE_STATIC_RUNTIME ON)
endif()
find_package(Boost 1.58 QUIET REQUIRED COMPONENTS system filesystem thread date_time chrono regex serialization program_options locale)
add_definitions(-DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION)
Contributor

Not that I'm necessarily opposed to this change, but what is the rationale for switching to round-robin allocation of strands? Was there any evidence that deadlocks with the current code are related to non-round-robin allocation of strands?

Collaborator

The core of the PR's solution to the strand deadlock test case (test 3) is to use 2 strands (strand and strand_). Might be clearer to see how over in this code where I tried to pull out the solution to just test 3.

If boost happens to randomly allocate the same strand implementation for both strands (boost randomly allocates from a pool of 193 by default), then a deadlock can resurface, since the writes won't be able to clear from the strand that handle_recv is posting to (the writes get stuck in the same strand behind the handle_recv = deadlock).

Allocating strands round-robin avoids this because it should guarantee that both strands in each connection will have distinct implementations allocated (since connections themselves are also only created one at a time).

If you try running his test 3 without this definition, you'll eventually run into a test failure at some point, which should be the collision case described above.

bool handshaked;
};

struct socket_t {
Contributor

socket_t now refers to 2 types: this one and boost::asio::ip::tcp::socket, and is ambiguous.

bool cancel_shutdown;
};

struct timer_t {
Contributor

timer_t now refers to 2 types: this one and boost::asio::steady_timer, and is ambiguous.

data_t data;
};

using status_t = typename state_t::status_t;
Contributor

Aliasing nested user-defined types is not a design choice I think is easy on future maintainers/readers. If you want your new type to be scoped to another, do that. If you want it to be top-level, do that. But don't nest it and then alias it back up to the top, IMO.

terminate();
}
else if (
not epee::net_utils::is_ssl(
Contributor

Like others have suggested, use ! instead of not.

Collaborator

@j-berman left a comment

Generally, I agree the style is non-conforming and it would be nicer if it conformed to existing conventions better. But I don't think that is a good blocker for a PR that solves reliability issues, especially one that (justifiably, imo) rewrites large sections to do so. The author has fair justification for the decisions made in the code (some reasoning explained via DM). As is, this PR is a clear improvement over the current code.

Further, in discussions with the author, and in some of the answers given here, this PR is a stepping stone to even more improvements. To that end, I do think this is in line with the desired piecemeal approach to solving issues in the daemon. This PR does indeed seem to be a smaller piece of a puzzle toward even larger improvements, even if it is fairly large in its own right.

I'll approve it once merge conflicts are resolved satisfactorily.


Separate note...

I strongly suggest anyone who wants to fully understand this code run the old code with the provided tests and observe where the failures occur, then review the 3 commits of this branch to understand the code in this PR that is fixing the issues, then read and step through this PR's code to understand how the code is working from a bigger picture perspective. Happy to answer questions/help make it clear what the issues are and what is being fixed. The PR author was also helpful to me in understanding this PR over DM, mainly after I made it clear I did step through the code myself and demonstrated I understood various problems it's solving.


@Gingeropolous
Collaborator

Because of #8426, can this PR be closed?

@selsta
Collaborator

selsta commented Jul 2, 2023

Merged in #8426

@selsta closed this Jul 2, 2023