connection: fix implementation #7760
Please provide more information: is this for a performance increase, code clarity, or something else? This is going to be a big review for everyone else.
In the worst case, it's just an approximation of a fix for all the added tests and some other known problems.
(force-pushed 0278b54 → 23f6b1b)
Can you at least tell me (even privately) what bugs this is fixing? Otherwise this is possibly a re-write for clarity's sake, bordering on a re-write to meet your favored style of callbacks. Granted, I prefer asio coroutines or this style over the current style, but I'm not sure it's worth a giant re-write. The review and testing for this is going to take a while because it impacts the p2p code too.
"Can you at least tell me (even privately) what bugs this is fixing?" at least those 3 that are triggered by added tests |
Did you try to run added tests from the first commit ? |
Yes, I can read the tests, but was a re-write the best way to fix these issues? Can the code changes be done piecemeal so that more reviewers can reasonably comment on this code over time? You've provided little auxiliary justification for these changes. The PR is basically asserting that your rewrite of the TCP server code, all at once, is worthwhile. I'm going to assume by your sparse comments that you insist an all-at-once re-write is necessary, which is likely to delay merging due to fewer reviewers. My TCP shutdown code is still waiting to be merged, and it's much smaller in scope.
It fixes everything that is possible without changing the public interface of connection.
I'm touching more at once because there are interconnected problems that require such a big scope.
That SSL shutdown patch doesn't fix the segfault within the SSL handshake and other problems located within connection.
(force-pushed 23f6b1b → e30f356)
> That SSL shutdown patch doesn't fix the segfault within the SSL handshake and other problems located within connection.

It was supposed to address one issue, not claim to fix an unspecified number of things in a rewrite. FWIW, I think this is going to take multiple reviews to hammer out, because it's likely just swapping one set of issues for another.
As an example, every outgoing write now appears to have a randomized wait_for call, which should have a measurable performance hit. And the following wait_for call is still a DoS as the thread is blocked (but not busy spin looping, at least). So it's worse performance for ... ?
This was a light skim through, not really a review.
  bool wait_read;
  bool handle_read;
  bool cancel_read;
All of these cancel_ booleans are unnecessary; the standard technique is checking ec == boost::asio::error::operation_aborted (nearly every example uses this technique). This method is preferable because it catches all cancellations, even ones added in the future (which are currently only caught if the flag is set to true).
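For reference, a minimal standalone sketch of that standard check (a toy example, not the PR's code):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main()
{
    boost::asio::io_service io;
    boost::asio::steady_timer timer(io, std::chrono::seconds(5));

    // The handler inspects the error_code instead of consulting a cancel_ flag.
    timer.async_wait([](const boost::system::error_code& ec)
    {
        if (ec == boost::asio::error::operation_aborted)
        {
            std::cout << "timer cancelled\n"; // covers every cancellation path
            return;
        }
        std::cout << "timer expired\n";       // genuine timeout
    });

    timer.cancel(); // the pending handler now runs with operation_aborted
    io.run();
    return 0;
}
```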
Repeating this again since it got lost in the back-and-forth, but the cancel_ booleans are not necessary. Requesting their removal is not "an optimization request" - if you are going to re-write the entirety of the connection code, then don't leave new things for others to remove and deal with.
Each boolean ensures that no function is called more times than necessary (each action on the socket executes one at a time, as needed). The (seemingly bloated) boolean defense provides explicit synchronization, offering complete control over the execution flow without relying on boost internals, which, as evidenced by the seemingly segfault-prone handshake, can lead to screw-ups. Error codes are still checked where needed AFAICT (for example, here).
Perhaps the booleans go against the standard way of using boost ASIO, and perhaps this approach could be considered less future-proof as is, but given the issues this PR solves as constructed (and that the PR author hopes to solve and optimize further by building from this PR), I'd say it's a clear improvement over the current code that is worth shipping as is. He has also defended the decision to include them, and that clears the bar for inclusion given the scope of the PR's aims.
  if (state.data.write.wait_consume ||
    state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT
  )
    state.condition.wait_for(state.lock, random_delay(),
This wait_for is going to need some justification, in the comments, and with some performance testing. Or just any explanation at all. Because this is slowing down all outgoing writes by an arbitrary amount of time.
Wait, this was inside an if statement? Ugh, sorry, can we fix the style of this at least?
OK, I see why this was duplicated, but there has to be a simpler, more concise technique. And seriously, put {} around this!
Updated
  state.data.write.wait_consume = true;
  while (!message.empty()) {
    if (state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT) {
      state.condition.wait_for(state.lock, random_delay(),
This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).
Both connection and sender should be asynchronous to remove this problem.
It's partially solved now, as much as possible without changing the rest of the code.
This queue will be removed from connection later, once everything else is asynchronous.
You can't remove a queue, or at least you shouldn't, because post-ing to a strand is still a queue.
AFAIK there are 3 choices: (1) close the connection (drop data), (2) block a thread to wait, or (3) an unbounded queue. I think this needs to transition to (1) if an entire re-write is being done.
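A minimal sketch of choice (1): a bounded queue whose caller drops the connection on overflow (illustrative names and bound, not the PR's API):

```cpp
#include <deque>
#include <mutex>
#include <string>

constexpr std::size_t MAX_QUEUE = 1024; // arbitrary bound for illustration

struct send_queue_t
{
    std::mutex mutex;
    std::deque<std::string> queue;

    // Returns false when the peer is too far behind; the caller should then
    // terminate the connection instead of blocking a thread.
    bool push(std::string payload)
    {
        std::lock_guard<std::mutex> guard(mutex);
        if (queue.size() >= MAX_QUEUE)
            return false;
        queue.push_back(std::move(payload));
        return true;
    }
};
```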
I can, I will, and it's needed. But it isn't easy.
Asynchronous generator? You mean asynchronously generate bytes piecemeal? That's just a more complicated outgoing buffer. This has to be a drop or block.
What are the sources of data for the outgoing queue?
- Synchronous forwarding of notifications with txs/blocks, which can be made asynchronous by fetching from local blockchain storage.
- Requests/responses required by explicit synchronization, which can be generated asynchronously as soon as the outgoing/incoming fixed-size buffer is free.
In general, the daemon sends and receives only verified data that is stored/loaded to/from the blockchain DB. That is the only available unbounded buffer to store data.
I already know exactly what you are proposing, and I am pointing out that some p2p messages require a response, and a queue of asynchronous events is still a queue. The commentary by the ZeroMQ developers on Pub/Sub seems appropriate here - if the read side is behind, eventually you have to drop data and/or the connection.
All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.
> This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).

I didn't understand that you were talking about the action after a failed conditional sleep. I've added the missing terminate(); return false.
> All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.

AFAICT this approach is a marked improvement over the current code:
- Reads/writes are no longer waiting on handle_recv or callbacks to complete (which execute in the newly introduced 2nd strand_), therefore they can process quicker and clear from the strand quicker.
- It uses a conditional wait that can unblock as soon as the queue is less than the max.
Without a demonstrable DoS on this new code, which is evidently harder to DoS, I think the code is mergeable as is and worth continuing to improve, which the PR author intends to do by building from the structure of this PR's code.
Additionally, there is a risk in dropping the connection here, which is that it can feasibly harm honest usage of the daemon. It's going to take more work to dig deeper and make this entire section airtight to hit the optimal sweet spot, imo. For now, this code is an improvement moving in a better direction.
  state.timers.throttle.out.wait_expire = false;
  if (state.timers.throttle.out.cancel_expire) {
    state.timers.throttle.out.cancel_expire = false;
    if (state.status == status_t::RUNNING)
A switch statement would be better here.
  if (state.socket.cancel_write) {
    state.socket.cancel_write = false;
    state.data.write.queue.clear();
    if (state.status == status_t::RUNNING)
A switch statement would be better here.
  state.socket.wait_shutdown = false;
  if (state.socket.cancel_shutdown) {
    state.socket.cancel_shutdown = false;
    if (state.status == status_t::RUNNING)
A switch statement would be better here.
  template<typename T>
  void connection<T>::start_read()
  {
    if (state.timers.throttle.in.wait_expire || state.socket.wait_read ||
wait_read is only necessary because the callbacks invoke start_read ... again. This is unnecessary if the callback were a local struct that contains only the inner looping read calls. I think this is preferable because it separates the start and looping code, whereas this mixes the two. It also allows for fewer state machines.
It will be needed later, when everything is asynchronous, since the reader may not consume every received chunk immediately and it's possible that there are no active callbacks at all.
The flag is only needed because start_read is both starting and continuing the reads. It is possible to separate the two, removing the need for the flag: after every async read (and subsequent processing), the continue function is called. A sketch of the split follows.
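A simplified, non-template sketch of that split (illustrative names; the PR's actual class is templated on the protocol handler):

```cpp
#include <boost/asio.hpp>
#include <array>
#include <memory>

struct reader : std::enable_shared_from_this<reader>
{
    boost::asio::ip::tcp::socket socket;
    std::array<char, 4096> read_buffer;

    explicit reader(boost::asio::io_service& io) : socket(io) {}

    // Called exactly once at startup; only issues the first read.
    void start_read() { issue_read(); }

    // The handler continues the loop itself, so no wait_read guard is needed.
    void issue_read()
    {
        auto self = shared_from_this();
        socket.async_read_some(
            boost::asio::buffer(read_buffer),
            [self](const boost::system::error_code& ec, std::size_t bytes)
            {
                if (ec) return;        // includes operation_aborted on shutdown
                self->process(bytes);  // hand the chunk to the protocol layer
                self->issue_read();    // continue reading directly
            });
    }

    void process(std::size_t) { /* protocol handling elided */ }
};
```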
I didn't try to optimize the number of states/transitions. That can be done later, unless it's needed for a significant performance increase or it's related to the correctness of the code.
This is just my personal opinion obviously, but if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states/transitions possible. Otherwise, what is the purpose of the re-write? I'm being critical because if this were done piecemeal, the code could've looked similar in the end to this PR, but the unnecessary code/states could have been removed more easily along the way during review. Removing the unnecessary bits makes it easier to maintain and debug.
As an example, the cancel_ state flags have to be removed, unless there is some justification why the standard way of checking for cancellation is inappropriate (none was provided above in the other comment thread). And this wait_read (that started this thread) doesn't appear to have any justification (yet), given how ASIO callbacks are typically done.
> ... Removing the unnecessary bits makes it easier to maintain and debug.

It depends.
> if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states / transitions possible. Otherwise, what is the purpose of the re-write?

The main goal is to fix known problems and prepare the code for design changes, which can't be done with greedy early optimization of every component.
> These explicit flags are needed to not depend on internal behaviour of boost asio, which doesn't allow cancelling an already dispatched callback.

I've thought about this case before, and typically it's not worth stressing about. In the read/write callbacks, either reset the timer immediately or call cancel immediately. The timer will be in the strand queue with the appropriate error_code.
In the off chance that it queued the timer just after the other event but before the cancel could occur ... the timer was so close to executing that it's not worth the extra state variable and logic. Let ASIO handle all of the state tracking because it's better "defensive coding" - the error_code cancel check handles every future update to how the timer is used.

> It depends.

Just chop out these unnecessary states. There's a giant if ladder that can be chopped down, for instance.

> The main goal is to fix known problems and prepare the code for design changes, which can't be done with greedy early optimization of every component.

I dislike your usage of the word optimization in this context because it makes it sound like I am asking for a performance comparison. This is about making the code more concise and legible - particularly the start_read function.
"concise - short and clear, expressing what needs to be said without unnecessary words"
"optimization - the process of making something as good or effective as possible"
Non-programming language is ambiguous enough to allow misinterpretation of anything so If current code is correct then any further changes can be treated as optimization.
(force-pushed 07c5db3 → 184dedb, then 184dedb → 925c036)
  @@ -86,6 +90,181 @@ namespace net_utils
    public i_service_endpoint,
    public connection_basic
  {
  private:
    using string_t = std::string;
This "alias everything" style breaks from existing coding conventions, and it would preferable to revert to something closer to the existing style.
+1 on this. Things like std::string or std::mutex shouldn't be renamed at all, to avoid confusion. Long epee and boost names can probably be renamed.
    const std::string& host,
    std::chrono::milliseconds timeout) const
  {
    configure(socket, type, host);
There's no reason to include these SSL changes with the connection re-write. Include these changes in a new PR with an explanation as to why it's being changed.
And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.
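A sketch of that atomic_flag pattern, where whichever completion fires first wins (illustrative names, not the PR's code):

```cpp
#include <atomic>
#include <boost/asio.hpp>

struct handshake_guard
{
    std::atomic_flag completed = ATOMIC_FLAG_INIT;

    void on_handshake(const boost::system::error_code& ec)
    {
        if (completed.test_and_set())
            return;             // the timeout path already won
        // ... record ec and proceed with the connection
    }

    void on_timeout(const boost::system::error_code& ec)
    {
        if (ec == boost::asio::error::operation_aborted || completed.test_and_set())
            return;             // timer cancelled, or the handshake finished first
        // ... close the socket
    }
};
```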
It solves the seg fault in the handshake test case, and it does it in a manner that's really similar to the rest of the PR. I'd argue it's actually significantly more readable/easier to grok than the rest of the PR, since it's fairly contained and isolated. It helped me as I got started reviewing the rest of the PR because of this.
I still agree that this can probably be a separate PR with just this commit, or perhaps even fewer changes, but I'm not hugely opposed to it being here as is / don't consider it a PR blocker.

> And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.

The primary fix here ensures that upon successful completion of the handshake, the cancel_timer boolean prevents the deadline timer from executing an erroneous close sequence in the timeout handler, which also rules out the test's seg fault.
Perhaps there is a cleaner, more standard solution to this with fewer changes, as you mention, but it's a pretty clean section of code that seems to do the job as is, I would say.
Working through tests 2 and 3 still. Question on test 1...
  };
  strand.post(
    [this, self, on_shutdown]{
      connection_basic::socket_.async_shutdown(
I notice that test 1 doesn't get in here; instead, deleting a connection via the daemon invokes socket_.next_layer().shutdown inside on_terminating. Would this possibly leave the session vulnerable to a truncation attack?
I'm still building up my knowledge of the SSL protocol in general, but figured you'd have the experience to know definitively whether this was an oversight and whether this would be the case. Why call shutdown on the next layer in on_terminating and not on the higher-level socket like you do here? What makes that case different from this one? It's hard to find clear answers on this on the internet (and in the code).
The PR author's reasoning for not invoking socket_.async_shutdown when terminating a connection seems to be two-fold:
- the risk of a truncation attack is not practically significant
- using async_shutdown on connection close would require a timeout in the event of an unresponsive client, which arguably opens a wider and more significant surface for abuse than a truncation attack (see the sketch below)
Seems reasonable to me.
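For context, a sketch of what the avoided alternative would look like: async_shutdown raced against a deadline timer (illustrative names and timeout, not code from the PR):

```cpp
#include <boost/asio.hpp>
#include <boost/asio/ssl.hpp>
#include <chrono>
#include <memory>

struct ssl_conn : std::enable_shared_from_this<ssl_conn>
{
    boost::asio::ssl::stream<boost::asio::ip::tcp::socket> socket;
    boost::asio::steady_timer deadline;

    ssl_conn(boost::asio::io_service& io, boost::asio::ssl::context& ctx)
      : socket(io, ctx), deadline(io) {}

    void graceful_shutdown()
    {
        auto self = shared_from_this();
        deadline.expires_from_now(std::chrono::seconds(5)); // assumed timeout
        deadline.async_wait([self](const boost::system::error_code& ec){
            boost::system::error_code ignored;
            if (ec != boost::asio::error::operation_aborted)
                self->socket.next_layer().close(ignored); // peer unresponsive: give up
        });
        socket.async_shutdown([self](const boost::system::error_code&){
            boost::system::error_code ignored;
            self->deadline.cancel();                      // close_notify exchanged in time
            self->socket.next_layer().close(ignored);
        });
    }
};
```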
  timers_t timers;
  protocol_t protocol;
  stat_t stat;
  data_t data;
All these class members should have an m_ prefix in their names - take src/cryptonote_basic/connection_context.h as an example of the naming convention. This is to avoid confusion with local variables in the class methods.
  bool local{};
  string_t host{};
  state_t state{};
  handler_t handler;
Same: m_ prefix here and in the other classes (structs are fine).
  {
    static lock_t hosts_mutex;
    lock_guard_t guard(hosts_mutex);
    static std::map<string_t, unsigned int> hosts;
hosts_mutex and hosts should be moved into the connection class as static members to avoid generating initialization code inside the function. Also, this function doesn't remove an entry from hosts when val is 0, so hosts can grow indefinitely and consume all memory if an attacker exploits this.
Also, if delta is 0 (which is the common use of this function), it should have code like this

  if (!delta) {
    auto it = hosts.find(host);
    return (it != hosts.end()) ? it->second : 0;
  }

to avoid using operator[], which modifies the map and is slower.
Agree, this seems like a good suggestion. The code already had this issue though; the PR author just moved the code, so it showed up in the diff, FWIW. I'd say it's worth a separate PR. A sketch combining both suggestions follows.
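A sketch combining both suggestions, written as a hypothetical free function (the PR's version is a connection member):

```cpp
#include <map>
#include <mutex>
#include <string>

unsigned int host_count(const std::string& host, int delta = 0)
{
    static std::mutex hosts_mutex;
    std::lock_guard<std::mutex> guard(hosts_mutex);
    static std::map<std::string, unsigned int> hosts;

    if (delta == 0) {                 // common case: a read-only lookup
        auto it = hosts.find(host);
        return it != hosts.end() ? it->second : 0;
    }

    unsigned int& val = hosts[host];
    val += delta;                     // assumes callers never underflow the count
    const unsigned int result = val;
    if (result == 0)
        hosts.erase(host);            // plug the unbounded-growth leak
    return result;
}
```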
    return GET_IO_SERVICE(socket());
    if (state.timers.general.wait_expire)
      return;
    state.timers.general.wait_expire = true;
This sequence is not thread-safe. Should it be protected by state.lock? Maybe making wait_expire atomic and

  if (state.timers.general.wait_expire.exchange(true))
    return;

will be enough?
    _dbg2("[" << print_connection_context_short(context) << "] fired_callback");
    m_protocol_handler.handle_qued_callback();
    CATCH_ENTRY_L0("connection<t_protocol_handler>::call_back_starter()", void());
    if (not state.timers.general.wait_expire)
This is very rarely used in C++ and it's definitely not used in other parts of the Monero codebase. Change it to if (!state.timers.general.wait_expire), please.
"" | ||
); | ||
auto self = connection<T>::shared_from_this(); | ||
if (not state.ssl.forced and not state.ssl.detected) { |
Same here, use the normal C++ operator syntax: ! and &&.
        boost::asio::placeholders::error,
        boost::asio::placeholders::bytes_transferred)));
    //_info("[sock " << socket().native_handle() << "]Async read requested.");
    state.socket.wait_handshake = true;
This is separated from the if (state.socket.wait_handshake) check and again not thread-safe. It should be done as an atomic operation and in one place, either here or at the beginning of the function.
AFAICT the lock should be held by the caller of the function, here and in the other places.
  {
    if (state.timers.throttle.out.wait_expire || state.socket.wait_write ||
      state.data.write.queue.empty() ||
      (state.ssl.enabled && not state.ssl.handshaked)
Here you even mix && with not. Please use normal C++ syntax.
    if (state.socket.wait_shutdown)
      return;
    auto self = connection<T>::shared_from_this();
    state.socket.wait_shutdown = true;
Again, the same non-thread-safe pattern.
    shutdown();
    return false;
    engine rng(seed);
    return std::chrono::milliseconds(
This is just plain stupid nonsense. The RNG must be initialized once and then used, not seeded from std::random_device every time (~2500 bytes!!!). At this stage, you could've just used std::random_device directly. This implementation will deplete the system's entropy very quickly.
Yes, seeding the full state (2500 bytes) from std::random_device is totally ok and even good if you do it once. But this code must be rewritten.
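A sketch of such a rewrite: seed one engine per thread and reuse it (the 0-100 ms range is an assumption for illustration):

```cpp
#include <chrono>
#include <random>

std::chrono::milliseconds random_delay()
{
    // Seeded exactly once per thread, then reused for every call.
    thread_local std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 100);
    return std::chrono::milliseconds(dist(rng));
}
```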
> This implementation will deplete the system's entropy very quickly.

That's indeed bad for a crypto project like Monero.
It turns out this code has been there for a long time. Still, since this PR is almost a full rewrite, it's better to fix it as in #8365.
I agree. Large rewrites aren't tolerated here from anybody, and for valid reasons.
Plus, we can't use double standards.
Please fix the merge conflicts first and allow the new code to pass the unchanged, previously existing tests. There are possible exceptions to this rule of not touching the tests, but only if the given failing test is proven to be overfitting the previous solution. And it's the contributor who needs to prove this fact.
@perfect-daemon
@mj-xmr You claimed you are involved with reviewing this patch, but so far I only see you trolling. Not helpful.
@selsta My review is so shallow because I'm stuffed with so much other work (*). But if you'd like me to help, I'd gladly create my own version, based on the unit tests written by @perfect-daemon. This would mean that I have to switch my priorities and focus on the discovered vulnerabilities. No problem. I'm also already trying to support @SChernykh in his attempt to do the same, like here: #8365. That it turned out to be a dead end only speaks for him (that he can think self-critically), not that I'm not helping. (*) Currently I'm preparing a build-time testing framework for you, so that you can more objectively verify the compilation speedups that I do.
@mj-xmr No, I'd prefer if you don't pollute this PR (and my email) with troll comments.
I'll try anyway.
review round one (mostly net_ssl)
  @@ -344,6 +351,7 @@ TEST(test_epee_connection, test_lifetime)
  }
  ASSERT_TRUE(shared_state->get_connections_count() == 1);
  shared_state->del_out_connections(1);
  while (shared_state->sock_count);
This infinite loop is really not great. Give it a max timeout (using e.g. usleep) and then fail the test if it takes too long.
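A sketch of such a bounded wait as a gtest helper (names and limits are arbitrary):

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <gtest/gtest.h>

inline void wait_until(const std::function<bool()>& done)
{
    for (int i = 0; !done(); ++i)
    {
        ASSERT_LT(i, 1000) << "condition not reached in time"; // ~10 s cap
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

// Usage, replacing `while (shared_state->sock_count);`:
//   wait_until([&]{ return shared_state->sock_count == 0; });
```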
  @@ -454,7 +462,11 @@ TEST(test_epee_connection, test_lifetime)
  }
  for (;workers.size(); workers.pop_back())
    workers.back().join();

  while (server.get_connections_count()) {
Same here
  @@ -278,6 +280,11 @@ TEST(test_epee_connection, test_lifetime)
  ASSERT_TRUE(shared_state->get_connections_count() == 0);
  constexpr auto DELAY = 30;
  constexpr auto TIMEOUT = 1;
  while (server.get_connections_count()) {
Another (potentially) infinite loop in the test suite.
  auto start_handshake = [&]{
    using ec_t = boost::system::error_code;
    using timer_t = boost::asio::steady_timer;
    using strand_t = boost::asio::io_service::strand;
Again with the type aliasing. I think at a certain point it starts to really hinder readability. Also, what other type would we use for strand_t? It's such a specialized type that it wouldn't make sense to use anything else, and anything else is not going to have the same interface anyway.
  void connection<T>::on_interrupted()
  {
    assert(state.status == status_t::INTERRUPTED);
    if (state.timers.general.wait_expire)
A switch statement would be better here.
  void connection<T>::on_terminating()
  {
    assert(state.status == status_t::TERMINATING);
    if (state.timers.general.wait_expire)
A switch statement would be better here.
  socket.async_handshake(
    type,
    boost::asio::buffer(buffer),
    strand.wrap(on_handshake)
io_service::strand::wrap is no longer available in newer versions of Boost (>= 1.66).
I have run this code with boost versions >= 1.66, any idea?
Here is the relevant line in strand.hpp version 1.58, which provides the wrap method: https://github.com/boostorg/asio/blob/36f5fda32fd518714c00ae6fe23a352ed64a7f66/include/boost/asio/strand.hpp#L223
There is no such method in the commit tagged for Boost 1.66: https://github.com/boostorg/asio/blob/310f245a35a20af59a4cd0030626a4de23c7309c/include/boost/asio/strand.hpp
I don't know @selsta, you might have been running a compatibility library? Or the versioning system is off?
Could it be that you looked in the wrong file? https://github.com/boostorg/asio/blob/develop/include/boost/asio/io_context_strand.hpp has wrap, and io_service_strand.hpp just includes io_context_strand.hpp.
I just know that it compiles fine with the latest boost version without any compatibility library or whatever, so there's something else going on.
Figured it out: it's deprecated in favor of bind_executor.
We still support 1.59, and bind_executor isn't supported in that version, so we would have to add #if guards.
  @@ -472,12 +474,10 @@ bool ssl_options_t::has_fingerprint(boost::asio::ssl::verify_context &ctx) const
     return false;
   }

  -bool ssl_options_t::handshake(
  +void ssl_options_t::configure(
It's good that this was renamed to configure.
  timer_t deadline(io_context, timeout);

  struct state_t {
    lock_t lock;
Using a lock in asynchronous code is awkward. Asynchronous code touching the same object should be running from the same strand and shouldn't be executing concurrently, eliminating the need for locks.
Did you check the strand_deadlock test case? I'm quite sure some of these design decisions were intentional to solve this issue, together with setting -DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION.
But I also don't want to give false info, so maybe ignore this if it doesn't make sense.
Yes, I'm sure the author put the locks in there for a reason, due to the current design of handshake, but since this is a complete rewrite, it doesn't make sense to have code running asynchronously from a single executor AND have status booleans AND have it be manually, externally locked. Even if it functions correctly, this will likely gum up the io_context of socket waiting for locks, and it makes it a ton harder to track what's going on.
In this case it ensures synchronization between the handshake, which executes inside the strand, and the deadline timer, which handles a timeout sequence outside the strand in the event of a slow handshake. I'm having trouble seeing how it would be better constructed without a lock.
In other cases it ensures synchronization across the 2 separate strands.
  state.condition.wait_for(
    state.lock,
    std::chrono::milliseconds(30),
    [&]{
Using a stop_waiting predicate inside a while loop is redundant.
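For reference: the predicate overload already loops internally over spurious wake-ups, so the outer while adds nothing. A standalone sketch with std types (the PR uses its own lock/condition aliases):

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

std::mutex m;
std::condition_variable cv;
std::deque<int> queue;
constexpr std::size_t MAX_QUEUE = 1024;

bool wait_for_room()
{
    std::unique_lock<std::mutex> lock(m);
    // Returns false only if 30 ms elapsed with the predicate still false;
    // no surrounding while loop is needed.
    return cv.wait_for(lock, std::chrono::milliseconds(30),
                       []{ return queue.size() <= MAX_QUEUE; });
}
```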
Round 2
  socket.handshake(boost::asio::ssl::stream_base::client, ec);
  EXPECT_EQ(ec.value(), 0);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  while (server.get_config_shared()->get_connections_count() < 1);
Potentially infinite while loop.
This test expects this condition to be met, yet it should instead error out if the condition isn't met, since that indicates a failure of the code being run. I'd also appreciate a sleep statement being added to reduce CPU utilization while this condition is checked for. Same for the other infinite loop noted below this (and likely any others).
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  while (server.get_config_shared()->get_connections_count() < 1);
  server.get_config_shared()->del_in_connections(1);
  while (server.get_config_shared()->get_connections_count() > 0);
Potentially infinite while loop.
  using network_throttle_t = epee::net_utils::network_throttle;
  using network_throttle_manager_t = epee::net_utils::network_throttle_manager;

  unsigned int host_count(int delta = 0);
Even if it is just a single line, please document these private methods.
  @@ -86,6 +90,181 @@ namespace net_utils
    public i_service_endpoint,
    public connection_basic
  {
  private:
    using string_t = std::string;
    using handler_t = t_protocol_handler;
This alias can be confusing because of the use of the term "handler" with regard to asio.
  @@ -1006,6 +1006,7 @@ if(STATIC)
    set(Boost_USE_STATIC_RUNTIME ON)
  endif()
  find_package(Boost 1.58 QUIET REQUIRED COMPONENTS system filesystem thread date_time chrono regex serialization program_options locale)
  add_definitions(-DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION)
Not that I'm necessarily opposed to this change, but what is the rationale for switching to round-robin allocation of strands? Was there any evidence that deadlocks in the current code are related to non-round-robin allocation of strands?
The core of the PR's solution to the strand deadlock test case (test 3) is to use 2 strands (strand and strand_); a toy sketch follows below. It might be clearer to see how over in this code, where I tried to pull out the solution to just test 3.
If boost happens to randomly allocate the same strand implementation for both strands (boost randomly allocates from a pool of 193 by default), then a deadlock can resurface, since the writes won't be able to clear from the strand that handle_recv is posting to (the writes get stuck in the same strand behind the handle_recv = deadlock).
Allocating strands round robin avoids this because it should guarantee that both strands in each connection have distinct implementations allocated (since connections themselves are also only created 1 at a time).
If you try running his test 3 without this definition, you'll eventually run into a test failure at some point, which should be the collision case described above.
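A toy sketch of the two-strand layout described above (names mirror the comment; this is not the PR's code):

```cpp
#include <boost/asio.hpp>

int main()
{
    boost::asio::io_service io;
    boost::asio::io_service::strand strand(io);   // serializes socket reads/writes
    boost::asio::io_service::strand strand_(io);  // serializes protocol callbacks

    // handle_recv-style work runs on strand_; any writes it produces are
    // posted to the I/O strand instead of queueing behind the long callback.
    strand_.post([&]{
        strand.post([]{ /* issue async_write from the I/O strand */ });
    });

    io.run();
    return 0;
}
```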
    bool handshaked;
  };

  struct socket_t {
socket_t now refers to 2 types (this one and boost::asio::ip::tcp::socket) and is ambiguous.
    bool cancel_shutdown;
  };

  struct timer_t {
timer_t now refers to 2 types (this one and boost::asio::steady_timer) and is ambiguous.
    data_t data;
  };

  using status_t = typename state_t::status_t;
Aliasing nested user-defined types is not a design choice I think is easy on future maintainers/readers. If you want your new type to be scoped to another, do that. If you want it to be top-level, do that. But don't nest it and then alias it back up to the top, IMO.
      terminate();
    }
    else if (
      not epee::net_utils::is_ssl(
Like others have suggested, use ! instead of not.
Generally, I agree the style is non-conforming and it would be nicer if it conformed to existing conventions better. But I don't think that is a good blocker for a PR that solves reliability issues, especially one that (justifiably, imo) rewrites large sections to do so. The author has fair justification for the decisions made in the code (some reasoning explained via DM). As is, this PR is a clear improvement over the current code.
Further, in discussions with the author, and in some of the answers given here, this PR is a stepping stone to even more improvements. To that end, I do think this is in line with the desired piecemeal approach to solving issues in the daemon. This PR does indeed seem to be a smaller piece of a puzzle toward even larger improvements, even if it is fairly large in its own right.
I'll approve it once merge conflicts are resolved satisfactorily.
Separate note...
I strongly suggest anyone who wants to fully understand this code run the old code with the provided tests and observe where the failures occur, then review the 3 commits of this branch to understand the code in this PR that is fixing the issues, then read and step through this PR's code to understand how the code is working from a bigger picture perspective. Happy to answer questions/help make it clear what the issues are and what is being fixed. The PR author was also helpful to me in understanding this PR over DM, mainly after I made it clear I did step through the code myself and demonstrated I understood various problems it's solving.
Because of #8426, can this PR be closed?
Merged in #8426.