
connection: fix implementation #7760

Closed

Conversation

perfect-daemon

No description provided.

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Please provide more information: is this for a performance increase, code clarity, or something else? This is going to be a big review for everyone else.

@perfect-daemon
Author

In the worst case it's just an approximation of a fix for all the added tests and some other known problems.

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Can you at least tell me (even privately) what bugs this is fixing? Otherwise this is possibly a re-write for clarity's sake, bordering on a re-write to meet your favored style of callbacks.

Granted, I prefer asio coroutines or this style over the current style, but I'm not sure it's worth a giant re-write. The review and testing for this is going to take a while because it impacts the p2p code too.

@perfect-daemon
Author

"Can you at least tell me (even privately) what bugs this is fixing?" at least those 3 that are triggered by added tests

@perfect-daemon
Author

Did you try to run the added tests from the first commit?

@vtnerd
Contributor

vtnerd commented Jun 22, 2021

Did you try to run the added tests from the first commit?

Yes, I can read the tests, but was a re-write the best way to fix these issues? Can the code changes be done piecemeal so that more reviewers can reasonably comment on this code over time? You've provided little auxiliary justification for these changes. The PR is basically asserting that your rewrite of the TCP server code, all at once, is worthwhile.

I'm going to assume by your sparse comments that you insist that an all-at-once re-write is necessary - which is likely to delay merging due to fewer reviewers. My TCP shutdown code is still waiting to be merged, and it's much smaller in scope.

@perfect-daemon
Author

It fixes everything that is possible without changing the public interface of connection.
There are many other problems that will require even bigger changes.
It doesn't even fix all problems within connection, since the rest of the code relies on some internal ugliness of connection.

@perfect-daemon
Author

I'm touching more at once since there are interconnected problems that require such a big scope.

@perfect-daemon
Author

perfect-daemon commented Jun 22, 2021

I'm going to assume by your sparse comments that you insist that an all-at-once re-write is necessary - which is likely to delay merging due to fewer reviewers. My TCP shutdown code is still waiting to be merged, and it's much smaller in scope.

That ssl shutdown patch doesn't fix the segfault within the ssl handshake or the other problems located within connection.

Contributor

@vtnerd left a comment

That ssl shutdown patch doesn't fix the segfault within the ssl handshake or the other problems located within connection.

It was supposed to address one issue, not claim to fix an unspecified number of things in a rewrite. FWIW, I think this is going to take multiple reviews to hammer out because it's likely just swapping one set of issues for another.

As an example, every outgoing write now appears to have a randomized wait_for call, which should have a measurable performance hit. And the following wait_for call is still a DoS as the thread is blocked (but not busy spin looping at least). So it's worse performance for ... ?

This was a light skim through, not really a review.


bool wait_read;
bool handle_read;
bool cancel_read;
Contributor

All of these cancel_ booleans are unnecessary; the standard technique is checking ec == boost::asio::error::operation_aborted (nearly every example uses this technique). This method is preferable because it catches all cancellations, even ones added in the future (which are currently only caught if the flag is set to true).
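For reference, a minimal sketch (not taken from this PR) of the error-code pattern being suggested, where the completion handler inspects the error_code instead of a cancel_ flag:

    #include <boost/asio.hpp>
    #include <iostream>

    int main()
    {
      boost::asio::io_context io;
      boost::asio::steady_timer timer(io, std::chrono::seconds(10));

      timer.async_wait([](const boost::system::error_code& ec) {
        if (ec == boost::asio::error::operation_aborted)
          return;                        // cancelled via cancel()/expires_after() - nothing to do
        if (ec)
          return;                        // some other failure
        std::cout << "timeout fired\n";  // genuine expiry
      });

      timer.cancel();  // the pending handler now runs with operation_aborted
      io.run();
    }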

Contributor

Repeating this again since it got lost in the back-and-forth, but the cancel_ booleans are not necessary. Requesting their removal is not "an optimization request" - if you are going to re-write the entirety of the connection code then don't leave new things for others to remove and deal with.

Collaborator

Each boolean ensures that no function is called more times than necessary (each action on the socket executes one at a time, as needed). The (seemingly bloated) boolean defense provides explicit synchronization, offering complete control over the execution flow without relying on boost internals, which, as evidenced by the seemingly segfault-prone handshake, can lead to screw-ups. Error codes are still checked where needed AFAICT (for example, here).

Perhaps the booleans go against the standard way of using boost ASIO, and perhaps this approach could be considered less future-proof as is, but given the issues this PR solves as constructed (and that the PR author hopes to solve and optimize further by building from this PR), I'd say it's a clear improvement over the current code that is worth shipping as is. He has also provided a defense for the decision to include them, and the bar is high enough for me to warrant inclusion given the scope of the PR's aims.

if (state.data.write.wait_consume ||
state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT
)
state.condition.wait_for(state.lock, random_delay(),
Contributor

This wait_for is going to need some justification, in the comments, and with some performance testing. Or just any explanation at all. Because this is slowing down all outgoing writes by an arbitrary amount of time.

Contributor

Wait, this was inside an if statement? Ugh, sorry, can we fix the style of this at least?

Contributor

OK, I see why this was duplicated, but there has to be a simpler, more concise technique. And seriously, {} this!

Author

Updated

state.data.write.wait_consume = true;
while (!message.empty()) {
if (state.data.write.queue.size() > ABSTRACT_SERVER_SEND_QUE_MAX_COUNT) {
state.condition.wait_for(state.lock, random_delay(),
Contributor

This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).

Author

Both connection and sender should be asynchronous to remove this problem.
It's partially solved now, as much as possible without changing the rest of the code.

Author

This queue will be removed from connection later, once everything else is asynchronous.

Contributor

You can't remove a queue, or at least you shouldn't, because posting to a strand is still a queue.

AFAIK there are 3 choices: (1) close connection (drop data), (2) block a thread to wait, or (3) unbounded queue. I think this needs to transition to (1) if an entire re-write is being done.
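As a hedged sketch of option (1), independent of this PR's actual interface (all names and the limit below are illustrative), the send path can refuse new data instead of blocking the calling thread:

    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <string>

    struct send_queue
    {
      static constexpr std::size_t MAX_QUEUED_MESSAGES = 100;

      // Returns false when the reader is too far behind; the caller should then
      // drop the data and/or close the connection rather than block.
      bool push(std::string message)
      {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.size() >= MAX_QUEUED_MESSAGES)
          return false;                        // drop instead of blocking
        queue_.push_back(std::move(message));
        return true;
      }

      std::mutex mutex_;
      std::deque<std::string> queue_;
    };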

Author

I can, I will, and it's needed. But it isn't easy.

Contributor

Asynchronous generator? You mean asynchronously generating bytes piecemeal? That's just a more complicated outgoing buffer. This has to be a drop or a block.

Author

What are the sources of data for the outgoing queue?

  1. Synchronous forwarding of notifications with txs/blocks, which can be made asynchronous with fetching from local blockchain storage.
  2. Requests/responses required by explicit synchronization, which can be generated asynchronously as soon as the outgoing/incoming fixed-size buffer is free.

In general, the daemon sends and receives only verified data that is stored/loaded to/from the blockchain db.
This is the only available unbounded buffer to store data.

Contributor

I already know exactly what you are proposing, and I am pointing out that some p2p messages require a response, and a queue of asynchronous events is still a queue. The commentary by the ZeroMQ developers on Pub/Sub seems appropriate here - if the read side is behind, eventually you have to drop data and/or the connection.

All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy-waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.

Author

This is still a DoS case - the thread is still blocked. This should be changed to a drop (this is one of the reasons why I am pushing for a piecemeal approach ...).

I didn't understand that you were talking about the action after a failed conditional sleep.
I've added the missing terminate(); return false.

Collaborator

All of these changes don't really change the behavior of the existing problem. Multiple threads can get stuck busy-waiting for a queue to clear. The previous method had the threads waiting indefinitely at a lock, whereas here they wait indefinitely on condition variables.

AFAICT this approach is a marked improvement over the current code:

  1. Reads/writes are no longer waiting on handle_recv or callbacks to complete (which execute in the newly introduced 2nd strand_), therefore they can process quicker and clear from the strand quicker.
  2. It uses a conditional wait that can unblock as soon as the queue is below the max.

Without a demonstrable DoS on this new code as is, which is evidently harder to DoS, I think the code is mergeable as is, and worth continuing to improve, which the PR author intends to do building from the structure of this PR's code.

Additionally, there is a risk in dropping the connection here, which is that it can feasibly harm honest usage of the daemon. It's going to take more work to dig deeper and make this entire section airtight to hit the optimal sweet spot imo. For now, this code is an improvement moving in a better direction.

state.timers.throttle.out.wait_expire = false;
if (state.timers.throttle.out.cancel_expire) {
state.timers.throttle.out.cancel_expire = false;
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

if (state.socket.cancel_write) {
state.socket.cancel_write = false;
state.data.write.queue.clear();
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

state.socket.wait_shutdown = false;
if (state.socket.cancel_shutdown) {
state.socket.cancel_shutdown = false;
if (state.status == status_t::RUNNING)
Contributor

Use a switch statement here.

template<typename T>
void connection<T>::start_read()
{
if (state.timers.throttle.in.wait_expire || state.socket.wait_read ||
Contributor

wait_read is only necessary because the callbacks invoke start_read ... again. This is unnecessary if the callback were a local struct that contains only the inner looping read calls. I think this is preferable because it separates the start and looping code, whereas this mixes the two. It also allows for fewer state machines.

Author

It will be needed later when everything is asynchronous, since the reader may not consume every received chunk immediately and it's possible that there are no active callbacks at all.

Contributor

The flag is only needed because start_read is both starting and continuing the reads. It is possible to separate the two, removing the need for the flag. After every async read (and subsequent processing), the continue function is called.
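A hedged, self-contained sketch of that separation (class and member names are illustrative, not the PR's code): start() is called once, and only the completion handler re-arms the next read, so no wait_read flag is needed.

    #include <boost/asio.hpp>
    #include <array>
    #include <cstddef>
    #include <memory>

    class reader : public std::enable_shared_from_this<reader>
    {
    public:
      explicit reader(boost::asio::ip::tcp::socket socket) : socket_(std::move(socket)) {}

      void start() { continue_read(); }       // called exactly once by the owner

    private:
      void continue_read()                    // re-armed only from its own completion handler
      {
        auto self = shared_from_this();       // keep the object alive across the async wait
        socket_.async_read_some(
          boost::asio::buffer(buffer_),
          [this, self](const boost::system::error_code& ec, std::size_t bytes)
          {
            if (ec == boost::asio::error::operation_aborted)
              return;                         // cancelled: the loop simply stops
            if (ec)
              return;                         // real error: let the owner tear down
            (void)bytes;                      // hand the received chunk to the protocol handler here
            continue_read();                  // arm the next read
          });
      }

      boost::asio::ip::tcp::socket socket_;
      std::array<char, 4096> buffer_;
    };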

Author

I didn't try to optimize the number of states / transitions.
It can be done later, unless it's needed for a significant performance increase or is related to the correctness of the code.

Contributor

This is just my personal opinion obviously, but -

if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states / transitions possible. Otherwise, what is the purpose of the re-write? I'm being critical because if this were done piecemeal, the code could've looked similar in the end to this PR, but the unnecessary code/states could have been removed more easily along the way during review. Removing the unnecessary bits makes it easier to maintain and debug.

As an example, the cancel_ state flags have to be removed, unless there is some justification why the standard way of checking for cancellation is inappropriate (none was provided above in the other comment thread). And this wait_read (that started this thread) doesn't appear to have any justification (yet), given how ASIO callbacks are typically done.

Author

... Removing the unnecessary bits makes it easier to maintain and debug.

It depends.

Author

if you want to do a re-write of the entire connection code, then it should be "optimized" with the fewest number of states / transitions possible. Otherwise, what is the purpose of the re-write?

The main goal is to fix known problems and prepare the code for design changes, which can't be done in the case of greedy early optimization of every component.

Contributor

These explicit flags are needed to not depend on internal behaviour of boost asio, which doesn't allow cancelling an already dispatched callback.

I've thought about this case before, and typically it's not worth stressing about. In the read/write callbacks, either reset the timer immediately or call cancel immediately. The timer will be in the strand queue with the appropriate error_code.

On the off chance that the timer was queued just after the other event but before the cancel could occur ... the timer was so close to executing that it's not worth the extra state variable and logic. Let ASIO handle all of the state tracking because it's better "defensive coding" - the error_code cancel check handles every future update to how the timer is used.

It depends.

Just chop out these unnecessary states. There's a giant if ladder that can be chopped down, for instance.

The main goal is to fix known problems and prepare the code for design changes, which can't be done in the case of greedy early optimization of every component.

I dislike your usage of the word optimization in this context because it makes it sound like I am asking for a performance comparison. This is about making the code more concise and legible - particularly the start_read function.

Author

"concise - short and clear, expressing what needs to be said without unnecessary words"
"optimization - the process of making something as good or effective as possible"
Non-programming language is ambiguous enough to allow misinterpretation of anything so If current code is correct then any further changes can be treated as optimization.

This comment was marked as duplicate.

@perfect-daemon force-pushed the pr006_master branch 11 times, most recently from 07c5db3 to 184dedb on June 26, 2021 12:06
@@ -86,6 +90,181 @@ namespace net_utils
public i_service_endpoint,
public connection_basic
{
private:
using string_t = std::string;
Contributor

This "alias everything" style breaks from existing coding conventions, and it would preferable to revert to something closer to the existing style.

Contributor

+1 on this. Things like std::string or std::mutex shouldn't be renamed at all to avoid confusion. Long epee and boost names can probably be renamed.


bool wait_read;
bool handle_read;
bool cancel_read;
Contributor

Repeating this again since it got lost in the back-and-forth, but the cancel_ booleans are not necessary. Requesting their removal is not "an optimization request" - if you are going to re-write the entirety of the connection code then don't leave new things for others to remove and deal with.

const std::string& host,
std::chrono::milliseconds timeout) const
{
configure(socket, type, host);
Contributor

There's no reason to include these SSL changes with the connection re-write. Include these changes in a new PR with an explanation as to why they're being changed.

And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.

Collaborator

It solves the segfault in the handshake test case, and it does it in a manner that's really similar to the rest of the PR. I'd argue it's actually significantly more readable/easier to grok than the rest of the PR since it's fairly contained and isolated. It helped me as I got started reviewing the rest of the PR because of this.

I agree still that this can probably be a separate PR with just this commit or perhaps even fewer changes, but I'm not hugely opposed to it being here as is/don't consider it a PR blocker.


And if the issue being fixed is the racy ec = ..., this can be fixed with fewer changes. All that was needed is an std::atomic_flag to signal the change (a safe "happens before/after" synchronization); these other changes are overkill.

The primary fix here ensures that upon successful completion of the handshake, the cancel_timer boolean prevents the deadline timer from executing an erroneous close sequence in the timeout handler, which also demonstrates the implausibility of the test's segfault.

Perhaps there is a cleaner more standard solution to this with fewer changes as you mention, but it's a pretty clean section of code that seems to do the job as is I would say.

template<typename T>
void connection<T>::start_read()
{
if (state.timers.throttle.in.wait_expire || state.socket.wait_read ||

This comment was marked as duplicate.

Collaborator

@j-berman left a comment

Working through tests 2 and 3 still. Question on test 1...

};
strand.post(
[this, self, on_shutdown]{
connection_basic::socket_.async_shutdown(
Collaborator

@j-berman May 26, 2022

I notice that test 1 doesn't get in here; instead deleting a connection via the daemon invokes socket_.next_layer().shutdown inside on_terminating. Would this possibly leave the session vulnerable to a truncation attack?

I'm still building up my knowledge of the ssl protocol in general, but figure you'd have the experience to know definitively if this was an oversight and if this would be the case. Why call shutdown on the next layer in on_terminating and not on the higher-level socket like you do here? What makes that case different from this one? Hard to find clear answers on this on the internet (and in the code).


timers_t timers;
protocol_t protocol;
stat_t stat;
data_t data;
Contributor

All these class members should have an m_ prefix in their names - take src/cryptonote_basic/connection_context.h as an example of the naming convention. This is to avoid confusion with local variables in the class methods.

bool local{};
string_t host{};
state_t state{};
handler_t handler;
Contributor

Same, m_ prefix here and in other classes (structs are fine).

{
static lock_t hosts_mutex;
lock_guard_t guard(hosts_mutex);
static std::map<string_t, unsigned int> hosts;
Contributor

hosts_mutex and hosts should be moved to the connection class as static members to avoid generating initialization code inside the function. Also, this function doesn't remove an entry from hosts when val is 0, so hosts can grow indefinitely and consume all memory if an attacker exploits this.
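A hedged sketch of the erase-on-zero part of that suggestion (the update line is illustrative, not the PR's exact code):

    unsigned int count = (hosts[host] += delta);
    if (count == 0)
        hosts.erase(host);   // drop dead entries so the map cannot grow without bound
    return count;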

Contributor

@SChernykh May 30, 2022

Also, if delta is 0 (which is the common use of this function), it should have code like this

if (!delta) {
    auto it = hosts.find(host);
    return (it != hosts.end()) ? it->second : 0;
}

to avoid using operator[] which modifies the map and is slower.

Collaborator

Agree, this seems like a good suggestion. The code seems to already have this issue though; the PR author just moved the code so it showed up in the diff, FWIW. I'd say it's worth a separate PR.

return GET_IO_SERVICE(socket());
if (state.timers.general.wait_expire)
return;
state.timers.general.wait_expire = true;
Contributor

This sequence is not thread-safe. Should it be protected by state.lock? Maybe making wait_expire atomic and

    if (state.timers.general.wait_expire.exchange(true))
        return;

will be enough?

_dbg2("[" << print_connection_context_short(context) << "] fired_callback");
m_protocol_handler.handle_qued_callback();
CATCH_ENTRY_L0("connection<t_protocol_handler>::call_back_starter()", void());
if (not state.timers.general.wait_expire)
Contributor

This is very rarely used in C++ and it's definitely not used in other parts of the Monero codebase. Please change it to if (!state.timers.general.wait_expire).

""
);
auto self = connection<T>::shared_from_this();
if (not state.ssl.forced and not state.ssl.detected) {
Contributor

Same here, use normal C++ operator syntax (!, &&).

boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred)));
//_info("[sock " << socket().native_handle() << "]Async read requested.");
state.socket.wait_handshake = true;
Contributor

This is separated from the if (state.socket.wait_handshake) check and again not thread-safe. It should be done as an atomic operation and in one place, either here or at the beginning of the function.

Collaborator

AFAICT the lock should be held by the caller of the function here and in the other places.

{
if (state.timers.throttle.out.wait_expire || state.socket.wait_write ||
state.data.write.queue.empty() ||
(state.ssl.enabled && not state.ssl.handshaked)
Contributor

Here you even mix && with not. Please use normal C++ syntax.

if (state.socket.wait_shutdown)
return;
auto self = connection<T>::shared_from_this();
state.socket.wait_shutdown = true;
Contributor

Again, the same non-thread-safe pattern.

shutdown();
return false;
engine rng(seed);
return std::chrono::milliseconds(
Contributor

@SChernykh May 30, 2022

This is just plain stupid nonsense. The RNG must be initialized once and then used, not seeded from std::random_device every time (~2500 bytes!!!). At this stage, you could've just used std::random_device directly. This implementation will deplete the system's entropy very quickly.

Contributor

Yes, seeding the full state (2500 bytes) from std::random_device is totally ok and even good if you do it once. But this code must be rewritten.
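A hedged sketch of the once-per-thread seeding being asked for (the delay bound is illustrative):

    #include <chrono>
    #include <random>

    std::chrono::milliseconds random_delay()
    {
      // Seeded from std::random_device exactly once per thread, then reused.
      thread_local std::mt19937 rng{std::random_device{}()};
      std::uniform_int_distribution<int> dist(0, 100);
      return std::chrono::milliseconds(dist(rng));
    }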

Contributor

This implementation will deplete system's entropy very quickly.

That's indeed bad for a crypto project like Monero.

Contributor

It turns out this code has been there for a long time. Still, since this PR is almost a full rewrite, it's better to fix it like in #8365.

Contributor

I agree. Large rewrites aren't tolerated here from anybody, and for valid reasons.

Plus we can't use double standards.

@mj-xmr
Contributor

mj-xmr commented May 30, 2022

Please fix the merge conflicts first and allow the new code to pass the unchanged, previously existing tests.

There are possible exceptions to this rule of not touching the tests, but only if a given failing test is proven to be overfitted to the previous solution. But it's the contributor who needs to prove this fact.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

Can you at least tell me (even privately) what bugs this is fixing? Otherwise this is possibly a re-write for clarity's sake, bordering on a re-write to meet your favored style of callbacks.

Granted, I prefer asio coroutines or this style over the current style, but I'm not sure it's worth a giant re-write. The review and testing for this is going to take a while because it impacts the p2p code too.

-- vtnerd

"Can you at least tell me (even privately) what bugs this is fixing?" at least those 3 that are triggered by added tests

-- perfect-daemon

@perfect-daemon
I'd like to learn more about Monero's vulnerability process.

  • Regarding the zero-day vulnerabilities that you helped uncover for bad actors, especially through your unit tests, and have kept open to the public for a year already: were they addressed through Monero's Vulnerability Response Process?
  • Can you explain why the branch is not merged yet after almost a whole year? Have you done everything you could in order to satisfy your reviewers' requests?
  • If you haven't done enough, do you need help, maybe?

@selsta
Collaborator

selsta commented Jun 2, 2022

@mj-xmr You claimed you are involved with reviewing this patch but so far I only see you trolling. Not helpful.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

@selsta My review is so shallow because I'm stuffed with so much other work (*). But if you'd like me to help, I'd gladly create my own version based on the unit tests written by @perfect-daemon. This would mean that I have to switch my priorities and focus on the discovered vulnerabilities. No problem.

I'm also already trying to support @SChernykh in his attempt to do the same, like here: #8365. The fact that it turned out to be a dead end only speaks for him, that he can think self-critically, and not that I'm not helping.

(*) Currently I'm preparing a build-time testing framework for you, so that you can more objectively verify the compilation speedups that I do.

@selsta
Collaborator

selsta commented Jun 2, 2022

@mj-xmr No, I'd prefer if you don't pollute this PR (and my email) with troll comments.

@mj-xmr
Contributor

mj-xmr commented Jun 2, 2022

I'll try anyway.

Contributor

@jeffro256 left a comment

review round one (mostly net_ssl)

@@ -344,6 +351,7 @@ TEST(test_epee_connection, test_lifetime)
}
ASSERT_TRUE(shared_state->get_connections_count() == 1);
shared_state->del_out_connections(1);
while (shared_state->sock_count);
Contributor

This infinite loop is really not great. Give it a max timeout (using e.g. usleep) and then fail the test if it takes too long.
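Something along these lines, as a hedged sketch (the constants and sleep interval are illustrative):

    constexpr int MAX_WAIT_MS = 5000;
    int waited_ms = 0;
    while (shared_state->sock_count) {
      ASSERT_LT(waited_ms, MAX_WAIT_MS) << "sockets did not close in time";
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      waited_ms += 10;
    }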

@@ -454,7 +462,11 @@ TEST(test_epee_connection, test_lifetime)
}
for (;workers.size(); workers.pop_back())
workers.back().join();

while (server.get_connections_count()) {
Contributor

Same here

@@ -278,6 +280,11 @@ TEST(test_epee_connection, test_lifetime)
ASSERT_TRUE(shared_state->get_connections_count() == 0);
constexpr auto DELAY = 30;
constexpr auto TIMEOUT = 1;
while (server.get_connections_count()) {
Contributor

Another (potentially) infinite loop in the test suite.

auto start_handshake = [&]{
using ec_t = boost::system::error_code;
using timer_t = boost::asio::steady_timer;
using strand_t = boost::asio::io_service::strand;
Contributor

Again with the type aliasing. I think at a certain point it starts to really hinder readability. Also, what other type would we use for strand_t? It's such a specialized type that it wouldn't make sense to use anything else, and anything else is not going to have the same interface anyway.

void connection<T>::on_interrupted()
{
assert(state.status == status_t::INTERRUPTED);
if (state.timers.general.wait_expire)
Contributor

switch statement would be better here

void connection<T>::on_terminating()
{
assert(state.status == status_t::TERMINATING);
if (state.timers.general.wait_expire)
Contributor

switch statement would be better here

socket.async_handshake(
type,
boost::asio::buffer(buffer),
strand.wrap(on_handshake)
Contributor

io_service::strand::wrap is no longer available in newer versions of Boost (>= 1.66).

Collaborator

I have run this code with boost versions >= 1.66, any idea?


Contributor

I don't know @selsta, you might have been running a compatibility library? Or the versioning system is off?

Collaborator

@selsta Jun 2, 2022

Could it be that you looked in the wrong file?

https://github.com/boostorg/asio/blob/develop/include/boost/asio/io_context_strand.hpp has wrap and io_service_strand.hpp just includes io_context_strand.hpp.

I just know that it compiles fine with the latest boost version without any compatibility library or whatever so there's something else going on.

Contributor

@jeffro256 Jun 2, 2022

Figured it out: it's deprecated in favor of bind_executor.

Collaborator

@selsta Jun 2, 2022

We still support 1.59, and bind_executor isn't supported in that version, so we would have to add an #if.
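One possible shape for that #if, as a hedged sketch (the helper name is illustrative, this isn't code from the PR, and it assumes C++14 plus the still-enabled io_service alias):

    #include <boost/version.hpp>
    #include <boost/asio.hpp>
    #include <utility>

    template <typename Handler>
    auto wrap_in_strand(boost::asio::io_service::strand& strand, Handler&& handler)
    {
    #if BOOST_VERSION >= 106600
      return boost::asio::bind_executor(strand, std::forward<Handler>(handler));
    #else
      return strand.wrap(std::forward<Handler>(handler));
    #endif
    }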

@@ -472,12 +474,10 @@ bool ssl_options_t::has_fingerprint(boost::asio::ssl::verify_context &ctx) const
return false;
}

bool ssl_options_t::handshake(
void ssl_options_t::configure(
Contributor

It's good that this is renamed to configure.

timer_t deadline(io_context, timeout);

struct state_t {
lock_t lock;
Contributor

Using a lock in asynchronous code is awkward. Asynchronous code touching the same object should be running from the same strand and shouldn't be executing concurrently, eliminating the need for locks.

Collaborator

Did you check the strand_deadlock test case? I'm quite sure some of these design decisions were intentional to solve this issue, together with setting -DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION.

But I also don't want to give false info, so maybe ignore this if it doesn't make sense.

Contributor

@jeffro256 Jun 2, 2022

Yes, I'm sure the author put the locks in there for a reason, due to the current design of handshake, but since this is a complete rewrite, it doesn't make sense to have code running asynchronously from a single executor AND have status booleans AND have it be manually, externally locked. Even if it functions correctly, this will likely gum up the io_context of the socket waiting for locks, and it makes it a ton harder to track what's going on.

Collaborator

In this case it ensures synchronization between the handshake, which executes inside the strand, and the deadline timer, which handles a timeout sequence outside the strand in the event of a slow handshake. I'm having trouble seeing how it would be better constructed without a lock.

In other cases it ensures synchronization across the 2 separate strands.

state.condition.wait_for(
state.lock,
std::chrono::milliseconds(30),
[&]{
Contributor

Using a stop_waiting predicate inside a while loop is redundant

Contributor

@jeffro256 left a comment

Round 2

socket.handshake(boost::asio::ssl::stream_base::client, ec);
EXPECT_EQ(ec.value(), 0);
std::this_thread::sleep_for(std::chrono::milliseconds(100));
while (server.get_config_shared()->get_connections_count() < 1);
Contributor

potentially infinite while loop


This test expects this condition to be met, yet should instead error if this condition isn't, as this condition not being met is a failure of the code being run. I'd also appreciate a sleep statement being added to reduce CPU utilization while this condition is checked for. Same for the other infinite loop noted below this (and likely any others).

std::this_thread::sleep_for(std::chrono::milliseconds(100));
while (server.get_config_shared()->get_connections_count() < 1);
server.get_config_shared()->del_in_connections(1);
while (server.get_config_shared()->get_connections_count() > 0);
Contributor

potentially infinite while loop

using network_throttle_t = epee::net_utils::network_throttle;
using network_throttle_manager_t = epee::net_utils::network_throttle_manager;

unsigned int host_count(int delta = 0);
Contributor

Even if it is just a single line, please document these private methods.

@@ -86,6 +90,181 @@ namespace net_utils
public i_service_endpoint,
public connection_basic
{
private:
using string_t = std::string;
using handler_t = t_protocol_handler;
Contributor

This alias can be confusing because of the use of the term "handler" with regard to asio.

@@ -1006,6 +1006,7 @@ if(STATIC)
set(Boost_USE_STATIC_RUNTIME ON)
endif()
find_package(Boost 1.58 QUIET REQUIRED COMPONENTS system filesystem thread date_time chrono regex serialization program_options locale)
add_definitions(-DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION)
Contributor

Not that I'm necessarily opposed to this change, but what is the rationale for switching to round-robin allocation of strands? Was there any evidence that deadlocks with the current code are related to non-round-robin allocation of strands?

Collaborator

The core of the PR's solution to the strand deadlock test case (test 3) is to use 2 strands (strand and strand_). Might be clearer to see how over in this code where I tried to pull out the solution to just test 3.

If boost happens to randomly allocate the same strand implementation for both strands (boost randomly allocates from a pool of 193 by default), then a deadlock can resurface, since the writes won't be able to clear from the strand that handle_recv is posting to (the writes get stuck in the same strand behind the handle_recv = deadlock).

Allocating strands round-robin avoids this because it should guarantee that both strands in each connection will have distinct implementations allocated (since connections themselves are also only created one at a time).

If you try running his test 3 without this definition, you'll eventually run into a test failure at some point, which should be the collision case described above.

bool handshaked;
};

struct socket_t {
Contributor

socket_t now refers to 2 types: this one and boost::asio::ip::tcp::socket, and is ambiguous.

bool cancel_shutdown;
};

struct timer_t {
Contributor

timer_t now refers to 2 types: this one and boost::asio::steady_timer, and is ambiguous.

data_t data;
};

using status_t = typename state_t::status_t;
Contributor

Aliasing nested user-defined types is not a design choice I think is easy on future maintainers/readers. If you want your new type to be scoped to another, do that. If you want it to be top-level, do that. But don't nest it and then alias it back up to the top, IMO.

terminate();
}
else if (
not epee::net_utils::is_ssl(
Contributor

Like others have suggested, use ! instead of not.

Collaborator

@j-berman left a comment

Generally, I agree the style is non-conforming and it would be nicer if it conformed to existing conventions better. But I don't think that is a good blocker for a PR that solves reliability issues, especially one that (justifiably, imo) rewrites large sections to do so. The author has fair justification for the decisions made in the code (some reasoning explained via DM). As is, this PR is a clear improvement over the current code.

Further, in discussions with the author, and in some of the answers given here, this PR is a stepping stone to even more improvements. To that end, I do think this is in line with the desired piecemeal approach to solving issues in the daemon. This PR does indeed seem to be a smaller piece of a puzzle toward even larger improvements, even if it is fairly large in its own right.

I'll approve it once merge conflicts are resolved satisfactorily.


Separate note...

I strongly suggest anyone who wants to fully understand this code run the old code with the provided tests and observe where the failures occur, then review the 3 commits of this branch to understand the code in this PR that is fixing the issues, then read and step through this PR's code to understand how the code is working from a bigger picture perspective. Happy to answer questions/help make it clear what the issues are and what is being fixed. The PR author was also helpful to me in understanding this PR over DM, mainly after I made it clear I did step through the code myself and demonstrated I understood various problems it's solving.


@Gingeropolous
Collaborator

Because of #8426, can this PR be closed?

@selsta
Collaborator

selsta commented Jul 2, 2023

Merged in #8426

@selsta closed this Jul 2, 2023