Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: fix back ping to discover healthy peers to connect to #8640

Merged
merged 1 commit into from
Jan 11, 2023

Conversation

j-berman
Copy link
Collaborator

Overview

When Node A receives an incoming connection from Node B, Node A will attempt to "back ping" Node B to see if Node B would be a healthy peer to connect to in the future. #8426 introduced a bug that causes Node A to fail to back ping Node B: Node A attempts to read Node B's response over an SSL stream, which errors out (looks like because Node A does not establish an SSL connection with Node B ... see more on this here).

The back ping seems to be a critical component of the p2p protocol for getting incoming connections (also see this comment). As such, I think fixing the back ping in this PR should help resolve or alleviate the lack of incoming peers issue reported in #8520 once enough nodes on the network start running it. To be clear, if you run this PR with your node, this PR wouldn't help you get incoming connections directly -- the nodes your node connects to must also run this update in order to help you get incoming connections.

The fix

The fix matches connect_async's logic to this synchronous call to connect(), ensuring that a node attempting to back ping its incoming connection will use the expected SSL setting (as the code is structured today, this specific m_ssl_support appears to always be disabled since SSL isn't supported for p2p connections).

How to manually test

Run 2 nodes, pointing one to the other. Start Node A normally with --log-level 2 and ensure that Node A can receive incoming connections. Start Node B with the flag --add-priority-node <Node A> and also make sure that Node B can receive incoming connections.

Once Node B initiates the connection to Node A, make sure that Node A successfully back pings Node B. In order to make sure this happens, grep Node A's logs for "PING SUCCESS" and you should see Node B's <IP:port number>.

If you try repeating the above test without this PR, there should be no "PING SUCCESS" in the logs. A lack of "PING SUCCESS" indicates that Node A does not add Node B to its white list, which means that neither Node A nor the rest of the network will know that Node B is a healthy peer to connect to right off the bat.

Note: I plan on writing an explicit test for this behavior to ensure we don't end up breaking back ping behavior again in the future.

@j-berman
Copy link
Collaborator Author

j-berman commented Nov 16, 2022

@selsta and I just ran the following test:

  1. I connected to a seed node @selsta has access to on testnet using the --add-priority-node flag. Neither of us were running this patch on our nodes. I confirmed that my node made outgoing connections to @selsta's node and to other nodes fine, but after 10 minutes, had 0 incoming connections.

  2. @selsta updated the seed node using this patch.

  3. I re-connected to @selsta's seed node. The back ping from the seed node to my node succeeded and I immediately started getting incoming connections.

@moneromooo-monero
Copy link
Collaborator

This implies that autodetecting SSL does not work, right ? If so, this seems to be the real problem, and disabling SSL just bodges it.

@j-berman
Copy link
Collaborator Author

I don't think it would make sense to call connect_async with SSL support set to autodetect here anyway because we don't use SSL for p2p. I figure disabling SSL support in this call to connect_async is a justified contained fix that prevents this particular back ping bug.

While I think a fix for autodetect in connect_async is probably also warranted, I don't think it being broken has any practical ramifications. I don't see where we should expect to make an outgoing connection with SSL support using the TCP server connection, but maybe I'm missing something there. If I run monerod --rpc-ssl=autodetect and monero-wallet-rpc --daemon-ssl=autodetect --rpc-ssl=autodetect, I have no issues even when connecting to the respective SSL-enabled backends, autodetect seems to work as expected.


To fix autodetect in connect_async, it seems we'd need to implement this same handshake sequence the synchronous connect() implements before calling start on the connection:

Synchronous connect():

auto try_connect_result = try_connect(new_connection_l, adr, port, sock_, remote_endpoint, bind_ip_to_use, conn_timeout, ssl_support);
if (try_connect_result == CONNECT_FAILURE)
return false;
if (ssl_support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect && try_connect_result == CONNECT_NO_SSL)
{
// we connected, but could not connect with SSL, try without
MERROR("SSL handshake failed on an autodetect connection, reconnecting without SSL");
new_connection_l->disable_ssl();
try_connect_result = try_connect(new_connection_l, adr, port, sock_, remote_endpoint, bind_ip_to_use, conn_timeout, epee::net_utils::ssl_support_t::e_ssl_support_disabled);
if (try_connect_result != CONNECT_SUCCESS)
return false;
}
// start adds the connection to the config object's list, so we don't need to have it locally anymore
connections_mutex.lock();
connections_.erase(new_connection_l);
connections_mutex.unlock();
bool r = new_connection_l->start(false, 1 < m_threads_count);

try_connect():

const ssl_support_t ssl_support = new_connection_l->get_ssl_support();
if (ssl_support == epee::net_utils::ssl_support_t::e_ssl_support_enabled || ssl_support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect)
{
// Handshake
MDEBUG("Handshaking SSL...");
if (!new_connection_l->handshake(boost::asio::ssl::stream_base::client))
{
if (ssl_support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect)
{
boost::system::error_code ignored_ec;
sock_.shutdown(boost::asio::ip::tcp::socket::shutdown_both, ignored_ec);
sock_.close();
return CONNECT_NO_SSL;
}
MERROR("SSL handshake failed");
if (sock_.is_open())
sock_.close();
return CONNECT_FAILURE;
}
}
return CONNECT_SUCCESS;

I believe the fact that connect_async doesn't do the above handshake sequence implies SSL is broken via connect_async. It doesn't look like mishandling autodetect was the primary issue here; it looks like it was just lucky that the back ping worked before the connection patch.


Circling back to my original point: I think this PR is still justified as is toward preventing the back ping because SSL should be disabled for p2p connections. I think it's worth a separate PR to make sure connect_async handles SSL correctly.

@j-berman
Copy link
Collaborator Author

j-berman commented Nov 17, 2022

Clarifying: when a wallet initiates an outbound connection to a daemon, SSL autodetect is handled correctly in this totally separate area of the code:

inline
try_connect_result_t try_connect(const std::string& addr, const std::string& port, std::chrono::milliseconds timeout)
{
m_deadline.expires_from_now(timeout);
boost::unique_future<boost::asio::ip::tcp::socket> connection = m_connector(addr, port, m_deadline);
for (;;)
{
m_io_service.reset();
m_io_service.run_one();
if (connection.is_ready())
break;
}
m_ssl_socket->next_layer() = connection.get();
m_deadline.cancel();
if (m_ssl_socket->next_layer().is_open())
{
m_connected = true;
m_deadline.expires_at(std::chrono::steady_clock::time_point::max());
// SSL Options
if (m_ssl_options.support == epee::net_utils::ssl_support_t::e_ssl_support_enabled || m_ssl_options.support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect)
{
if (!m_ssl_options.handshake(*m_ssl_socket, boost::asio::ssl::stream_base::client, {}, addr, timeout))
{
if (m_ssl_options.support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect)
{
boost::system::error_code ignored_ec;
m_ssl_socket->next_layer().shutdown(boost::asio::ip::tcp::socket::shutdown_both, ignored_ec);
m_ssl_socket->next_layer().close();
m_connected = false;
return CONNECT_NO_SSL;
}
else
{
MWARNING("Failed to establish SSL connection");
m_connected = false;
return CONNECT_FAILURE;
}
}
}
return CONNECT_SUCCESS;
}else
{
MWARNING("Some problems at connect, expected open socket");
return CONNECT_FAILURE;
}
}
inline
bool connect(const std::string& addr, const std::string& port, std::chrono::milliseconds timeout)
{
m_connected = false;
try
{
m_ssl_socket->next_layer().close();
// Set SSL options
// disable sslv2
m_ssl_socket.reset(new boost::asio::ssl::stream<boost::asio::ip::tcp::socket>(m_io_service, m_ctx));
// Get a list of endpoints corresponding to the server name.
try_connect_result_t try_connect_result = try_connect(addr, port, timeout);
if (try_connect_result == CONNECT_FAILURE)
return false;
if (m_ssl_options.support == epee::net_utils::ssl_support_t::e_ssl_support_autodetect)
{
if (try_connect_result == CONNECT_NO_SSL)
{
MERROR("SSL handshake failed on an autodetect connection, reconnecting without SSL");
m_ssl_options.support = epee::net_utils::ssl_support_t::e_ssl_support_disabled;
if (try_connect(addr, port, timeout) != CONNECT_SUCCESS)
return false;
}
}
}
catch(const boost::system::system_error& er)
{
MDEBUG("Some problems at connect, message: " << er.what());
return false;
}
catch(...)
{
MDEBUG("Some fatal problems.");
return false;
}
return true;
}

I don't see how a "server" would ever initiate an outbound SSL connection using the boosted_tcp_server connection anywhere. It seems the reason there is what looks like unused SSL code for a server initiating the outbound connect (and connect_async) is that the initial PR to add SSL support originally supported SSL for p2p, which was then removed: #4054 (comment). It looks like code that was left in there by accident. Which would mean fixing connect_async to correctly handle autodetect (and SSL) would just be fixing unused code.

And circling back again: regardless, I'm not seeing why connect_async should be called here in this p2p handshake with SSL autodetect as is. I still think this PR is a correct solution in its own right

@vtnerd
Copy link
Contributor

vtnerd commented Nov 17, 2022

And FWIW I read the backlog of discussion here, and had my cup of coffee.

The connect_async function still needs to be fixed for autodetect. This small patch is better for now because its simpler, and we want this behavior even with a fixed connect_async. Defaulting to autodetect is "noisy", has wasted RTT, etc.

@vtnerd
Copy link
Contributor

vtnerd commented Nov 17, 2022

@j-berman one thing for maintenance - should the parameter positions be flipped? The string default here, "0.0.0.0", might be problematic during maintenance and changes to the function. Only small nitpick - I think this needs to be merged sooner rather than later.

Copy link
Contributor

@vtnerd vtnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. Wrong PR.

@j-berman
Copy link
Collaborator Author

j-berman commented Nov 17, 2022

@vtnerd agree with the above :) Will submit a new maintenance PR with SSL fixed in connect_async+ parameter positions flipped

EDIT: I might have misunderstood that you were suggesting the param change for this PR. I also think this PR needs to be merged sooner rather than later, so would prefer to keep this PR as small and simple as possible, and will do the maintenance PR separately

@moneromooo-monero
Copy link
Collaborator

Sounds good.

@jeffro256
Copy link
Contributor

Does a node ever attempt to back-ping on a loopback address?

@j-berman
Copy link
Collaborator Author

@jeffro256

monero/src/p2p/net_node.inl

Lines 2352 to 2353 in 365fd45

if(!zone.m_peerlist.is_host_allowed(context.m_remote_address))
return false;

bool peerlist_manager::is_host_allowed(const epee::net_utils::network_address &address)
{
//never allow loopback ip
if(address.is_loopback())
return false;

@jeffro256
Copy link
Contributor

@j-berman That's exactly what I was looking for, thank you! I was looking for the reason why the back-ping test wasn't working locally.

Copy link
Contributor

@jeffro256 jeffro256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to confirm this patch fixed the node so that back pings work 👍🏻

@luigi1111 luigi1111 merged commit ac8580c into monero-project:master Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants