Fix regression w.r.t. reporting of dial errors. #1493
Conversation
PR [#1440] introduced a regression w.r.t. the reporting of dial errors. In particular, if a connection attempt failed due to an invalid remote peer ID, any remaining addresses for the same peer were not tried (intentional), but the dial failure was also not reported to the behaviour, causing e.g. libp2p-kad queries to potentially stall.

In hindsight, I figured it is better to preserve the previous behaviour of still trying the peer's alternative addresses even after an invalid peer ID error on an earlier address. In particular, in the context of libp2p-kad it is not uncommon for peers to report localhost addresses while the local node actually has e.g. an ipfs node running on that address, obviously with a different peer ID; this is the scenario causing frequent invalid peer ID (mismatch) errors when running the ipfs-kad example in the go-ipfs docker container.

This PR thus restores the previous behaviour of trying all remaining addresses on invalid peer ID errors, and ensures that `inject_dial_error` is always called when the last attempt failed, regardless of whether the peer is already connected (e.g. as a result of a simultaneous incoming connection). Overall this is a slight simplification, though it requires some additional cloning of peer IDs (as was also done before #1440).

[#1440]: libp2p#1440
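The restored dialing behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the actual libp2p `Pool` code; the `DialError` enum and `dial_all` helper are hypothetical simplifications showing the essential logic: every remaining address is still attempted after an invalid peer ID error, and the failure of the final attempt is the one that gets reported (the point where `inject_dial_error` would be invoked on the behaviour).

```rust
/// Simplified stand-in for the kinds of dial failures discussed in this PR.
#[derive(Debug, PartialEq)]
enum DialError {
    /// The remote answered with an unexpected peer ID (mismatch).
    InvalidPeerId,
    /// The address could not be reached at all.
    Unreachable,
}

/// Try each address in turn. Returns the first address that connects
/// successfully, or the error of the *last* attempt. Note that an
/// `InvalidPeerId` error no longer aborts the dial silently: it is treated
/// like any other per-address failure, and the remaining addresses are tried.
fn dial_all<'a, F>(addrs: &[&'a str], mut attempt: F) -> Result<&'a str, DialError>
where
    F: FnMut(&str) -> Result<(), DialError>,
{
    let mut last_err = DialError::Unreachable;
    for &addr in addrs {
        match attempt(addr) {
            Ok(()) => return Ok(addr),
            // Remember the error; it is only surfaced if no address succeeds.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // The first address hosts a different node (peer ID mismatch, e.g. a
    // local ipfs daemon); the second address is the peer we actually want.
    let addrs = ["/ip4/127.0.0.1/tcp/5001", "/ip4/192.0.2.1/tcp/4001"];
    let result = dial_all(&addrs, |a| {
        if a.contains("127.0.0.1") {
            Err(DialError::InvalidPeerId)
        } else {
            Ok(())
        }
    });
    // The dial still succeeds via the second address.
    assert_eq!(result, Ok("/ip4/192.0.2.1/tcp/4001"));

    // If every attempt fails, the last error is reported instead of being
    // swallowed -- the regression this PR fixes.
    let result = dial_all(&addrs, |_| Err(DialError::InvalidPeerId));
    assert_eq!(result, Err(DialError::InvalidPeerId));
}
```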
Good catch!
core/src/connection/pool.rs
Outdated
assert_ne!(&self.local_id, entry.connected().peer_id());
if let Some(peer) = peer {
    assert_eq!(&peer, entry.connected().peer_id());
I suppose that this is because you require `Debug` above? If so, I'd prefer to not enforce `Debug` if possible.
Suggested change:
- assert_ne!(&self.local_id, entry.connected().peer_id());
- if let Some(peer) = peer {
-     assert_eq!(&peer, entry.connected().peer_id());
+ assert_ne!(&self.local_id, entry.connected().peer_id(), "Unexpected local peer ID");
+ if let Some(peer) = peer {
+     assert_eq!(&peer, entry.connected().peer_id(), "PeerId mismatch");
`fmt::Debug` is required by `assert_eq!` and `assert_ne!` for the arguments, so are you suggesting something like:

if &self.local_id == entry.connected().peer_id() {
    panic!("...")
}
if let Some(peer) = peer {
    if peer != entry.connected().peer_id() {
        panic!("...")
    }
}

? It is unfortunate, though, not to have the problematic peer IDs in the output. Why is it undesirable to require `fmt::Debug`?
I have now done this: 5807155
> Why is it undesirable to require `fmt::Debug`?

I just see it as "more correct" to only enforce the minimum possible set of traits. In an ideal world, we would print the `PeerId` if it implements `Debug` and not print it if it doesn't. But that's unfortunately not possible.
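The trade-off discussed above can be made concrete with a small sketch (the generic `check_*` functions here are hypothetical illustrations, not code from this PR): `assert_eq!` formats both operands with `{:?}` in its panic message, so it imposes a `fmt::Debug` bound, whereas a hand-rolled comparison with `panic!` needs only `PartialEq`, at the cost of not showing the mismatching values.

```rust
use std::fmt;

// Needs PartialEq (for the comparison) *and* Debug, because assert_eq!
// prints both operands with {:?} when the assertion fails.
fn check_with_assert<T: PartialEq + fmt::Debug>(expected: &T, actual: &T) {
    assert_eq!(expected, actual, "PeerId mismatch");
}

// Needs only PartialEq: the minimal trait bound, but the panic message
// cannot include the problematic values -- the downside noted above.
fn check_with_panic<T: PartialEq>(expected: &T, actual: &T) {
    if expected != actual {
        panic!("PeerId mismatch");
    }
}

fn main() {
    // Both pass silently on equal values.
    check_with_assert(&42u32, &42u32);
    check_with_panic(&"peer-a", &"peer-a");
}
```

Conditionally printing the value only when `T: Debug` would require specialization, which is not available on stable Rust, hence the "unfortunately not possible" above.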
core/src/connection/pool.rs
Outdated
@@ -536,7 +555,7 @@ where
     PoolEvent<'a, TInEvent, TOutEvent, THandler, TTransErr, THandlerErr, TConnInfo, TPeerId>
 > where
     TConnInfo: ConnectionInfo<PeerId = TPeerId> + Clone,
-    TPeerId: Clone,
+    TPeerId: Clone + fmt::Debug,
Suggested change:
- TPeerId: Clone + fmt::Debug,
+ TPeerId: Clone,
core/src/network.rs
Outdated
@@ -331,7 +330,7 @@ where
     THandler::Handler: ConnectionHandler<Substream = Substream<TMuxer>, InEvent = TInEvent, OutEvent = TOutEvent> + Send + 'static,
     <THandler::Handler as ConnectionHandler>::Error: error::Error + Send + 'static,
     TConnInfo: Clone,
-    TPeerId: AsRef<[u8]> + Send + 'static,
+    TPeerId: fmt::Debug + Send + 'static,
Suggested change:
- TPeerId: fmt::Debug + Send + 'static,
+ TPeerId: Send + 'static,
This PR restores the ipfs-kad example as an integration test for CI, which should now be stable again.