
Add a "transient" network connectivity state #2696

Merged (18 commits) on May 8, 2024

Conversation

@Stebalien (Member) commented Jan 29, 2024

Previously, we'd consider "transiently" connected peers to be connected. This meant:

  1. We wouldn't fire a second event when transitioning to "really connected". There wasn't a way for users to wait for a "full" connection without using the old-style Notify interface.
  2. "Connectedness" checks would be a little too eager to treat a peer as connected.

For 99% of users, "transient" peers should be treated as disconnected. So while it's technically a breaking change to split out "transient" connectivity into a separate state, I expect it's more likely to fix bugs than to introduce them.

Unfortunately, this change did require several changes to go-libp2p itself because go-libp2p does care about transient connections:

  1. We want to keep peerstore information for transient peers.
  2. We may sometimes want to treat peers as "connected" in the host.
  3. Identify still needs to run over transient connections.

fixes #2692
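The behavior change described above can be illustrated with a small stand-alone sketch. The type and constant names below are local stand-ins modeled on go-libp2p's network.Connectedness, not the exact API:

```go
package main

import "fmt"

// Local stand-ins for go-libp2p's network.Connectedness values;
// the real constants live in github.com/libp2p/go-libp2p/core/network.
type Connectedness int

const (
	NotConnected Connectedness = iota
	Limited                    // only "transient" (e.g. relayed) connections
	Connected                  // at least one full connection
)

// usable reflects the rule of thumb from the description: for most
// callers, a transiently connected peer should count as disconnected.
func usable(c Connectedness) bool { return c == Connected }

func main() {
	// A relayed connection comes up first, then is upgraded to a
	// direct one. With the old behavior both looked "Connected" and
	// no second event fired; with the split state a consumer
	// observes two distinct transitions.
	for _, s := range []Connectedness{Limited, Connected} {
		fmt.Printf("state=%d usable=%v\n", s, usable(s))
	}
}
```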

if !forceDirect {
if rh.Network().Connectedness(pi.ID) == network.Connected {
connectedness := rh.Network().Connectedness(pi.ID)
Member Author (@Stebalien):
Ideally we wouldn't try to perform routing if we're just going to wait on an ongoing connection. But that's not a new issue.

@sukunrt (Member) left a comment:

I'm fine with these changes. We just need to

  1. Fix this for the close path.
  2. Keep existing Connect behavior. I'll raise a PR to make Connect consistent with NewStream and return network.ErrTransientConn when appropriate.
  3. Deprecate the items in network.Connectedness that we don't care about.

Comment on lines 400 to 402
// Notify goroutines waiting for a direct connection
if !c.Stat().Transient {
newState := network.Transient
Member:

We should calculate newState just after we add the connection to the conns map, using connectednessUnlocked. We already hold the lock, and it's a small set to check anyway.
There's a problem here: if someone makes a relayed connection after we already have a direct connection, this will send a limited-connectivity event.

Contributor:

I am not sure I addressed it correctly, can you please double check @sukunrt ?

Member:

We need:

	oldState := s.connectednessUnlocked(p)
	s.conns.m[p] = append(s.conns.m[p], c)
	newState := s.connectednessUnlocked(p)

This prevents us from sending a PeerConnectedness Limited event when we make a relayed connection after a direct connection.

@sukunrt sukunrt mentioned this pull request Feb 21, 2024
@sukunrt sukunrt self-requested a review March 22, 2024 15:28
@lidel (Member) commented Apr 16, 2024

@sukunrt is there any remaining work here, or awaiting your final review?

@lidel lidel added the need/maintainer-input Needs input from the current maintainer(s) label Apr 16, 2024
@MarcoPolo MarcoPolo self-requested a review April 16, 2024 17:08
@MarcoPolo (Collaborator): I'll review this as well.

// changed notifications consider closed connections. When remote closes a connection
// the underlying transport connection is closed first, so tracking changes to the connectedness
// state requires considering this recently closed connections impact on Connectedness.
func (s *Swarm) connectednessUnlocked(p peer.ID, considerClosed bool) network.Connectedness {
Collaborator:

This comment doesn't make sense to me. I don't really understand the point of the body either.

It seems like we have two cases:

  1. We have multiple conns to a peer and we close one. If any other conn exists and is open, then we are connected.
  2. We only have one conn to a peer and we close it. The peer is no longer connected.

Why are we potentially returning Connected in the case that every conn in the map returns c.IsClosed()=true with considerClosed=true?

Contributor:

We may also have limited connections to the peer. So once a connection is closed, we check:

  • If there is at least 1 unlimited connection, return Connected
  • Else, if there is at least 1 limited connection, return Limited
  • Else, return NotConnected

> Why are we potentially returning Connected in the case that every conn in the map returns c.IsClosed()=true with considerClosed=true?

It is true. However, considerClosed is true only when called from addConn and removeConn, which hold the lock. removeConn makes sure to remove the connection from s.conns.m, so the function will not return Connected if we removed the last unlimited connection.

Though, I don't really understand why considerClosed is needed at all. @sukunrt could you provide some context?
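The three rules above (any open unlimited connection wins, then limited, then not connected) can be sketched as a stand-alone model; this is a simplified illustration, not the actual connectednessUnlocked code:

```go
package main

import "fmt"

// Conn is a simplified stand-in for a swarm connection.
type Conn struct {
	Limited bool // e.g. a relayed connection
	Closed  bool
}

type Connectedness int

const (
	NotConnected Connectedness = iota
	Limited
	Connected
)

// connectedness applies the three rules: any open unlimited conn
// means Connected; otherwise any open limited conn means Limited;
// otherwise NotConnected.
func connectedness(conns []Conn) Connectedness {
	state := NotConnected
	for _, c := range conns {
		if c.Closed {
			continue
		}
		if !c.Limited {
			return Connected
		}
		state = Limited
	}
	return state
}

func main() {
	direct := Conn{}
	relayed := Conn{Limited: true}
	fmt.Println(connectedness([]Conn{direct, relayed}) == Connected) // true
	fmt.Println(connectedness([]Conn{relayed}) == Limited)           // true
	fmt.Println(connectedness(nil) == NotConnected)                  // true
}
```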

Member:

I agree this is confusing.

This function is called in two different situations:

When we close the connection from our side by calling Close, it is called while the underlying transport connection is not yet closed.

When the connection closes because the remote closed it, it is called after the underlying transport connection has already been closed.

Member:

Consider that we are connected to a peer over both a QUIC connection and a relayed connection.

Now the peer closes the QUIC connection, so we have moved from the Connected state to the Limited state.

To make this work, when calculating oldState in removeConn we need to compute connectivity as if the QUIC connection were still connected.

Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.
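The alternative of remembering the last sent event can be sketched as follows (simplified, hypothetical types; not the actual swarm code):

```go
package main

import "fmt"

type peerID string
type state int

const (
	notConnected state = iota
	limited
	connected
)

// emitter remembers the last connectedness event actually sent for
// each peer and only emits on a real change, instead of trying to
// reconstruct the pre-close state with a considerClosed flag.
type emitter struct {
	lastEmitted map[peerID]state
	events      []string
}

func (e *emitter) emitIfChanged(p peerID, now state) {
	if e.lastEmitted[p] == now {
		return
	}
	e.lastEmitted[p] = now
	e.events = append(e.events, fmt.Sprintf("%s->%d", p, now))
}

func main() {
	e := &emitter{lastEmitted: map[peerID]state{}}
	e.emitIfChanged("p1", connected) // QUIC + relayed conns: Connected
	e.emitIfChanged("p1", connected) // another conn: no duplicate event
	e.emitIfChanged("p1", limited)   // remote closes QUIC: Limited
	e.emitIfChanged("p1", notConnected)
	fmt.Println(e.events)
}
```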

Collaborator:

> Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.

I think we should do this.

Member:

done

@sukunrt sukunrt removed the need/maintainer-input Needs input from the current maintainer(s) label Apr 30, 2024
@sukunrt sukunrt requested a review from MarcoPolo April 30, 2024 16:28
@sukunrt (Member) commented May 2, 2024

We have a deadlock in our event sending mechanism when the consumer closes the connection in the subscription goroutine.

Closing the connection causes the subscription goroutine to emit an event, and if the subscription queue is full, this deadlocks. See the test in 57c688e

There are two options here:

  1. We disallow Close in subscription.
  2. We enqueue the events in a slice and send them from a separate goroutine. This slice will only take 2*NumConns space, since for a given connection we will add a connectedness event at most twice. This matches the current memory usage, since today the goroutines block on sending the connectedness event.

We also need 2 to ensure that we send Limited vs NotConnected events in the right order.

If we are connected to a peer via both a relay and a direct connection, and the relayed and direct connections close concurrently, there is currently a race over which event is sent last; option 2 ensures that NotConnected is always sent last.

I have implemented 2 in 0dcc73b

I'll refactor a bit if we do decide to go ahead with 2.
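Option 2 can be sketched as an append-under-lock slice with a wakeup channel, so the goroutine that closes a connection never blocks on a slow subscriber. This is a simplified model, not the actual implementation in 0dcc73b:

```go
package main

import (
	"fmt"
	"sync"
)

// eventQueue buffers connectedness events in a slice instead of
// blocking the goroutine that closes a connection. A single drainer
// preserves enqueue order, so NotConnected is always delivered last.
type eventQueue struct {
	mu     sync.Mutex
	events []string
	notify chan struct{} // capacity 1: wakes the drainer
}

func newEventQueue() *eventQueue {
	return &eventQueue{notify: make(chan struct{}, 1)}
}

// push appends an event and signals the drainer without ever blocking.
func (q *eventQueue) push(ev string) {
	q.mu.Lock()
	q.events = append(q.events, ev)
	q.mu.Unlock()
	select {
	case q.notify <- struct{}{}:
	default:
	}
}

// drain takes all buffered events in FIFO order.
func (q *eventQueue) drain() []string {
	q.mu.Lock()
	evs := q.events
	q.events = nil
	q.mu.Unlock()
	return evs
}

func main() {
	q := newEventQueue()
	// Direct and relayed conns to the same peer close around the same
	// time; the closers only append, the drainer delivers in order.
	q.push("p1:Limited")
	q.push("p1:NotConnected")
	fmt.Println(q.drain())
}
```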

@sukunrt sukunrt requested a review from MarcoPolo May 2, 2024 13:33
@sukunrt sukunrt force-pushed the steb/connected-v-transient branch 3 times, most recently from 12d5826 to c68a035 Compare May 2, 2024 16:32
// This ensures that if an event subscriber closes the connection from the subscription goroutine,
// this doesn't deadlock
s.refs.Add(1)
go func() {
Collaborator:

I would really like to avoid spawning a goroutine on close.

Member Author:

Yeah, we shouldn't do this. I haven't read the rest of the patch so I might be missing an obvious solution, but it looks like we're just trying to "raise" a flag for this peer.

I'd just avoid channels entirely and use the pattern I used in https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L69-L81, https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L97-L122. It's not simple, but it works and avoids stacking goroutines.

Collaborator:

I haven’t looked closely, but it seems like changeQueue is unbounded. Is that right?

Member Author:

Technically, but it shouldn't be an issue in practice. We only add a peer to the queue the first time the connectivity state changes. If we end up with a lot of "dead" peers in this queue, we should burn through them very quickly (transition from disconnected -> disconnected doesn't require any events).

We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.

@sukunrt (Member), May 3, 2024:

I think a queue with this will work

We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.

We do need the complexity: a malicious peer can connect and disconnect very quickly, causing this queue to grow fast if the consumer is slow. It can also do this with different peer IDs.
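That deduplication can be sketched as a queue that holds each peer at most once, with the worker reading the peer's current state at delivery time, so churn collapses into a single entry. This is a simplified model in the spirit of the connecteventmanager pattern linked earlier, not the actual implementation:

```go
package main

import "fmt"

type peerID string

// pendingQueue holds each peer at most once. The worker looks up the
// peer's *current* state when it dequeues, so a peer that flaps
// connect/disconnect while the consumer is slow costs one slot, not
// one slot per transition.
type pendingQueue struct {
	pending map[peerID]bool
	order   []peerID
}

func newPendingQueue() *pendingQueue {
	return &pendingQueue{pending: map[peerID]bool{}}
}

// mark enqueues a peer unless it is already pending.
func (q *pendingQueue) mark(p peerID) {
	if q.pending[p] {
		return // already queued; the worker will see the latest state
	}
	q.pending[p] = true
	q.order = append(q.order, p)
}

// pop dequeues the next pending peer in FIFO order.
func (q *pendingQueue) pop() (peerID, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	p := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, p)
	return p, true
}

func main() {
	q := newPendingQueue()
	// A peer flaps rapidly: only one entry is kept for it.
	for i := 0; i < 1000; i++ {
		q.mark("attacker")
	}
	q.mark("honest")
	fmt.Println(len(q.order)) // 2
}
```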

Collaborator:

As a general rule we should have no unbounded queues in the core logic. In this case I can think of a way that a reasonable application can fall victim to a DoS attack.

Example: an application opens a new stream and does an exchange with a newly connected peer. If this is done by the same goroutine that reads events, then the attacker controls the rate at which events are popped off the queue, and can thus exploit this.

Member Author:

Ideally, I'd agree. But in cases like this we can argue that the queue will hit a steady-state max size and stop growing (assuming the read side isn't deadlocked forever). It's not great, but if someone can DoS us due to this queue, they can also DoS us by running us out of memory before GC can kick in.

But ideally we'd just have some form of backpressure somewhere. And I think that's what we'll need given #2696 (comment)

Member:

I am trying an approach with backpressure on the add side and an unbounded queue on the remove side (close connection). The remove side is effectively bounded by the total number of connections, because the add side has backpressure.

@sukunrt (Member), May 3, 2024:

How does 5271dd8 1abc063 look?

We apply backpressure when adding connections to the swarm. This allows us to enqueue remove-conn events in a slice without spawning a goroutine; backpressure on adding connections bounds the slice length. We also ensure that we send at least one disconnection event for every peer we connect to.
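The shape of that idea can be sketched with a semaphore bounding admitted connections, while removals only append to a slice whose length is bounded by the number of admissions (hypothetical, simplified; not the swarm code):

```go
package main

import "fmt"

// addLimiter applies backpressure on the add-connection path.
// Because every removal corresponds to a previously admitted
// connection, the removal slice is bounded without its own limit.
type addLimiter struct {
	sem     chan struct{}
	removed []string
}

func newAddLimiter(capacity int) *addLimiter {
	return &addLimiter{sem: make(chan struct{}, capacity)}
}

// tryAdd admits a connection if below capacity (a real swarm would
// block here, providing backpressure, rather than fail).
func (l *addLimiter) tryAdd(id string) bool {
	select {
	case l.sem <- struct{}{}:
		return true
	default:
		return false
	}
}

// remove records a removal event and releases the admission slot.
func (l *addLimiter) remove(id string) {
	l.removed = append(l.removed, id)
	<-l.sem
}

func main() {
	l := newAddLimiter(2)
	fmt.Println(l.tryAdd("c1"), l.tryAdd("c2"), l.tryAdd("c3"))
	l.remove("c1")
	fmt.Println(l.tryAdd("c4"))
}
```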

Member Author:

I don't really have time for a full review but:

  1. I'd be careful to avoid too many channels/goroutines. Go tends to push you in that direction, but it can quickly get harder to understand than simple locks & conditions.
  2. Make sure backpressure is applied all the way to accepting connections in the first place. It sounds like this might be a gap in libp2p and may be something we want to fix in a separate patch.

(I defer to @MarcoPolo on all of this, I'm just kibitzing from the sidelines)

@sukunrt sukunrt force-pushed the steb/connected-v-transient branch 5 times, most recently from cee9aa1 to 02a74d8 Compare May 4, 2024 20:00
@sukunrt sukunrt force-pushed the steb/connected-v-transient branch 3 times, most recently from 9a247a9 to a16b08d Compare May 5, 2024 08:44
Stebalien and others added 17 commits May 5, 2024 14:23
@sukunrt sukunrt force-pushed the steb/connected-v-transient branch from a16b08d to 25ccbf2 Compare May 5, 2024 08:54
@MarcoPolo (Collaborator) left a comment:

Just one question, but I think this looks a lot cleaner than what was there before. Thanks!

@sukunrt sukunrt force-pushed the steb/connected-v-transient branch from e9b2300 to b542b15 Compare May 7, 2024 19:03
@sukunrt sukunrt merged commit af0161e into master May 8, 2024
10 of 11 checks passed
Successfully merging this pull request may close these issues.

Add Transient as a "connectedness" state
6 participants