Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(swarm)!: Remove ConnectionHandler::Error in favor of CloseReason #3201

Closed
wants to merge 8 commits into from

Conversation

thomaseizinger
Copy link
Contributor

Description

Previously, a ConnectionHandler could define an error type for why it would actively close a connection. This type had to passed through a lot of layers until it finally ended up in SwarmEvent::ConnectionClosed. Most notable, this type appeared everywhere where SwarmEvent is referenced. Due to the composition of ConnectionHandlers, this type is clunky to name and changes every time you change the composition of your derived NetworkBehaviour.

Very likely, this error is only logged and remains uninspected in the majority of cases. To simplify the public API, we box up the error. If users still want to inspect it, they can attempt downcasting the error a specific one.

Notes

Opening this as a draft to get some feedback.

Links to any relevant issues

Open Questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Now that we are always boxing the error, it makes more sense to
have handlers directly return the most specific error without
additional wrapping.
@rkuhn
Copy link
Contributor

rkuhn commented Dec 8, 2022

Yes, I agree that this type parameter tends to complicate things, so this is a good change. There is one use-case for the pattern match:

https://github.com/ipfs-rust/ipfs-embed/blob/db72675b4fd787096511100ab29c21795f30a386/src/net/peers.rs#L372-L402

To continue supporting this will require some discipline by protocol authors once this PR is merged — we’d need a dedicated error type (like GossipsubHandlerError or ping’s Failure) for every protocol to allow downcasting. Most other protocols use ConnectionHandlerUpgrErr<std::io::Error>, which means the protocol that requested the connection to be closed cannot be identified.

@thomaseizinger
Copy link
Contributor Author

Yes, I agree that this type parameter tends to complicate things, so this is a good change. There is one use-case for the pattern match:

https://github.com/ipfs-rust/ipfs-embed/blob/db72675b4fd787096511100ab29c21795f30a386/src/net/peers.rs#L372-L402

But you are just logging it right? That will still be supported and then Option also remains so I think this is covered!

@rkuhn
Copy link
Contributor

rkuhn commented Dec 9, 2022

No, “just logging it” wouldn’t need the pattern match — the crucial point is logging the origin of the error.

@thomaseizinger
Copy link
Contributor Author

No, “just logging it” wouldn’t need the pattern match — the crucial point is logging the origin of the error.

Right. Backtraces are stable since 1.65 so we could return a stacktrace as well which would clearly show the origin of the error.

@thomaseizinger
Copy link
Contributor Author

No, “just logging it” wouldn’t need the pattern match — the crucial point is logging the origin of the error.

Right. Backtraces are stable since 1.65 so we could return a stacktrace as well which would clearly show the origin of the error.

If we go this path, it would probably be good to define a custom type that can be constructed from anything E: Error and does the capturing of the stacktrace for the user, so they can't forget / aren't forced to do it manually.

@rkuhn
Copy link
Contributor

rkuhn commented Dec 12, 2022

Coming back to the code I linked: having a backtrace wouldn’t solve the problem — log messages preferably are concise and fit on a single line, so displaying a backtrace is not desirable. Parsing a backtrace to match on stack frame names is a rather brittle way of figuring out the single word label to apply in the logging. So I maintain that creating a specific error type for each protocol is the best solution proposed so far.

IOW, I’d like to avoid code having to play “who dunnit” when it comes to figuring out why a connection was closed.

@thomaseizinger
Copy link
Contributor Author

Yeah that is fair. I am just wondering, what the best API is to help users here.

@rkuhn
Copy link
Contributor

rkuhn commented Dec 13, 2022

Barring innovation in the “Rust error handling space” done by libp2p (which I’m not sure would be a great idea) I see three possibilities:

  1. add an enum with one variant per protocol in this repository
  2. enum like above plus one case Other(Box<dyn Error + Send + 'static>)
  3. use plain Box<dyn Error + Send + 'static> and make sure that each protocol has its own type

The first is not extensible, the second offer second-class citizenship to externally defined protocol, the third is extensible with the same rules for every participant. It does look a bit ugly, though, and handling would benefit from a macro for writing down the downcasting chains (well, Java devs would immediately feel at home regardless). Still, I’d prefer number three.

@divagant-martian
Copy link
Contributor

Having the error boxed seems reasonable to me.

Just checked to be sure how we handle this, and we currently only even inspect errors returned as explicit handler events, not closing errors. At this point the error the PR removes is not even logged (we most likely should) but a boxed error will work for that matter anyway

@thomaseizinger
Copy link
Contributor Author

Barring innovation in the “Rust error handling space” done by libp2p (which I’m not sure would be a great idea) I see three possibilities:

1. add an enum with one variant per protocol in this repository

2. enum like above plus one case `Other(Box<dyn Error + Send + 'static>)`

3. use plain `Box<dyn Error + Send + 'static>` and make sure that each protocol has its own type

The first is not extensible, the second offer second-class citizenship to externally defined protocol, the third is extensible with the same rules for every participant. It does look a bit ugly, though, and handling would benefit from a macro for writing down the downcasting chains (well, Java devs would immediately feel at home regardless). Still, I’d prefer number three.

How about the following:

  • A newtype that wraps a Box<dyn Error + Send + 'static>
  • Newtype also has a constructor that accepts an error and a &'static str which will construct an error saying: "connection closed by XYZ: {reason}" with "XYZ" being the &'static str argument.

This would make it easy for users to do the right thing, even if they don't create their own type. I think chances are, even if users create their own type, they are not going to include the name of the component in the message. That will lose this detail again if users just log the error which is what I'd want them to do. They may also downcast but ideally not just for logging reasons.

@thomaseizinger thomaseizinger changed the title feat(swarm)!: Remove ConnectionHandler::Error in favor of Box<dyn Error> feat(swarm)!: Remove ConnectionHandler::Error in favor of CloseReason Dec 16, 2022
Copy link
Contributor

@rkuhn rkuhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this! It makes it clear to protocol authors what is expected of them.

.push_back(ConnectionHandlerEvent::Close(CloseReason::new(
"dcutr-relay",
p,
)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about trait CloseReasonExt to make this ConnectionHandlerEvent::Close(p.closeReason("dcutr-relay"))?

Copy link
Contributor Author

@thomaseizinger thomaseizinger Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do:

p.into_close_reason_for("dcutr-relay")

perhaps?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to start bikesheding but what about impl From<(&str, Box<dyn Error + Send + 'static>)> for CloseReason and have ConnectionHandlerEvent::Close(("dcutr-relay", p).into())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am not a fan of From<(.., ..)> implementations. I generally tend to avoid to implement From unless I can provide a function somewhere that accepts impl Into and thus allows us to avoid .into() calls somewhere. To me, calling .into yourself is a bit of a code smell.


impl error::Error for CloseReason {
fn source(&self) -> Option<&(dyn error::Error + 'static)> {
Some(self.source.as_ref())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps document that logging these errors should follow the .source() chain, e.g. by using anyhow and formatting with {:#}. Or is there a generic helper for implementing alternate Error printing so we can offer it out of the box? (This is the weakest point in Rust for me, probably since I’m spoilt by living on the JVM for so long.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is unfortunately but I whole-heartedly agree that it is really missing from std.

We could offer it on this type-specifically although I think it would just be better to tell users to use anyhow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let’s document it.

@thomaseizinger thomaseizinger marked this pull request as ready for review December 16, 2022 22:22
@mergify
Copy link
Contributor

mergify bot commented Dec 16, 2022

This pull request has merge conflicts. Could you please resolve them @thomaseizinger? 🙏

@mergify
Copy link
Contributor

mergify bot commented Dec 23, 2022

This pull request has merge conflicts. Could you please resolve them @thomaseizinger? 🙏

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this simplification Thomas, and with CloseReason we seem to address Roland's concern.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the benefit of reduced type complexity with this patch. That said, I don't think it is worth giving up strong typing on the error. I would consider providing a string to identify an error, even though the information that the string carries is already available, a hack.

/// ```rust
/// # use libp2p_swarm::handler::CloseReason;
///
/// # fn main() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the main function needed? Aren't doc examples wrapped in a main automatically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tried but can investigate.

@@ -58,7 +58,6 @@ pub struct ConnectionHandler;
impl crate::handler::ConnectionHandler for ConnectionHandler {
type InEvent = Void;
type OutEvent = Void;
type Error = Void;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this type information useful. I.e. I believe there is value in making it impossible at compile time for a protocol to close a connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you articulate and perhaps quantify the value you are seeing?

In my view, type systems are best leveraged on boundaries. Boundaries are where we likely split work between developers (users vs maintainers for example) so they have documentation value and the compiler can ensure invariants.

In this case, the Error is part of a boundary but the consumer is within our responsibility (Swarm). Because it is an associated type, we have to cater for it to be anything so we can't really take advantage of the type system here. On the contrary, it makes things more complicated because we have to carry around a type parameter which is a cost in terms of complexity.

In addition to Swarm, this boundary also touches the implementer of a ConnectionHandler and the user of Swarm.

Let's start with the latter. Connections can be closed at any time for a variety of reasons. There is definitely value in getting notified about connections being closed. However, I'd argue that it is unlikely that one would take different codepaths depending on which part of my system actively closed a connection. A type-safe solution is useful if I want the compiler to ensure how and when certain code-paths are being taken. But, if we treat all events about closed connections the same, there is no added value in a type-system driven solution.

Last but not least, we have the implementer of ConnectionHandler. One may argue that it is a useful safety check to first declare that a protocol may never close a connection to ensure this isn't accidentally introduced later (either by the same dev or as part of a change by someone else). This is a plus for a type-system driven solution although it is important to add that one can still audit fairly easily whether a particular protocol actively closes a connection or not.

An additional negative point is that more moving parts (i.e. number of associated types) makes it more difficult for newcomers to implement a ConnectionHandler. By how much is difficult to measure but from my own experience, I was quite overwhelmed with this, esp. because the default value for when you generate this trait implementation (type Error = ()) doesn't compile.

In my view, this boils down to weighing the ease of auditability of protocols for active connection closings vs the complexity this type parameter causes in our codebase.

To me, reducing the complexity in our codebase is a pretty clear winner here. It is difficult to express this in numbers but I'd assume that auditing a protocol for when it closes connections is something that doesn't happen very often. Compared to that, we have 4 maintainers and various active contributers here that routinely come in contact with the parts of this codebase involving this type parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I forgot to add: A pretty substantial cost is that this type appears within SwarmEvent too which pretty much every user of libp2p comes in contact with and is clunky to name due to the composition of the error as part of NetworkBehaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the reasoning you lay out above. I do believe there is value in strong typing here, e.g. preventing at compile time that a ConnectionHandler never closes a connection. That said, I am not sure it warrants the downsides you list above. Long story short, I don't have a strong opinion here. Let's go with your suggestion.

/// Encapsulates the reason for why a specific [`ConnectionHandler`] closed a connection.
#[derive(Debug)]
pub struct CloseReason {
protocol: &'static str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this is a hack. The compiler is already aware of the place where a CloseReason is instantiated. Passing an additional string seems error prone to me. Is there no way we can leverage the type system to make passing a string unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a hack and I would like to get rid of it. If we agree above that we want something like this, I am happy to invest more time into this detail to make that better.

Copy link
Contributor Author

@thomaseizinger thomaseizinger Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative as discussed above is to recommend users to implement their own error type if they care about identifying the origin of the connection closure. We can follow this as a best-practice in our codebase.

This is kind of equivalent to what we have now. A user could already specify a generic error like std::io::Error here and thus essentially drop the information of who closed the connection. This assumes they don't want to rely on the order of how the error is composed as part of derive(NetworkBehaviour). I think this is a fair assumption to make given that one could change the order of fields in a NetworkBehaviour without necessarily generating a compile error.

@thomaseizinger
Copy link
Contributor Author

I tried to push this forward but the issue is actually bigger. Started a discussion here: #3307 (comment)

@thomaseizinger
Copy link
Contributor Author

My suggestion to move forward here is to:

  • Audit the codebase for when our protocols close connections. Most of them shouldn't actively do that in my opinion. Instead, they should probably just set keep-alive to No.
  • Send individual PRs to fix the above problems.
  • Come back to this PR and see how many protocols still actively close connections.
  • Introduce dedicated error types for those such that CloseReason can be downcasted to them.

@thomaseizinger
Copy link
Contributor Author

Note, I started a discussion that is related to this topic in #3353. @rkuhn you might be interested in that.

@thomaseizinger
Copy link
Contributor Author

Closing in favor of #3591.

@thomaseizinger thomaseizinger deleted the box-connection-handler-error branch April 26, 2023 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants