Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maybe stream should be broken with None after an error #222

Closed
rib opened this issue Jan 9, 2020 · 10 comments
Closed

Maybe stream should be broken with None after an error #222

rib opened this issue Jan 9, 2020 · 10 comments
Labels
A-tonic C-bug Category: Something isn't working
Milestone

Comments

@rib
Copy link

rib commented Jan 9, 2020

Within a Tonic server, handling a request where a client opens a 'keep alive' stream, I got bitten today by only checking for None to determine if a client had disappeared and the stream was closed.

I.e. I had a loop like this to read the stream:

tokio::spawn(async move {
    while let Some(_) = keep_alive_stream.next().await {
        // SNIP: ensure server keeps multi-user session state alive
    }
});

but after closing the client, it ended up being considered an error within Tonic that actually left my server code stuck in an infinite loop after that client dissappeared, since I had neglected to check the inner Result status returned by .next().await.

Arguably it's just my bug that I ignored the result status, but also perhaps arguably the stream has ended, and maybe Tonic could recognise this and limit how many errors may be reported before marking the stream end with None?

Just for reference, this is the error I saw earlier:
tonic::codec::decode: decoder inner stream error: Status { code: Unknown, message: "error reading a body from connection: protocol error: stream no longer needed" }
Though it wasn't entirely clear to me what the error really meant.

and the preceding h2 tracing made it seem to me like a fairly clean break by the client:

Jan 09 17:34:22.950 DEBUG h2::codec::framed_read: received; frame=Reset { stream_id: StreamId(5), error_code: CANCEL }
Jan 09 17:34:22.951 DEBUG h2::codec::framed_read: received; frame=Reset { stream_id: StreamId(3), error_code: CANCEL }
Jan 09 17:34:22.951 DEBUG hyper::proto::h2::server: stream error: operation was canceled
Jan 09 17:34:22.964 DEBUG hyper::proto::h2::server: stream error: operation was canceled

The fix for my server was simple enough, after realising the issue - but filing this in case it makes sense to tweak Tonic's error handling to ensure a disconnected client won't leave a stream in an endless error state.

@LucioFranco
Copy link
Member

So I think in this case you probably want to use TryStreamExt::try_next but there is also https://docs.rs/tonic/0.1.0-beta.1/tonic/struct.Streaming.html#method.message which returns a Result on the outter. I'm not sure if the behavior should be to end the stream on the first error or not? In theory, you should be checking the error and exiting on your own and if you poll again it should error. But not sure exactly what the right behavior should be.

@rib
Copy link
Author

rib commented Jan 11, 2020

Ah, neat, yeah try_next is a nice option for moving the result to the outside.

Thanks for the pointer about message too and I'll see about using one or the other here.

Yeah, I'm not entirely sure about the ideal way to handle stream termination here, but my gut feeling is that if Tonic knows that a stream has effectively been terminated (even if it's due to some error condition) then it should probably report None within a finite amount of time.

If Tonic knows that an error is fatal with respect to a stream then I guess after reporting the error it could internally mark the stream as terminated and that status would have higher precedence than any persistent error condition, and None could be sent on the next read.

Rust's stream polling seems to be designed to allow for streams to contain multiple errors and in general errors within a stream don't have to imply that the stream has ended.

Earlier versions of Stream (https://docs.rs/futures/0.2.1/futures/stream/trait.Stream.html) which used to return a Result from poll_next were documented with this:

Streams, like futures, also bake in errors through an associated Error type. An error on a stream does not terminate the stream. That is, after one error is received, another value may be received from the same stream (it's valid to keep polling). Thus a stream is somewhat like an Iterator<Item = Result<T, E>>, and is always terminated by returning None.

Now stream polling is even more orthogonal from error reporting, but the above idea still seems to makes sense with the final Stream API too.

Considering that at the lowest level stream polling doesn't care about errors flowing through. then I think that gives me some expectation that None would be the most reliable, highest precedence, indicator of a stream ending and errors are per-stream implementation details. I would think about checking None before checking for errors because the former is a more fundamental state for the stream and errors are optional (in general for rust I mean - not in this specific case).

Semantically it seems to me like Tonic is not terminating the stream for some stream-fatal error conditions, and it's instead transitioning the stream in to a perpetual error stream. Maybe that's fine (it's simple enough to handle if this is known) but maybe it goes against the design spirit of rust streams a bit, to require errors to be interpreted as stream terminators?

Playing devils advocate really ;-p I haven't really looked at lots of apis to have a good sense of what the best choice is and I'm probably biased now anyway :)

@LucioFranco LucioFranco modified the milestone: 0.1 Jan 12, 2020
@LucioFranco
Copy link
Member

So I am going to remove this from the 0.1 milestone. One, thing to think about here in the future is that we can't just treat all connections as being http2 and based on hyper. Thus we need to respect the traits, the traits don't define that if you error its over so I'm not sure if changing that behavior should be done here or at a more foundational level. I'd like to continue thinking about this but we can push it to a future release.

@sfackler
Copy link

While you can use try_* methods to short circuit, the current behavior of infinitely returning the same error is extremely footgun-y. Doing a for_each_concurrent to process the stream rather than a try_for_each_concurrent will cause you to peg a core at 100% if an error occurs.

@LucioFranco
Copy link
Member

Once a stream hits an error it will report it once then return None I believe this should fix this issue? Here is the relevant commit 737ace3

@sfackler
Copy link

I think that fixes this issue when an outgoing stream of responses reports errors, but not when an incoming stream of requests reports errors.

@LucioFranco
Copy link
Member

LucioFranco commented Sep 20, 2021

incoming stream of requests reports errors

I'm not totally following, you mean a client sending messages on a client side stream -> server? There shouldn't be any errors in that case iirc. Also Monday so may not be thinking correctly just yet :D

@sfackler
Copy link

client.rs:

pub mod echo {
    tonic::include_proto!("grpc.examples.echo");
}

use echo::{echo_client::EchoClient, EchoRequest};
use futures::stream;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://[::1]:50051")
        .connect()
        .await?;

    let mut echo_client = EchoClient::new(channel);

    echo_client
        .client_streaming_echo(stream::repeat(EchoRequest {
            message: "hello".into(),
        }))
        .await;

    Ok(())
}

server.rs:

use futures::{Stream, StreamExt};
use std::pin::Pin;
use tonic::{transport::Server, Request, Response, Status};

pub mod echo {
    tonic::include_proto!("grpc.examples.echo");
}

use echo::{
    echo_server::{Echo, EchoServer},
    EchoRequest, EchoResponse,
};

type ResponseStream = Pin<Box<dyn Stream<Item = Result<EchoResponse, Status>> + Send + Sync>>;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let addr = "[::1]:50051".parse().unwrap();

    let echo = EchoServer::new(MyEcho::default());

    Server::builder().add_service(echo).serve(addr).await?;

    Ok(())
}

#[derive(Default)]
pub struct MyEcho;

#[tonic::async_trait]
impl Echo for MyEcho {
    async fn unary_echo(&self, _: Request<EchoRequest>) -> Result<Response<EchoResponse>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }

    type ServerStreamingEchoStream = ResponseStream;

    async fn server_streaming_echo(
        &self,
        _: Request<EchoRequest>,
    ) -> Result<Response<Self::ServerStreamingEchoStream>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }

    async fn client_streaming_echo(
        &self,
        request: Request<tonic::Streaming<EchoRequest>>,
    ) -> Result<Response<EchoResponse>, Status> {
        let mut stream = request.into_inner();
        while let Some(r) = stream.next().await {
            println!("{:#?}", r);
        }

        Ok(Response::new(EchoResponse {
            message: "hi".to_string(),
        }))
    }

    type BidirectionalStreamingEchoStream = ResponseStream;

    async fn bidirectional_streaming_echo(
        &self,
        _: Request<tonic::Streaming<EchoRequest>>,
    ) -> Result<Response<Self::BidirectionalStreamingEchoStream>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }
}

Run both, then kill the client. The server will infinitely print

Err(
    Status {
        code: Unknown,
        message: "error reading a body from connection: broken pipe",
        source: None,
    },
)

@LucioFranco
Copy link
Member

Btw I am still looking into this but was originally having trouble reproducing this in a test. So will continue to look into this.

@LucioFranco LucioFranco added this to the 0.6 milestone Oct 6, 2021
@LucioFranco LucioFranco added A-tonic C-bug Category: Something isn't working labels Oct 13, 2021
@LucioFranco LucioFranco modified the milestones: 0.6, 0.7 Oct 20, 2021
@LucioFranco
Copy link
Member

Ok here is an attempted fix that should in general just return None when we've seen an error from poll_data. So it should go Some(Err) thenNone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tonic C-bug Category: Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants