How do you write a resilient HTTP Hyper server that does not crash with "too many open files"? #1358
Comments
After digging around a bit in Hyper and Tokio I think that `tokio_core::net::Incoming` is flawed. It has an `Error` type, but each TCP connection it produces is a bare `TcpStream` rather than a `Result`. So there is no way to catch the IO error on an individual `accept()` attempt and then continue receiving new connections. Contrast that with `std::net::Incoming`, which produces a `Result` for each connection attempt. So I think the path forward here is to try out a forked version of tokio_core and Hyper to see if it is possible to produce TCP connections in the same `Result` fashion as the standard library.
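To illustrate the contract difference, here is a minimal std-only sketch (my own illustration, not code from the thread; the function name is made up) showing why `std::net::Incoming` yielding an `io::Result` per connection lets an accept loop survive individual failures instead of aborting:

```rust
use std::io;
use std::net::{TcpListener, TcpStream};
use std::thread;

// Accept up to `max` connections, skipping (not aborting on) accept errors.
// std::net::Incoming yields io::Result<TcpStream>, so an EMFILE-style
// failure on one accept() need not end the loop.
fn accept_resiliently(listener: &TcpListener, max: usize) -> usize {
    let mut accepted = 0;
    for conn in listener.incoming() {
        match conn {
            Ok(_stream) => {
                accepted += 1;
                if accepted == max {
                    break;
                }
            }
            // With tokio_core's Incoming there is no equivalent of this arm:
            // the stream yields plain TcpStreams and an error terminates it.
            Err(e) => eprintln!("accept error, continuing: {}", e),
        }
    }
    accepted
}

fn main() -> io::Result<()> {
    // Bind to an ephemeral port so the example is self-contained.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Connect once from another thread so accept() returns.
    let client = thread::spawn(move || {
        let _ = TcpStream::connect(addr);
    });

    let n = accept_resiliently(&listener, 1);
    println!("accepted {} connection(s)", n);
    client.join().unwrap();
    Ok(())
}
```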
A fork shouldn't be required, you can accomplish the same thing quite easily implementing a new It's probably a good idea to fix this in hyper as well. (Specifically, the accepting that happens in It might be fair to consider if
I'm still not seeing the problem with
@carllerche That's not the contract of
As is noticed by doing
@seanmonstar that contract isn't actually accurate. The
@seanmonstar Re: the
I'm trying the

Starting from:

```rust
let server = listener.incoming().for_each(move |(sock, addr)| {
    http.bind_connection(
        &handle,
        sock,
        addr,
        Proxy {
            port: port,
            upstream_port: upstream_port,
            client: client.clone(),
        },
    );
    Ok(())
});
```

Attempt 1: just insert the `or_else()` and see what the compiler tells us:

```rust
let server = listener.incoming().or_else(|e| {
    println!("{:?}", e);
}).for_each(move |(sock, addr)| { ... });
```

I was hoping the compiler would give me a hint about the return type I have to produce in my closure, but that is not helpful. I'm not using

Attempt 2: return an empty Ok tuple as the `for_each()` does:

```rust
let server = listener.incoming().or_else(|e| {
    println!("{:?}", e);
    Ok(())
}).for_each(move |(sock, addr)| { ... });
```

Attempt 3: return an Err:

```rust
let server = listener.incoming().or_else(|e| {
    println!("{:?}", e);
    Err(e)
}).for_each(move |(sock, addr)| { ... });
```

At least it compiles! But it does not solve the problem: returning an error here bubbles up and crashes my server as before, the only difference being the additional print statement.

Attempt 4: return an empty future, assuming it does nothing and

```rust
let server = listener.incoming().or_else(|e| -> futures::future::Empty<_, std::io::Error> {
    println!("{:?}", e);
    futures::future::empty()
}).for_each(move |(sock, addr)| { ... });
```

That compiles, but as soon as the first IO error happens the server stops responding. Looking at the docs (https://docs.rs/futures/0.1.16/futures/future/struct.Empty.html), it says "A future which is never resolved." Aha, so that is probably blocking my server. So this is not really an empty future and should be renamed to "AlwaysBlockingDoingNothing".

Attempt 5: let's try the

```rust
let server = listener.incoming().for_each(...).or_else(|e| -> Result<_> {
    println!("{:?}", e);
    Ok(())
});
```

This compiles, but does not swallow the error. The server still crashes, except for the additional print statement. At this point I'm running out of ideas. How can I swallow the IO error and make the
What about filtering out

```rust
let server = listener
    .incoming()
    .then(|x| future::ok::<_, ()>(x.ok()))
    .filter_map(|x| x)
    .for_each(|(sock, addr)| {
        println!("{:?} {:?}", sock, addr);
        Ok(())
    });
```
Maybe this is not quite to the point, but the thread reminded me of https://crates.io/crates/tk-listen
@sinkuu nice idea for a workaround! Unfortunately it results in even worse behavior under the attack from above. By ignoring the errors, one CPU core jumps to 100% (even when no requests are performed anymore after the attack) and the server stops responding; connection attempts run into timeouts. I assume that happens because some cleanup after `accept()` errors is missing, which sounds very similar to what is described at https://crates.io/crates/tk-listen : "Some connection accept errors (like "connection reset") must be ignored, some (like "too many files open") may consume 100% CPU when ignored. You need to know what to do with them every time"
So it sounds like @tailhook might have figured this out in https://github.com/tailhook/tk-listen, which I will look into next.
Using tk-listen I can now mitigate the problem and the server does not crash anymore when it has only a few file descriptors with

Here is the full source for a resilient Hyper echo server:

```rust
extern crate futures;
extern crate hyper;
extern crate service_fn;
extern crate tk_listen;
extern crate tokio_core;

use futures::Stream;
use hyper::header::{ContentLength, ContentType};
use hyper::server::{Http, Response};
use service_fn::service_fn;
use std::time::Duration;
use tk_listen::ListenExt;
use tokio_core::net::TcpListener;
use tokio_core::reactor::Core;

const TEXT: &'static str = "Hello, World!";

fn main() {
    let addr = ([127, 0, 0, 1], 3000).into();
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let handle2 = core.handle();
    let http = Http::new();
    let listener = TcpListener::bind(&addr, &handle).unwrap();
    let server = listener
        .incoming()
        // Instead of crashing on accept() errors such as EMFILE, pause
        // the accept loop briefly and then try again.
        .sleep_on_error(Duration::from_millis(10), &handle2)
        .map(move |(sock, addr)| {
            let hello = service_fn(|_req| {
                Ok(Response::<hyper::Body>::new()
                    .with_header(ContentLength(TEXT.len() as u64))
                    .with_header(ContentType::plaintext())
                    .with_body(TEXT))
            });
            http.bind_connection(&handle, sock, addr, hello);
            Ok(())
        })
        // Maximum of 10,000 connections simultaneously.
        .listen(10_000);
    core.run(server).unwrap();
}
```

Now calling
While ab is in progress I can still successfully reach the server manually in my browser, so this might not be a big problem. And it is certainly an improvement over crashing the server :) @seanmonstar: what do you think of using tk-listen as a dependency in Hyper and patching server/mod.rs to do something similar?
I believe the helpers in
The tinyhttp example in tokio-core has the same vulnerability; I also filed an issue there.
An alternative solution is to use std::net::TcpListener to accept connections, as in the Tokio multi-threaded server example: https://github.com/tokio-rs/tokio-core/blob/master/examples/echo-threads.rs

Advantages:

The downside is that you have more code in your server to reason about and maintain.
@klausi As far as I can see:

... that example server will crash on error. And you need to replicate all the same error handling and sleeping to handle the load.
Sorry, I should have posted a working example. Here is a more resilient HTTP server without tk-listen:

```rust
extern crate futures;
extern crate hyper;
extern crate num_cpus;
extern crate service_fn;
extern crate tokio_core;

use std::net::{self, SocketAddr};
use std::thread;

use futures::Future;
use futures::stream::Stream;
use futures::sync::mpsc;
use hyper::header::{ContentLength, ContentType};
use hyper::server::{Http, Response};
use service_fn::service_fn;
use tokio_core::net::TcpStream;
use tokio_core::reactor::Core;

const TEXT: &'static str = "Hello, World!";

fn main() {
    let addr: SocketAddr = ([127, 0, 0, 1], 3000).into();
    let num_threads = num_cpus::get();
    let listener = net::TcpListener::bind(&addr).expect("failed to bind");
    println!("Listening on: {}", addr);

    let mut channels = Vec::new();
    for _ in 0..num_threads {
        let (tx, rx) = mpsc::unbounded();
        channels.push(tx);
        thread::spawn(|| worker(rx));
    }

    let mut next = 0;
    for socket in listener.incoming() {
        let socket = match socket {
            Ok(socket) => socket,
            // Ignore socket errors like "Too many open files" on the OS
            // level. Just continue with the next request.
            Err(_) => continue,
        };
        channels[next]
            .unbounded_send(socket)
            .expect("worker thread died");
        next = (next + 1) % channels.len();
    }
}

// Represents one worker thread of the server that receives TCP connections
// from the main server thread.
fn worker(rx: mpsc::UnboundedReceiver<std::net::TcpStream>) {
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let http = Http::<hyper::Chunk>::new();

    let done = rx.for_each(move |socket| {
        let socket = match TcpStream::from_stream(socket, &handle) {
            Ok(socket) => socket,
            Err(error) => {
                println!(
                    "Failed to read TCP stream, ignoring connection. Error: {}",
                    error
                );
                return Ok(());
            }
        };
        let addr = match socket.peer_addr() {
            Ok(addr) => addr,
            Err(error) => {
                println!(
                    "Failed to get remote address, ignoring connection. Error: {}",
                    error
                );
                return Ok(());
            }
        };
        let hello = service_fn(|_req| {
            Ok(Response::<hyper::Body>::new()
                .with_header(ContentLength(TEXT.len() as u64))
                .with_header(ContentType::plaintext())
                .with_body(TEXT))
        });
        let connection = http.serve_connection(socket, hello)
            .map(|_| ())
            .map_err(move |err| println!("server connection error: ({}) {}", addr, err));
        handle.spawn(connection);
        Ok(())
    });
    match core.run(done) {
        Ok(_) => println!("Worker tokio core run ended unexpectedly"),
        Err(_) => println!("Worker tokio core run error."),
    };
}
```

This server never crashes because it just ignores errors and keeps on running. This is more primitive than tk-listen because it does not use any timeout at all. The echo server withstands the
Yes, but it pins a whole CPU when in an error condition, which matters when you're being DoS attacked. Other than that, yes. tk-listen is just a few lines of code wrapped in futures/combinator boilerplate; obviously you can reimplement it.
```rust
for socket in listener.incoming() {
    let socket = match socket {
        Ok(socket) => socket,
        // Ignore socket errors like "Too many open files" on the OS
        // level. Just continue with the next request.
        Err(_) => continue,
    };
    // ...
}
```

Just to clarify tailhook's comment: this code right here will spin the accept loop hard, since the EMFILE error doesn't remove the socket from the acceptor's queue. You might want to sleep the thread for a few milliseconds or so there.
Thanks for the tip on high CPU load. I implemented an error check + `thread::sleep()` in the example above: https://gist.github.com/klausi/f94b9aff7d36a1cb4ebbca746f0a099f When
This is because you're running a microbenchmark. Check the following things:

Note that (2) and (3) emulate a real workload of one type or another.
Good points! By default Apache Bench does not have keep-alive enabled; the -k option enables it. Here is the command I used now for testing:
Sleeping in the request handler is a good idea to see how hard the CPU is used. When I compared the two versions with and without EMFILE timeouts, the runtime was equal but the CPU usage was a lot higher in the no-timeout version, as expected. I then played around a little with the timeout value (10ms vs. 100ms) but could not see a significant difference when running the server with a limited number of file descriptors. That got me thinking that I might be focusing too much on this limited-file-descriptor edge case, which I don't really care about. What I want to keep from this exercise is a multi-threaded Hyper server (I think we don't have an example for that yet) and a Hyper server that does not crash under heavy load. So now I'm back to using tk-listen with worker threads: https://gist.github.com/klausi/93b57d3abe2c5bc4975b0b9921f2a3c2
I started a PR to add the sleep_on_error functionality directly to the Hyper server: #1450. Reviews and feedback welcome!
Am I missing something here? How does sleeping prevent anyone from just exhausting FDs by leaving connections half open? SYN is essentially free.
The Hello world example at https://hyper.rs/ is vulnerable to denial of service attacks if the max number of allowed open file descriptors is not high enough. I was in contact with @seanmonstar already about this and he does not think this is a security issue, so I'm posting this publicly.
Steps to reproduce:

```shell
ulimit -n 50
cargo run
ab -c 100 -n 10000 http://localhost:3000/
```

That will crash the server with an IO error:

```
Io(Error { repr: Os { code: 24, message: "Too many open files" } })
```

A naive solution is to just restart the server all the time with a loop:

Which is not at all ideal because there is a short period of downtime and all connections from clients are reset.
I checked the behavior of other server software, Varnish in this case. With a low file descriptor limit it just waits until it has descriptors available before accepting connections.
Can Hyper do the same? How do you run your Hyper servers in production to prevent a server crash when file descriptors run out?