
Requests to hyper server eventually start failing. #950

Closed · jswrenn opened this issue Nov 6, 2016 · 44 comments

@jswrenn commented Nov 6, 2016

I wrote a basic server application using hyper 0.9.10 that prints the IP, referrer, and body of each request to stdout (which I then redirect into a file):

extern crate hyper;

use std::env;
use std::io::Read;

use hyper::method::Method::Post;
use hyper::server::{Server, Request, Response};
use hyper::header::{Referer, AccessControlAllowOrigin};
use hyper::net::Openssl;

fn main() {

  // Command-line arguments: bind address, certificate path, and key path.
  let mut args = env::args().skip(1);
  let addr = args.next().unwrap();
  let cert = args.next().unwrap();
  let key  = args.next().unwrap();

  let ssl = Openssl::with_cert_and_key(cert, key).unwrap();

  let _ = Server::https((addr.as_str(), 443), ssl).unwrap()
    .handle(|req: Request, mut res: Response| {
      res.headers_mut().set(AccessControlAllowOrigin::Any);
      // Reply with an empty body right away; the handler only logs.
      let _ = res.send(b"");
      if req.method == Post {
        let (addr, _, headers, _, _, mut body) = req.deconstruct();
        // Log IP, referrer, and request body as one tab-separated line.
        headers.get::<Referer>().map(|&Referer(ref referer)| {
          let mut buffer = String::new();
          let _ = body.read_to_string(&mut buffer);
          println!("{}\t{}\t{}", addr, referer, buffer);
        });
      }
    });
}

Everything works as expected initially, but some hours after the server starts, all requests to it begin to fail (and do so very slowly). When this happens, the output of time curl -k ... looks like this:

curl: (35) Unknown SSL protocol error in connection to logbook.pyret.org:443 

real	1m43.616s
user	0m0.013s
sys	0m0.000s

Restarting the server application corrects this issue and requests are handled within 100ms, tops.

I'm flummoxed.

@seanmonstar (Member)

Does this problem occur without the use of openssl?

@jswrenn (Author) commented Nov 6, 2016

I'm not sure. I have not yet managed to reproduce the problem in a short amount of time, and running the application without openssl is not possible.

EDIT: To give an idea of the timeline from start to failure: the last time this happened, the server had been running for a week and had handled about 90k requests. The incident before that occurred several days after the application was started.

@hannobraun

I'm possibly seeing the same issue. I haven't spoken up so far because I'm not sure it's Hyper-related or caused by my own code, and I haven't had the time to look into it yet. But since this issue is open now, I figured I'd chime in.

If my issue is Hyper-related at all, it definitely only occurs in connection with OpenSSL. HTTPS requests don't get a reply, while HTTP requests are handled promptly. The issue goes away after a restart.

This has been happening every 1-2 weeks since early August, and almost daily since late October. I didn't deploy a new version before the rate started picking up, so it must be related to external factors, too. I don't have traffic statistics at hand, but it can't be much: the page has near zero content and hasn't been announced anywhere, so I'd be surprised if anyone but me and the search engine bots were looking at it.

Whether the issue is caused by Hyper+OpenSSL or my own code is hard to tell without further inspection, as the only real work is done in response to HTTPS requests. An HTTP request will only return a redirect to the equivalent https:// URL.

@seanmonstar At the moment this has a very low priority for me, but if you wish, I can take the time to look into it and try to create a reduced test case. To be honest, I kinda hope this will just go away with async Hyper, so I wasn't planning on spending any effort on this, unless I need that website to go live before 0.10 lands.

@seanmonstar (Member)

Very interesting. I wonder if the usage of SslStream in the 0.9.x branch is doing something wrong.

I do hope that asynchronous IO helps in this regard, and I know that the newer tokio-tls crate that hyper will be using has had much more work done. If this bug is because of hyper's usage of OpenSSL, one way to get around it for now is to put nginx in front and terminate TLS there, before proxying requests to your hyper server.
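
For anyone taking that route, the hyper side reduces to serving plain HTTP on a loopback port for nginx to proxy to. A minimal sketch for hyper 0.9.x; the port number and handler body are just placeholders:

extern crate hyper;
use hyper::server::{Server, Request, Response};

fn main() {
  // Bind to loopback only so clients can't bypass the nginx frontend,
  // which terminates TLS and proxies to this port.
  let _ = Server::http(("127.0.0.1", 3000)).unwrap()
    .handle(|_req: Request, res: Response| {
      let _ = res.send(b"hello from behind nginx");
    });
}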

@e-oz commented Nov 10, 2016

I can confirm this issue, without using openssl. It appeared after upgrading from 0.9.10 to 0.9.11 and still exists in 0.9.12, even when hyper is used without openssl.

@e-oz commented Nov 10, 2016

More about my environment: nginx sits in front as a proxy, and most queries to the server are plain http, not https.

@e-oz commented Nov 11, 2016

Downgrading to 0.9.10 didn't help.

@e-oz commented Nov 11, 2016

Error message in the nginx log:
[error] 23607#0: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 8.my.9.ip, server: server.domain.com, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:3000/", host: "server.domain.com"

No errors from the server itself.
The connection is refused immediately, without any waiting.

@e-oz commented Nov 11, 2016

I can say it's not related to the number of requests. With 0.9.10 a restart doesn't help; with 0.9.12 a restart helps (although that's not a solution, so I'm on 0.9.9 right now).

@seanmonstar (Member)

Hm, does turning off keep-alive fix it? You can pass None to Server::keep_alive: http://hyper.rs/hyper/v0.9.12/hyper/server/struct.Server.html#method.keep_alive
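
In code, that suggestion looks roughly like this for hyper 0.9.x; a sketch assuming addr and ssl are set up as in the example at the top of this issue:

let mut server = Server::https((addr.as_str(), 443), ssl).unwrap();
// Passing None disables keep-alive entirely; Some(duration) would set a timeout instead.
server.keep_alive(None);
let _ = server.handle(|_req: Request, res: Response| {
  let _ = res.send(b"");
});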

@e-oz commented Nov 11, 2016

Switching to 0.9.9 helped - no failures for 5 hours already. Will try keep_alive over the weekend.

@e-oz commented Nov 11, 2016

Nice, now 0.9.9 just failed and won't come back up after a restart... Our main site is offline and there's nothing I can do.

@e-oz commented Nov 11, 2016

Looks like Iron doesn't give access to the underlying hyper Server, so keep-alive is not an option.

@seanmonstar (Member)

@e-oz It looks like you can pass a Timeouts config struct to Iron::listen_with: http://ironframework.io/doc/iron/struct.Timeouts.html
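
A rough sketch of that suggestion, building an Iron Timeouts value with keep-alive disabled. The field names follow the linked docs; the exact Iron::listen_with signature varies between Iron versions, so it is only indicated in the trailing comment rather than spelled out:

extern crate iron;

use std::time::Duration;
use iron::Timeouts;

fn make_timeouts() -> Timeouts {
  Timeouts {
    keep_alive: None,                     // disable keep-alive entirely
    read: Some(Duration::from_secs(30)),  // give up on stalled reads
    write: Some(Duration::from_secs(30)), // and on stalled writes
  }
}

// Then pass it to Iron::listen_with along with the address and protocol
// (check the linked docs for the exact signature in your Iron version).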

@e-oz commented Nov 11, 2016

@seanmonstar At least it started this time, thanks. I'll report back on how long it keeps working.

@hannobraun

For what it's worth, I disabled keep-alive some months ago hoping that it would help, but it didn't seem to make any difference. I'm guessing my issue is different from @e-oz's anyway, since I've never experienced a restart not helping.

@mikedilger (Contributor)

I've got multiple websites running on 0.9.11 with no issues, but I'm front-ending them with nginx: nginx handles all of the TLS and then proxies the content (hyper serves only on 127.0.0.1 so people can't bypass the nginx frontend). Also, my handler sets Connection: close if it is busy (and hyper enforces it), and that fixed the problem of clients tying up all the threads with keep-alives.

With this setup, I've had no problems. Just another data point. Also, I'm not using Iron, so there's that.
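
A minimal sketch of the Connection: close idea described above, for hyper 0.9.x. server_is_busy() is a hypothetical placeholder for whatever load check the handler actually uses:

use hyper::header::Connection;
use hyper::server::{Request, Response};

// Hypothetical load check; stands in for the handler's real notion of "busy".
fn server_is_busy() -> bool {
  false
}

fn handler(_req: Request, mut res: Response) {
  if server_is_busy() {
    // Ask hyper to close this connection after the response instead of
    // keeping it alive, so a busy server doesn't pin all its threads.
    res.headers_mut().set(Connection::close());
  }
  let _ = res.send(b"ok");
}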

@e-oz commented Nov 15, 2016

@seanmonstar It looks like switching keep-alive off solved the issue for me.

@mikedilger I use nginx as a proxy too.

@e-oz commented Nov 21, 2016

And after 8 days it failed again, during the night, without any requests.

@e-oz commented Nov 24, 2016

My sincerest apologies - I found the reason for the failures: it was a segfault in another library. I'm sorry for making assumptions without any proof.

@hannobraun

Update from me: I'm going to need that website I'm working on pretty soon now, so my motivation to figure out this issue has increased quite a bit. Unfortunately the frequency of the freezes has gone down since I saw daily freezes in early November, and I'm only seeing the problem every 1-2 weeks now. That makes it harder to figure out what's going on.

Nonetheless, I saw my first freeze today since I pushed a version with more logging. As far as I can tell, all requests that enter my handler also leave it. That indicates the problem originates in Hyper (or below), which is what I suspected anyway.

I'm going to keep investigating and will check back as I learn more.

@hannobraun

At the time of my last post, I deployed a new version with the following changes:

  • Full logging enabled
  • Set read and write timeouts
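
For reference, "set read and write timeouts" presumably means the Server-level setters in hyper 0.9.x. A minimal sketch, assuming the set_read_timeout / set_write_timeout methods on Server and reusing addr and ssl from the example at the top of the issue:

use std::time::Duration;

let mut server = Server::https((addr.as_str(), 443), ssl).unwrap();
// Give up on stalled sockets instead of blocking a worker thread forever.
server.set_read_timeout(Some(Duration::from_secs(30)));
server.set_write_timeout(Some(Duration::from_secs(30)));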

That version froze yesterday. This rules out my suspicion/hope that setting the timeouts would solve the problem.

After a careful examination of the logs I learned some more things:

  • The last successful HTTPS request left Worker::handle_connection ("keep_alive loop ending for ..." is logged)
  • All the following requests in the log are successful HTTP requests. None of the frozen requests leave a trace in the log.
  • Since all requests in the log start with "Incoming stream", I believe the handling thread gets frozen outside of Worker.

My plan now is to add more logging to my Hyper fork and deploy that later today. I'll check back once I learn more.

@hannobraun

I had another freeze. I think I can confirm that the problem is related to, maybe even caused by, OpenSSL. After the last successful HTTPS request I have in the logs, there's a partial request that stops right before calling wrap_server.

Unfortunately I made a mistake placing the log messages within my wrap_server implementation, so I can't say anything definite right now, but I do believe that the freeze is happening somewhere within SslStream::accept, which is part of the OpenSSL crate, not Hyper.

I've already fixed my broken log messages and am going to deploy that fix shortly. If that substantiates my suspicion, I'll try and upgrade the OpenSSL crate to the latest version.

@e-oz commented Dec 13, 2016

@hannobraun you can't use the latest version of the openssl crate with hyper.

@hannobraun

@e-oz I know that Hyper's stuck on 0.7 currently. I plan to upgrade my Hyper fork to the latest version.

@e-oz commented Dec 13, 2016

@hannobraun the latest version of openssl is still not perfect; I'm experiencing segfaults with it. I'm not trying to stop you, though :)

@hannobraun

@e-oz Thanks for the info. Ideally, I'd like to fix the source of the problem. If that source happens to be rust-openssl, it makes sense to first check if it still occurs in the latest version. If I can't fix it, or the end result is not satisfactory for another reason, I can still fall back to using nginx for HTTPS.

By the way, I'm also seeing segfaults, about once per month. Haven't started tracking those down yet :)

@seanmonstar (Member)

Thanks for investigating this. I've had a hyper server running at meritbadge.herokuapp.com for more than a year, and it never hangs or segfaults. However, being on heroku means I don't use SSL...

@hannobraun

@seanmonstar No need to thank me. Just doing what I have to, to get that website online. Thank you, for building something that makes it all possible in the first place!

@mikedilger (Contributor)

@hannobraun This is quite a long shot, but I've recently seen surprising behaviour where OpenSSL uses locking callbacks provided externally, and if you link to something that provides such callbacks (e.g. Python), you could accidentally block OpenSSL calls with something like a Python GIL not being unlocked: sfackler/rust-openssl#532. Like I say, a long shot, but if you are linking to Python, it's something to consider.

@hannobraun

@mikedilger Thanks for the info, although I don't think it applies to my case. I'm not aware of anything in my dependencies that would register those callbacks. Certainly no Python in there. Very interesting bug though :-)

@hannobraun

Froze again. All evidence still points to SslStream::accept. Unfortunately, I found another way my logs could be misleading. It seems unlikely at this point, but I decided to cover my bases. I just deployed another version with improved logging and will check back once it freezes again.

@hannobraun

And it froze again. My suspicions have been confirmed. The thread is definitely entering SslStream::accept and not coming out again.

I've decided to wait for #985 to resolve before testing with the latest OpenSSL. In the meantime, I'm going to try Rustls [1][2]. Depending on how that goes, I might not even go back to OpenSSL afterwards. Let's see.

@hannobraun

Now here's a surprise: 3 days into my Rustls evaluation, the process froze again. Everything looked exactly the same as it did with OpenSSL!

I don't know what this means. Maybe the issue is with Hyper after all, maybe it's with the operating system, or maybe OpenSSL and Rustls just happen to have the same bug. No idea.

Unfortunately, when I switched to Rustls, I also upgraded to the latest Hyper release, which means I lost all the extra logging from my special Hyper fork. I'll re-add the logging and will report back once I know more.

@seanmonstar (Member)

@hannobraun I'm curious whether setting read and write timeouts before passing the socket to wrap_server would help at all. It depends on whether the hang is waiting on socket IO or on some sort of Mutex deadlock.

@hannobraun

@seanmonstar Interesting, so the timeouts are set in Worker::handle_connection after the socket has been passed to wrap_server. This seems wrong to me. We might be on to something :-)

I will test whether setting the timeouts first will solve this problem, but I guess this should be fixed in any case. Do you agree, or am I missing something here?

@seanmonstar (Member)

It was just something I had never thought of, and the code in Worker doesn't make it obvious that socket IO happens before handle_connection, when really, inside HttpsConnector::connect, after the socket has been accepted, there is indeed socket IO to handle the TLS handshake.

There isn't currently a way for the NetworkConnector to know about possible timeouts set in the Server, but perhaps it makes sense to set a healthy timeout in there before wrap_server. We could say something like 10 seconds for read and write, or possibly it makes sense to update the connector in some way to allow the server to hint at a timeout.

@hannobraun

I'll look into it. At the very least I'll hack something together to verify that this causes the problem, but I'll also try to come up with a good solution.

Would you want me to create a pull request against 0.9.x pretty soon (I should be able to get it done by next week, at the latest)? Or do you think it would be better to defer this for a bit, until 0.10.x has landed?

@seanmonstar (Member)

I'm publishing 0.10.0 right now (fixing docs deployment before publishing to crates.io, but asap!). So, to keep it a non-breaking change, maybe add a new method on the NetworkConnector trait with a default implementation that ignores the timeouts.

Again, I'm not certain this will help, as it may be a deadlock somewhere else. I'm just thinking that the TLS handshake has both read and write IO, and with a blocking socket, it will just block until the other side responds or hangs up.
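
The kind of non-breaking trait addition being discussed might look roughly like the sketch below. Connector is a simplified stand-in here, not hyper's actual NetworkConnector trait, and the method name is made up for illustration:

use std::io;
use std::time::Duration;

trait Connector {
  type Stream;

  fn connect(&self, host: &str, port: u16) -> io::Result<Self::Stream>;

  // New method: the server can hint at read/write timeouts to apply to the
  // raw socket before any TLS handshake IO. The default implementation
  // ignores the hint, so existing implementors keep compiling unchanged.
  fn timeout_hint(&mut self, _read: Option<Duration>, _write: Option<Duration>) {}
}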

@ojensen5115 commented Jan 21, 2017

This may be unrelated, but I'm curious if the people experiencing this issue are using Debian / OpenSSL?

I experienced almost identical symptoms in a Python application I was working on some time ago: after an inconsistent amount of time (sometimes days, sometimes weeks) the application would simply hang, doing nothing, and the problem was essentially impossible to reproduce intentionally -- the time to failure did not seem to correlate with the number of requests, so simply bombarding the server with requests would not trigger the bug.

The underlying problem was that the Debian repo's version of OpenSSL would sometimes hang indefinitely when being called in Python's do_handshake() method. I wonder if perhaps the same underlying issue is at play here as well. For what it's worth, I ended up "fixing" it in my application by setting a timeout around the call to OpenSSL (see here).
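
The same workaround expressed in Rust terms: put read/write timeouts on the raw socket before any TLS handshake IO, so a stalled handshake fails with a timeout error instead of blocking the accepting thread forever. A sketch using only std::net; tls_accept stands in for whatever handshake call the TLS library provides:

use std::net::TcpListener;
use std::time::Duration;

fn accept_loop() -> std::io::Result<()> {
  let listener = TcpListener::bind("0.0.0.0:443")?;
  for stream in listener.incoming() {
    let stream = stream?;
    // Bound every read and write on the raw socket before handshaking.
    stream.set_read_timeout(Some(Duration::from_secs(10)))?;
    stream.set_write_timeout(Some(Duration::from_secs(10)))?;
    // let tls = tls_accept(stream)?; // the handshake can no longer hang forever
  }
  Ok(())
}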

@hannobraun

@ojensen5115 Not me. I'm using Arch Linux with OpenSSL 1.0.2j. It does sound like the same bug, though.

On a related note: I've been running the 0.10.x branch (which includes #1006) for a week now. No freezes yet. Not conclusive at all, but a good sign perhaps (it froze four times in the week before).

@hannobraun

As of today, my process has been running for 3 weeks without freezing. This is the longest confirmed uptime since I started tracking this issue in August.

While this isn't firm proof, it's a strong sign that #1006 is indeed a fix for this problem. I welcome anyone who can reproduce this issue to update to the latest release and see if it helps. I'll keep my process running for as long as I can to further confirm this has been fixed, but I may have to deploy a new version soon.

@seanmonstar (Member)

@hannobraun Excellent! Thanks for sticking through it, and reporting back so much. I'm going to close this issue, since it does seem like it is fixed. If it occurs again, we can re-open.

@hannobraun

@seanmonstar You're welcome, and thank you for your help. Glad to be rid of this problem (hopefully).
