
Requests to hyper server eventually start failing. #950

Closed · jswrenn opened this issue Nov 6, 2016 · 44 comments

@jswrenn commented Nov 6, 2016

I wrote a basic server application using hyper 0.9.10 that prints the IP, referrer, and body of each request to stdout (which I then redirect into a file):

extern crate hyper;

use std::env;
use std::io::Read;

use hyper::method::Method::Post;
use hyper::server::{Server, Request, Response};
use hyper::header::{Referer, AccessControlAllowOrigin};
use hyper::net::Openssl;

fn main() {

  // Command-line arguments: bind address, certificate path, and key path.
  let mut args = env::args().skip(1);
  let addr = args.next().unwrap();
  let cert = args.next().unwrap();
  let key  = args.next().unwrap();

  let ssl = Openssl::with_cert_and_key(cert, key).unwrap();

  let _ = Server::https((addr.as_str(), 443), ssl).unwrap()
    .handle(|req: Request, mut res: Response| {
      res.headers_mut().set(AccessControlAllowOrigin::Any);
      // Reply with an empty body right away; the handler only logs.
      let _ = res.send(b"");
      if req.method == Post {
        let (addr, _, headers, _, _, mut body) = req.deconstruct();
        // Log IP, referrer, and request body as one tab-separated line.
        headers.get::<Referer>().map(|&Referer(ref referer)| {
          let mut buffer = String::new();
          let _ = body.read_to_string(&mut buffer);
          println!("{}\t{}\t{}", addr, referer, buffer);
        });
      }
    });
}

Everything works as expected initially, but some hours after the server starts, all requests to it begin to fail (and do so very slowly). When this happens, the output of time curl -k ... looks like this:

curl: (35) Unknown SSL protocol error in connection to logbook.pyret.org:443 

real	1m43.616s
user	0m0.013s
sys	0m0.000s

Restarting the server application corrects this issue and requests are handled within 100ms, tops.

I'm flummoxed.

@seanmonstar (Member)

Does this problem occur without the use of openssl?

@jswrenn (Author) commented Nov 6, 2016

I'm not sure. I have not yet managed to reproduce the problem in a short amount of time, and running the application without openssl is not possible.

EDIT: To give an idea of the timeline from start to failure: the last time this happened, the server had been running for a week and had handled about 90k requests. The incident before that occurred several days after the application was started.

@hannobraun

I'm possibly seeing the same issue. I haven't spoken up so far because I'm not sure it's Hyper-related or caused by my own code, and I haven't had the time to look into it yet. But since this issue is open now, I figured I'd chime in.

If my issue is Hyper-related at all, it definitely only occurs in connection with OpenSSL. HTTPS requests don't get a reply, while HTTP requests are handled promptly. The issue goes away after a restart.

This has been happening every 1-2 weeks since early August, and almost daily since late October. I didn't deploy a new version before the rate started picking up, so it must be related to external factors, too. I don't have traffic statistics at hand, but it can't be much: the page has near zero content and hasn't been announced anywhere, so I'd be surprised if anyone but me and the search engine bots were looking at it.

Whether the issue is caused by Hyper+OpenSSL or my own code is hard to tell without further inspection, as the only real work is done in response to HTTPS requests. An HTTP request will only return a redirect to the equivalent https:// URL.

@seanmonstar At the moment this has a very low priority for me, but if you wish, I can take the time to look into it and try to create a reduced test case. To be honest, I kinda hope this will just go away with async Hyper, so I wasn't planning on spending any effort on this, unless I need that website to go live before 0.10 lands.

@seanmonstar (Member)

Very interesting. I wonder if the usage of SslStream in the 0.9.x branch is doing something wrong.

I do hope that asynchronous IO helps in this regard, and I know that the newer tokio-tls crate that hyper will be using has had much more work done. If this bug is because of hyper's usage of OpenSSL, one way to get around it for now is to put nginx in front and terminate TLS there, before proxying requests to your hyper server.
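
For anyone taking that route, the hyper side reduces to serving plain HTTP on a loopback port for nginx to proxy to. A minimal sketch for hyper 0.9.x; the port number and handler body are just placeholders:

extern crate hyper;
use hyper::server::{Server, Request, Response};

fn main() {
  // Bind to loopback only so clients can't bypass the nginx frontend,
  // which terminates TLS and proxies to this port.
  let _ = Server::http(("127.0.0.1", 3000)).unwrap()
    .handle(|_req: Request, res: Response| {
      let _ = res.send(b"hello from behind nginx");
    });
}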

@e-oz commented Nov 10, 2016

I can confirm this issue, without using openssl. It appeared after upgrading from 0.9.10 to 0.9.11 and still exists in 0.9.12, even when hyper is used without openssl.

@e-oz commented Nov 10, 2016

More about my environment: nginx sits in front as a proxy, and most queries to the server are plain http, not https.

@e-oz commented Nov 11, 2016

Downgrading to 0.9.10 didn't help.

@e-oz commented Nov 11, 2016

Error message in the nginx log:
[error] 23607#0: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 8.my.9.ip, server: server.domain.com, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:3000/", host: "server.domain.com"

No errors from the server itself.
The connection is refused immediately, without any waiting.

@e-oz commented Nov 11, 2016

I can say it's not related to the number of requests. With 0.9.10 a restart doesn't help; with 0.9.12 a restart helps (although that's not a solution, so I'm on 0.9.9 right now).

@seanmonstar (Member)

Hm, does turning off keep-alive fix it? You can pass None to Server::keep_alive: http://hyper.rs/hyper/v0.9.12/hyper/server/struct.Server.html#method.keep_alive
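
In code, that suggestion looks roughly like this for hyper 0.9.x; a sketch assuming addr and ssl are set up as in the example at the top of this issue:

let mut server = Server::https((addr.as_str(), 443), ssl).unwrap();
// Passing None disables keep-alive entirely; Some(duration) would set a timeout instead.
server.keep_alive(None);
let _ = server.handle(|_req: Request, res: Response| {
  let _ = res.send(b"");
});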

@e-oz commented Nov 11, 2016

Switching to 0.9.9 helped - no failures for 5 hours already. Will try keep_alive over the weekend.

@e-oz commented Nov 11, 2016

Nice, now 0.9.9 just failed and won't come back up after a restart... Our main site is offline and there's nothing I can do.

@e-oz commented Nov 11, 2016

Looks like Iron doesn't give access to the underlying hyper Server, so keep-alive is not an option.

@seanmonstar (Member)

@e-oz It looks like you can pass a Timeouts config struct to Iron::listen_with: http://ironframework.io/doc/iron/struct.Timeouts.html
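
A rough sketch of that suggestion, building an Iron Timeouts value with keep-alive disabled. The field names follow the linked docs; the exact Iron::listen_with signature varies between Iron versions, so it is only indicated in the trailing comment rather than spelled out:

extern crate iron;

use std::time::Duration;
use iron::Timeouts;

fn make_timeouts() -> Timeouts {
  Timeouts {
    keep_alive: None,                     // disable keep-alive entirely
    read: Some(Duration::from_secs(30)),  // give up on stalled reads
    write: Some(Duration::from_secs(30)), // and on stalled writes
  }
}

// Then pass it to Iron::listen_with along with the address and protocol
// (check the linked docs for the exact signature in your Iron version).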

@e-oz commented Nov 11, 2016

@seanmonstar At least it started this time, thanks. I'll report back on how long it keeps working.

@hannobraun

For what it's worth, I disabled keep-alive some months ago hoping that it would help, but it didn't seem to make any difference. I'm guessing my issue is different from @e-oz's anyway, since I've never experienced a restart not helping.

@mikedilger (Contributor)

I've got multiple websites running on 0.9.11 with no issues, but I'm front-ending them with nginx: nginx handles all of the TLS and then proxies the content (hyper serves only on 127.0.0.1 so people can't bypass the nginx frontend). Also, my handler sets Connection: close if it is busy (and hyper enforces it), and that fixed the problem of clients tying up all the threads with keep-alives.

With this setup, I've had no problems. Just another data point. Also, I'm not using Iron, so there's that.
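
A minimal sketch of the Connection: close idea described above, for hyper 0.9.x. server_is_busy() is a hypothetical placeholder for whatever load check the handler actually uses:

use hyper::header::Connection;
use hyper::server::{Request, Response};

// Hypothetical load check; stands in for the handler's real notion of "busy".
fn server_is_busy() -> bool {
  false
}

fn handler(_req: Request, mut res: Response) {
  if server_is_busy() {
    // Ask hyper to close this connection after the response instead of
    // keeping it alive, so a busy server doesn't pin all its threads.
    res.headers_mut().set(Connection::close());
  }
  let _ = res.send(b"ok");
}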

@e-oz commented Nov 15, 2016

@seanmonstar It looks like switching keep-alive off solved the issue for me.

@mikedilger I use nginx as a proxy too.

@e-oz commented Nov 21, 2016

And after 8 days it failed again, during the night, without any requests.

@e-oz commented Nov 24, 2016

My sincerest apologies - I found the reason for the failures: it was a segfault in another library. I'm sorry for making assumptions without any proof.

@hannobraun

Update from me: I'm going to need that website I'm working on pretty soon now, so my motivation to figure out this issue has increased quite a bit. Unfortunately the frequency of the freezes has gone down since I saw daily freezes in early November, and I'm only seeing the problem every 1-2 weeks now. That makes it harder to figure out what's going on.

Nonetheless, I saw my first freeze today since I pushed a version with more logging. As far as I can tell, all requests that enter my handler also leave it. That indicates the problem originates in Hyper (or below), which is what I suspected anyway.

I'm going to keep investigating and will check back as I learn more.

@hannobraun

At the time of my last post, I deployed a new version with the following changes:

  • Full logging enabled
  • Set read and write timeouts
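
For reference, "set read and write timeouts" presumably means the Server-level setters in hyper 0.9.x. A minimal sketch, assuming the set_read_timeout / set_write_timeout methods on Server and reusing addr and ssl from the example at the top of the issue:

use std::time::Duration;

let mut server = Server::https((addr.as_str(), 443), ssl).unwrap();
// Give up on stalled sockets instead of blocking a worker thread forever.
server.set_read_timeout(Some(Duration::from_secs(30)));
server.set_write_timeout(Some(Duration::from_secs(30)));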

That version froze yesterday. This rules out my suspicion/hope that setting the timeouts would solve the problem.

After a careful examination of the logs I learned some more things:

  • The last successful HTTPS request left Worker::handle_connection ("keep_alive loop ending for ..." is logged)
  • All the following requests in the log are successful HTTP requests. None of the frozen requests leave a trace in the log.
  • Since all requests in the log start with "Incoming stream", I believe the handling thread gets frozen outside of Worker.

My plan now is to add more logging to my Hyper fork and deploy that later today. I'll check back once I learn more.

@hannobraun

I had another freeze. I think I can confirm that the problem is related to, maybe even caused by, OpenSSL. After the last successful HTTPS request I have in the logs, there's a partial request that stops right before calling wrap_server.

Unfortunately I made a mistake placing the log messages within my wrap_server implementation, so I can't say anything definite right now, but I do believe that the freeze is happening somewhere within SslStream::accept, which is part of the OpenSSL crate, not Hyper.

I've already fixed my broken log messages and am going to deploy that fix shortly. If that substantiates my suspicion, I'll try and upgrade the OpenSSL crate to the latest version.

@e-oz commented Dec 13, 2016

@hannobraun you can't use the latest version of the openssl crate with hyper.

@hannobraun

@e-oz I know that Hyper's stuck on 0.7 currently. I plan to upgrade my Hyper fork to the latest version.

@e-oz commented Dec 13, 2016

@hannobraun the latest version of openssl is still not perfect; I'm experiencing segfaults with it. I'm not trying to stop you, though :)

@hannobraun

@e-oz Thanks for the info. Ideally, I'd like to fix the source of the problem. If that source happens to be rust-openssl, it makes sense to first check if it still occurs in the latest version. If I can't fix it, or the end result is not satisfactory for another reason, I can still fall back to using nginx for HTTPS.

By the way, I'm also seeing segfaults, about once per month. Haven't started tracking those down yet :)

@seanmonstar (Member)

Thanks for investigating this. I've had a hyper server running at meritbadge.herokuapp.com for more than a year, and it never hangs or segfaults. However, being on heroku means I don't use SSL...

@hannobraun

@seanmonstar No need to thank me. Just doing what I have to, to get that website online. Thank you, for building something that makes it all possible in the first place!

@mikedilger (Contributor)

@hannobraun This is quite a long shot, but I've recently seen surprising behaviour where OpenSSL uses locking callbacks provided externally, and if you link to something that provides such callbacks (e.g. Python), you could accidentally block OpenSSL calls with something like a Python GIL not being unlocked: sfackler/rust-openssl#532. Like I say, a long shot, but if you are linking to Python, it's something to consider.

@hannobraun

@mikedilger Thanks for the info, although I don't think it applies to my case. I'm not aware of anything in my dependencies that would register those callbacks. Certainly no Python in there. Very interesting bug though :-)

@hannobraun

Froze again. All evidence still points to SslStream::accept. Unfortunately, I found another way my logs could be misleading. It seems unlikely at this point, but I decided to cover my bases. I just deployed another version with improved logging and will check back once it freezes again.

@hannobraun

And it froze again. My suspicions have been confirmed. The thread is definitely entering SslStream::accept and not coming out again.

I've decided to wait for #985 to resolve before testing with the latest OpenSSL. In the meantime, I'm going to try Rustls [1][2]. Depending on how that goes, I might not even go back to OpenSSL afterwards. Let's see.

@hannobraun

Now here's a surprise: 3 days into my Rustls evaluation, the process froze again. Everything looked exactly the same as it did with OpenSSL!

I don't know what this means. Maybe the issue is with Hyper after all, maybe it's with the operating system, or maybe OpenSSL and Rustls just happen to have the same bug. No idea.

Unfortunately, when I switched to Rustls, I also upgraded to the latest Hyper release, which means I lost all the extra logging from my special Hyper fork. I'll re-add the logging and will report back once I know more.

@seanmonstar (Member)

@hannobraun I'm curious whether setting read and write timeouts before passing the socket to wrap_server would help at all. It depends on whether the hang is waiting on socket IO or on some sort of Mutex deadlock.

@hannobraun

@seanmonstar Interesting, so the timeouts are set in Worker::handle_connection after the socket has been passed to wrap_server. This seems wrong to me. We might be on to something :-)

I will test whether setting the timeouts first will solve this problem, but I guess this should be fixed in any case. Do you agree, or am I missing something here?

@seanmonstar (Member)

It was just something I had never thought of, and the code in Worker doesn't make it obvious that socket IO happens before handle_connection, when really, inside HttpsConnector::connect, after the socket has been accepted, there is indeed socket IO to handle the TLS handshake.

There isn't currently a way for the NetworkConnector to know about possible timeouts set in the Server, but perhaps it makes sense to set a healthy timeout in there before wrap_server. We could say something like 10 seconds for read and write, or possibly it makes sense to update the connector in some way to allow the server to hint at a timeout.

@hannobraun

I'll look into it. At the very least I'll hack something together to verify that this causes the problem, but I'll also try to come up with a good solution.

Would you want me to create a pull request against 0.9.x pretty soon (I should be able to get it done by next week, at the latest)? Or do you think it would be better to defer this for a bit, until 0.10.x has landed?

@seanmonstar (Member)

I'm publishing 0.10.0 right now (fixing docs deployment before publishing to crates.io, but asap!). So, to keep it a non-breaking change, maybe add a new method on the NetworkConnector trait with a default implementation that ignores the timeouts.

Again, I'm not certain this will help, as it may be a deadlock somewhere else. I'm just thinking that the TLS handshake has both read and write IO, and with a blocking socket, it will just block until the other side responds or hangs up.
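
The kind of non-breaking trait addition being discussed might look roughly like the sketch below. Connector is a simplified stand-in here, not hyper's actual NetworkConnector trait, and the method name is made up for illustration:

use std::io;
use std::time::Duration;

trait Connector {
  type Stream;

  fn connect(&self, host: &str, port: u16) -> io::Result<Self::Stream>;

  // New method: the server can hint at read/write timeouts to apply to the
  // raw socket before any TLS handshake IO. The default implementation
  // ignores the hint, so existing implementors keep compiling unchanged.
  fn timeout_hint(&mut self, _read: Option<Duration>, _write: Option<Duration>) {}
}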

@ojensen5115 commented Jan 21, 2017

This may be unrelated, but I'm curious if the people experiencing this issue are using Debian / OpenSSL?

I experienced almost identical symptoms in a Python application I was working on some time ago: after an inconsistent amount of time (sometimes days, sometimes weeks) the application would simply hang, doing nothing, and the problem was essentially impossible to reproduce intentionally -- the time to failure did not seem to correlate with the number of requests, so simply bombarding the server with requests would not trigger the bug.

The underlying problem was that the Debian repo's version of OpenSSL would sometimes hang indefinitely when being called in Python's do_handshake() method. I wonder if perhaps the same underlying issue is at play here as well. For what it's worth, I ended up "fixing" it in my application by setting a timeout around the call to OpenSSL (see here).
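
The same workaround expressed in Rust terms: put read/write timeouts on the raw socket before any TLS handshake IO, so a stalled handshake fails with a timeout error instead of blocking the accepting thread forever. A sketch using only std::net; tls_accept stands in for whatever handshake call the TLS library provides:

use std::net::TcpListener;
use std::time::Duration;

fn accept_loop() -> std::io::Result<()> {
  let listener = TcpListener::bind("0.0.0.0:443")?;
  for stream in listener.incoming() {
    let stream = stream?;
    // Bound every read and write on the raw socket before handshaking.
    stream.set_read_timeout(Some(Duration::from_secs(10)))?;
    stream.set_write_timeout(Some(Duration::from_secs(10)))?;
    // let tls = tls_accept(stream)?; // the handshake can no longer hang forever
  }
  Ok(())
}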

@hannobraun

@ojensen5115 Not me. I'm using Arch Linux with OpenSSL 1.0.2j. It does sound like the same bug, though.

On a related note: I've been running the 0.10.x branch (which includes #1006) for a week now. No freezes yet. Not conclusive at all, but a good sign perhaps (it froze four times in the week before).

@hannobraun

As of today, my process has been running for 3 weeks without freezing. This is the longest confirmed uptime since I started tracking this issue in August.

While this isn't firm proof, it's a strong sign that #1006 is indeed a fix for this problem. I welcome anyone who can reproduce this issue to update to the latest release and see if it helps. I'll keep my process running for as long as I can to further confirm this has been fixed, but I may have to deploy a new version soon.

@seanmonstar (Member)

@hannobraun Excellent! Thanks for sticking through it, and reporting back so much. I'm going to close this issue, since it does seem like it is fixed. If it occurs again, we can re-open.

@hannobraun

@seanmonstar You're welcome, and thank you for your help. Glad to be rid of this problem (hopefully).
