Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent HTTP::TimeoutError's (Read) on 1.0.x but not on 0.9.x #298

Closed
javierjulio opened this issue Jan 19, 2016 · 39 comments
Closed

Frequent HTTP::TimeoutError's (Read) on 1.0.x but not on 0.9.x #298

javierjulio opened this issue Jan 19, 2016 · 39 comments

Comments

@javierjulio
Copy link

I've been using the 0.9.x releases for a few months now and have since upgraded to 1.0.0 and then 1.0.2. I've noticed several HTTP::TimeoutError's on 1.0.x locally but I get it almost every single time I deploy a gem that runs integration tests on CircleCI (Ubuntu). If I update the two gems I've created that depend on http and downgrade it back to add_dependency "http", '~> 0.9' all my integration tests that are making live HTTP requests consistently pass without a single timeout error. At first it seemed that the 1.0.2 release was helping but the timeout errors still occur. I'm trying to debug the issue further but figured in the meantime I'd post this and see if anyone else has had a similar issue when upgrading to 1.0.x. As I learn more I'll report back.

For configuring timeouts I went with a default of { write: 2, connect: 5, read: 10 }. Increasing the read timeout value doesn't help. Note that this is per request. While I don't have stats I know that requests don't take 10 seconds each (just the read part fails). The same timeouts are used for http 0.9.x and all integration tests pass with no frequent timeout errors.

I use http gem in the following JSON API client gems: synapse_payments and paysafe.

Using Ruby 2.2.3 in app and each gem using http gem.

@zanker
Copy link
Contributor

zanker commented Jan 19, 2016

How long does the request normally take, and how long before it times out? Also what version of Ruby?

There's some changes @tarcieri made around the API calls we use for timeouts, but they should work identical to before. I could believe there's a small rounding change where if you were cutting it close on timeouts, but given you said increasing it doesn't help, it's probably not that.

@tarcieri
Copy link
Member

@javierjulio can you provide any sort of timeline of the events that are happening and how those don't line up with your described timeout model?

@javierjulio
Copy link
Author

@zanker I don't know how long requests normally take, since I'm making requests to a sandbox service I'd say about a second or so each. Note that the timeouts are for each request (per operation). I've updated the description with this info and Ruby version (2.2.3). Thanks!

@tarcieri to be honest I don't know what the best timeouts are, I just went with what shown in the README as it seemed better than nothing. I figured I would just come back and fine tune it later. What I do know though is after upgrading to 1.0.x I consistently received HTTP::TimeoutError (Read) errors but I'm not seeing it on 0.9.x. I just deployed 3 separate branches to CircleCI that runs a fair amount of integration tests and all passed without timeout errors. Before it happened on each deploy. Having a hard time figuring out how I can get you more information. If I find out more if you have a suggestion let me know and I'll do my best to get it to you ASAP. Thanks!

@tarcieri
Copy link
Member

@javierjulio more generally, do you think you're getting a HTTP::TimeoutError sooner than you should? I guess my question is: what's misbehaving? What do you expect, and what's actually happening?

@javierjulio
Copy link
Author

@tarcieri that's a great question, I'm not sure. I don't believe so. I'm not expecting it to timeout. It hasn't before and when downgrading its fine. At first I thought it was service related but the third party service has been up and no reported issues any time I check. I started to suspect http since I upgrade dependencies frequently. Since I've been on 1.0.x I've had a few requests fail due to timeout errors consistently. The requests vary. I don't have more info sadly. Sorry. 😩 If I find out more I'll provide it or if any questions I can answer.

@ixti
Copy link
Member

ixti commented Jan 20, 2016

We experienced same issue. So we had to downgrade to 0.9.
It fails WAY sooner for us :connection => 15 and it fails instantly...
As soon as I will grab details on issue to be able reproduce - will add it here.

@ixti
Copy link
Member

ixti commented Jan 20, 2016

Was unable to 100% simple reproducible variant.
But following one can be used to reproduce:

# reproducible on ruby 2.1.8

url = "https://play.google.com/store/apps/" \
      "category/GAMES/collection/top?start=400&num=100&hl=en&gl=us"

42.times do
  start = Time.now.to_f
  begin
    HTTP.timeout(:read => 5, :connect => 5, :write => 5).get(url).flush
  rescue
    puts "Failed after: #{Time.now.to_f - start}"
    raise
  end
end

@ixti
Copy link
Member

ixti commented Jan 20, 2016

Fails for me with:

Failed after: 0.14695525169372559
HTTP::TimeoutError: Read timed out after 5 seconds
from /home/ixti/.gem/ruby/2.1.8/gems/http-1.0.2/lib/http/timeout/per_operation.rb:75:in `block in readpartial'

@ixti
Copy link
Member

ixti commented Jan 20, 2016

The snippet abbove reproduces error on ruby 2.1.8, 2.2.4, but not on 2.3.0.

@javierjulio
Copy link
Author

@ixti awesome thanks! I was thinking I might be crazy. Good to know its reproducible elsewhere. I just took your same code snippet, ran it on Ruby 2.2.3 using HTTP 1.0.2 and immediately replicated the issue. It fails right away. The output from the first failure is below. I ran the snippet four more times and each one failed consistently for me.

Failed after: 0.0874330997467041
HTTP::TimeoutError: Read timed out after 5 seconds
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/timeout/per_operation.rb:75:in `block in readpartial'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/timeout/per_operation.rb:61:in `loop'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/timeout/per_operation.rb:61:in `readpartial'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/connection.rb:213:in `read_more'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/connection.rb:83:in `readpartial'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/response/body.rb:41:in `to_s'
    from /Users/javierjulio/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/http-1.0.2/lib/http/response.rb:79:in `flush'

@tarcieri
Copy link
Member

@ixti @javierjulio thanks for the repro. Will look into it

@ixti
Copy link
Member

ixti commented Jan 20, 2016

Just to explain a bit. It fails pretty randomly:

url = "https://play.google.com/store/apps/" \
      "category/GAMES/collection/top?start=400&num=100&hl=en&gl=us"

42.times do |i|
  start = Time.now.to_f

  print format("%3d ... ", i + 1)
  begin
    HTTP.timeout(:read => 5, :connect => 5, :write => 5).get(url).flush
    puts "pass (#{Time.now.to_f - start})"
  rescue
    puts "fail (#{Time.now.to_f - start})"
    raise
  end
end

It may fail at any cycle. Absolutely randomly. So if you will remove raise from this snippet, results will look like:

  1 ... pass (0.13664865493774414)
  2 ... pass (0.15675663948059082)
  3 ... pass (0.1458735466003418)
  4 ... fail (0.13687992095947266)
  5 ... pass (0.13691043853759766)
  6 ... pass (0.13472723960876465)
  7 ... pass (0.1393880844116211)
  8 ... pass (0.14629626274108887)
  9 ... pass (0.14220976829528809)
 10 ... fail (0.16389107704162598)
 11 ... pass (0.14554977416992188)
 12 ... pass (0.1415846347808838)
 13 ... pass (0.13765287399291992)
 14 ... pass (0.14561843872070312)
 15 ... fail (0.1480247974395752)
 16 ... pass (0.17023563385009766)
 17 ... fail (0.1748495101928711)
 18 ... fail (0.15146231651306152)
 19 ... pass (0.14230012893676758)
 20 ... pass (0.14637160301208496)
 21 ... pass (0.1373133659362793)
 22 ... fail (0.16868877410888672)
 23 ... pass (0.13465023040771484)
 24 ... pass (0.14666128158569336)
 25 ... pass (0.16004157066345215)
 26 ... pass (0.15309667587280273)
 27 ... pass (0.1454617977142334)
 28 ... fail (0.1554243564605713)
 29 ... pass (0.14315485954284668)
 30 ... pass (0.15085339546203613)
 31 ... pass (0.14633703231811523)
 32 ... pass (0.1460552215576172)
 33 ... fail (0.13860630989074707)
 34 ... pass (0.16732072830200195)
 35 ... pass (0.14673137664794922)
 36 ... pass (0.2079026699066162)
 37 ... pass (0.1494007110595703)
 38 ... pass (0.14219236373901367)
 39 ... pass (0.13936614990234375)
 40 ... fail (0.13872718811035156)
 41 ... fail (0.14003658294677734)
 42 ... pass (0.143571138381958)

@ixti
Copy link
Member

ixti commented Jan 20, 2016

Notice that each cycle is pretty same in timing. So it has nothing to do with actual timing.
Also, rescue_readable of Null timeout and readpartial of PerOperation (child of Null)
seems like might to affect each other:

# https://github.com/httprb/http/blob/master/lib/http/timeout/null.rb#L78-L83
class Null
  def rescue_readable
    yield
  rescue IO::WaitReadable
    retry if @socket.to_io.wait_readable(read_timeout)
    raise TimeoutError, "Read timed out after #{read_timeout} seconds"
  end
end

# https://github.com/httprb/http/blob/master/lib/http/timeout/per_operation.rb#L59-L78
class PerOperation < Null
  def readpartial(size)
    loop do
      # JRuby may still raise exceptions on SSL sockets even though
      # we explicitly specify `:exception => false`
      result = rescue_readable do
        @socket.read_nonblock(size, :exception => false)
      end

      if result.nil?
        return :eof
      elsif result != :wait_readable
        return result
      end

      unless @socket.to_io.wait_readable(read_timeout)
        fail TimeoutError, "Read timed out after #{read_timeout} seconds"
      end
    end
  end
end

@tarcieri
Copy link
Member

@ixti @javierjulio if this was working on an earlier version and regressed, can either of you do a git bisect and try to find the commit that introduced the bug?

@ixti
Copy link
Member

ixti commented Jan 21, 2016

@tarcieri sure.

f4df840f104b96a4e46500e6b595691f83ee8573 is the first bad commit
commit f4df840f104b96a4e46500e6b595691f83ee8573
Author: Tony Arcieri <bascule@gmail.com>
Date:   Fri Dec 25 20:54:23 2015 -0800

    Use io/wait on platforms where it's available

    IO.select is an expensive call, because it has to build an fd_set (large
    bitfield) from an array each time we invoke it.

    We're just interested in IO readiness, so on Ruby 2.0+ we can use the io/wait
    API instead, which adds wait_readable/wait_writable to IO objects.

:100644 100644 aec81d476714d4bd1edba1da786c26774b82b997 e197d7ac9d8cbc41a99956b05054ff9c6f566941 M  .rubocop.yml
:040000 040000 b56dd8a64ff68fe4048bca997f1476a7073d0d13 d66955cb1686a5d387e8a04c41b266eb3b79b52e M  lib

@tarcieri
Copy link
Member

@ixti joy! Since you have it reproed, mind playing with it a bit more?

That commit changed all 3 types of timeouts (global.rb, null.rb, and per_operation.rb). Can you try reverting the individual files there back to their previous versions and see if you can further narrow down what's causing the problem?

@ixti
Copy link
Member

ixti commented Jan 21, 2016

@tarcieri sure. :D

@ixti
Copy link
Member

ixti commented Jan 21, 2016

But I won;t be able to till weekend. Just for the record.

@tarcieri
Copy link
Member

No worries

@tarcieri
Copy link
Member

@ixti well, I'm thinking I should revert this entirely unless we can narrow it down

@ixti
Copy link
Member

ixti commented Jan 31, 2016

@tarcieri I've pushed a patch that makes snippet from the above work on all rubies. Also it passes all tests...

@tarcieri
Copy link
Member

@ixti odd... that is something that sticks out from my changes to the original code. The comment is actually documenting why it's using rescue_readable and rescue_writable despite the :exception => false

We may not have sufficient coverage for EAGAIN cases for SSL (they're rather difficult to test)

@ixti
Copy link
Member

ixti commented Jan 31, 2016

After a discussion over an IRC, I agree that proposed patch is terribly wrong.

@ixti
Copy link
Member

ixti commented Jan 31, 2016

After my attempt to simplify code in order to find a better fix, it happened that it requires more thoughtful code reading. I will try to find time over a week to come with a better fix.

Meanwhile, @tarcieri if you're OK to revert that particular commit, probably that's the best option.

If anybody wanna jump on fixing this, please do so! :D

@javierjulio
Copy link
Author

@tarcieri I saw the revert in master so I installed http from master locally and in a console ran the same script @ixti wrote. I fooled around with the timeouts too but I still keep getting frequent read errors, in fact more so than what we showed here. I made a small adjustment to the script to be sure the timeout is on read and it is. Although I could very well be doing something wrong. I did verify my install was using the master branch as it printed out the last git commit sha.

Using http 1.0.2 from https://github.com/httprb/http.git (at master@8f85d95)

The script:

url = "https://play.google.com/store/apps/" \
      "category/GAMES/collection/top?start=400&num=100&hl=en&gl=us"

42.times do |i|
  start = Time.now.to_f

  print format("%3d ... ", i + 1)
  begin
    HTTP.timeout(:read => 10, :connect => 5, :write => 5).get(url).flush
    puts "pass (#{Time.now.to_f - start})"
  rescue HTTP::Error => error
    puts "fail #{error} (#{Time.now.to_f - start})"
  end
end

The output:

  1 ... fail Read timed out after 10 seconds (0.09555983543395996)
  2 ... pass (0.09866118431091309)
  3 ... pass (0.1078948974609375)
  4 ... fail Read timed out after 10 seconds (0.10645508766174316)
  5 ... fail Read timed out after 10 seconds (0.11573100090026855)
  6 ... fail Read timed out after 10 seconds (0.10515403747558594)
  7 ... fail Read timed out after 10 seconds (0.09684395790100098)
  8 ... fail Read timed out after 10 seconds (0.10500097274780273)
  9 ... pass (0.15207505226135254)
 10 ... pass (0.11043787002563477)
 11 ... fail Read timed out after 10 seconds (0.0944669246673584)
 12 ... pass (0.10115599632263184)
 13 ... fail Read timed out after 10 seconds (0.10858893394470215)
 14 ... fail Read timed out after 10 seconds (0.09439206123352051)
 15 ... fail Read timed out after 10 seconds (0.13071513175964355)
 16 ... fail Read timed out after 10 seconds (0.13274002075195312)
 17 ... pass (0.09566402435302734)
 18 ... fail Read timed out after 10 seconds (0.11975598335266113)
 19 ... pass (0.09542608261108398)
 20 ... pass (0.09283804893493652)
 21 ... pass (0.11658596992492676)
 22 ... pass (0.31825733184814453)
 23 ... fail Read timed out after 10 seconds (0.09906291961669922)
 24 ... pass (0.10399913787841797)
 25 ... fail Read timed out after 10 seconds (0.09821271896362305)
 26 ... pass (0.08867406845092773)
 27 ... pass (0.08523297309875488)
 28 ... fail Read timed out after 10 seconds (0.09504318237304688)
 29 ... fail Read timed out after 10 seconds (0.1117861270904541)
 30 ... pass (0.11092257499694824)
 31 ... fail Read timed out after 10 seconds (0.10589003562927246)
 32 ... pass (0.09644794464111328)
 33 ... fail Read timed out after 10 seconds (0.10133981704711914)
 34 ... pass (0.096466064453125)
 35 ... fail Read timed out after 10 seconds (0.09136009216308594)
 36 ... pass (0.09183812141418457)
 37 ... fail Read timed out after 10 seconds (0.08922100067138672)
 38 ... pass (0.11403322219848633)
 39 ... pass (0.09183883666992188)
 40 ... pass (0.09493303298950195)
 41 ... fail Read timed out after 10 seconds (0.10203123092651367)
 42 ... pass (0.10525012016296387)

@tarcieri
Copy link
Member

@javierjulio it wasn't a full revert... I was trying to preserve the new changes to support an IO.wait backend.

Can you try bisecting yourself and seeing if you get the same commit as @ixti? f4df840

I can try a full revert of this commit next if so...

@javierjulio
Copy link
Author

@tarcieri oops sorry about that!

tarcieri added a commit that referenced this issue Mar 19, 2016
This reverts commit f4df840.

Conflicts:
	lib/http/timeout/null.rb
	lib/http/timeout/per_operation.rb
@tarcieri
Copy link
Member

@javierjulio what Ruby was that on? I can't reproduce your problem

@tarcieri
Copy link
Member

I did a full revert in #320 and released that as 1.0.4. That should hopefully fix your problem.

@tarcieri
Copy link
Member

Considering this solved for now. Please reopen if you disagree.

@ixti
Copy link
Member

ixti commented Mar 20, 2016

Not that I disagree... But unfortunately snippet still fails for me on ruby < 2.3.0:

chruby | sed 's/^[^jr]*//' | while read x; do
  chruby $x && echo "== $(ruby -v)"
  ruby -Ilib -rhttp snippet.rb | cut -b8-12 | sort | uniq -c
done

outputs

== jruby 9.0.5.0 (2.2.3) 2016-01-26 7bee00d Java HotSpot(TM) 64-Bit Server VM 25.74-b02 on 1.8.0_74-b02 +jit [linux-amd64]
     42  pass
== ruby 2.1.8p440 (2015-12-16 revision 53160) [x86_64-linux]
      8  fail
     34  pass
== ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]
      6  fail
     36  pass
== ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux]
     42  pass

@ixti
Copy link
Member

ixti commented Mar 20, 2016

Snippet was executed against current master.

@tarcieri
Copy link
Member

@ixti can you repro against 1.0.4 too?

@tarcieri
Copy link
Member

Ok, I'm able to reproduce the problem on Ruby 2.2 for master, although 1-x-stable seems fine for me.

@tarcieri
Copy link
Member

I think this actually fixes the problem: #322

tarcieri added a commit that referenced this issue Mar 21, 2016
Revert to IO.select for PerOperation timeouts (fixes #298)
@tarcieri
Copy link
Member

@ixti @javierjulio please double check, but I think both master and 1-x-stable are fixed now.

@javierjulio
Copy link
Author

@tarcieri sorry, I wasn't able to look at this sooner. I just tried this on Ruby 2.2.3 (Mac OS X latest) with http from 1-x-stable and master with multiple script runs and I got a 100% pass rate. No failures at all. Switched back to 1.0.2 to verify the failures still occurred and they did. Bug is squashed! Thanks! :shipit:

@tarcieri
Copy link
Member

Woop! The root cause is still mysterious, but I'm glad we got this narrowed down and resolved

@ixti
Copy link
Member

ixti commented Mar 21, 2016

I confirm that master as well as 1-x-stable work as expected on my laptop with JRuby 9.0.5.0, MRI 2.1.8p440, 2.2.4p230 and 2.3.0p0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants