
Add ConnectionPool connect_timeout and idle_timeout #215

Merged (8 commits, Mar 20, 2018)

Conversation

samoconnor (Contributor):

No description provided.

isdeletable(c) = (!isopen(c.io) || (c.idle_timeout > 0 &&
                                    !c.readbusy &&
                                    !c.writebusy &&
                                    time()-c.timestamp > c.idle_timeout)) &&

samoconnor (Contributor, author):

need to call close(c.io) here!
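A hedged sketch of what that fix might look like, folding the close into the idle check (illustrative only; the final code in the PR may differ):

# Close a connection that has been idle past idle_timeout, then report
# closed connections as deletable.
function isdeletable(c)
    if isopen(c.io) && c.idle_timeout > 0 && !c.readbusy && !c.writebusy &&
       time() - c.timestamp > c.idle_timeout
        close(c.io)
    end
    return !isopen(c.io)
end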

quinnj (Member) commented Mar 4, 2018:

Do you think we could somehow keep our own list of known host timeouts and use those defaults per-host? I'm just thinking about the excellent sleuthing you did in #214, like, we could have a github.com default idle/connect timeout, maybe if we detect a specific server type in response headers. Hmmm....maybe too much work to be worth it, but you'd know a little better w/ all the research you've done on this lately. I'm just thinking it might be nice for users to not have to worry about it in some common cases.
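A purely illustrative sketch of the per-host-defaults idea being floated here; neither the table nor the helper exists in HTTP.jl:

# Hypothetical per-host tuning table: (connect_timeout, idle_timeout) in seconds.
const HOST_TIMEOUT_DEFAULTS = Dict(
    "github.com" => (10, 50),
)

# Fall back to 0 ("no timeout") for hosts we know nothing about.
default_timeouts(host) = get(HOST_TIMEOUT_DEFAULTS, host, (0, 0))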

# FIXME this currently causes the Julia runtime to hang on exit...
result = Ref{TCPSocket}()
error = Ref{Any}()
@schedule try
samoconnor (Contributor, author):
@vtjnash I'd like your opinion on this.
Should we include a best-attempt connection timeout feature, or should we just say for now that the underlying infrastructure makes this hard to implement, and that users should do their own @async timeout thing if they really need a connect timeout?
JuliaCloud/AWSCore.jl#24 (comment)

vtjnash (Member):
It seems reasonable to add, I just recommended using close to do it "properly" (I don't know why the libuv developers claimed in the linked issue that you couldn't cause ECANCELED, since I clearly can):

julia> s = Sockets.TCPSocket()
Main.Sockets.TCPSocket(RawFD(0xffffffff) init, 0 bytes waiting)

julia> Sockets.connect!(s, Sockets.ip"192.168.101.1", 80)

julia> @async @show Sockets.wait_connected(s)

julia> close(s)

julia>
ERROR (unhandled task failure): connect: operation canceled (ECANCELED)

Around this, you just need to add a bit of logic to coordinate closing the socket if the timeout was reached and the socket is still s.status == StatusConnecting. And possibly also to loop until the timeout is reached (as mentioned in the linked issue, the kernel has its own idea of what the default timeout should be), if you think 2 minutes is too short.
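A minimal sketch of that coordination, reusing the Sockets names from the REPL session above; connect_timeout is a hypothetical value in seconds and error handling is elided:

s = Sockets.TCPSocket()
Sockets.connect!(s, Sockets.ip"192.168.101.1", 80)
@async begin
    sleep(connect_timeout)
    # Still connecting when the timer fires? Give up and close the socket.
    s.status == Base.StatusConnecting && close(s)
end
Sockets.wait_connected(s)  # throws ECANCELED if the timer closed the socket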

This also uncovers a minor bug in Base, since this assert is wrong:

diff --git a/stdlib/Sockets/src/Sockets.jl b/stdlib/Sockets/src/Sockets.jl
index ec7876a917..e947da8fd4 100644
--- a/stdlib/Sockets/src/Sockets.jl
+++ b/stdlib/Sockets/src/Sockets.jl
@@ -371,12 +371,13 @@ end
 function uv_connectcb(conn::Ptr{Cvoid}, status::Cint)
     hand = ccall(:jl_uv_connect_handle, Ptr{Cvoid}, (Ptr{Cvoid},), conn)
     sock = @handle_as hand LibuvStream
-    @assert sock.status == StatusConnecting
     if status >= 0
-        sock.status = StatusOpen
+        if !(sock.status == StatusClosed || sock.status == StatusClosing)
+            sock.status = StatusOpen
+        end
         notify(sock.connectnotify)
     else
-        sock.status = StatusInit
+        ccall(:jl_forceclose_uv, Cvoid, (Ptr{Cvoid},), hand)
         err = UVError("connect", status)
         notify_error(sock.connectnotify, err)
     end

samoconnor (Contributor, author):
See new commit.
When I tried to use just close(tcp), the close call didn't return. If I did @schedule close(tcp), the REPL hung on exit. So for now I'm calling ccall(:jl_forceclose_uv, ...), which seems to work.

Is it worth making a PR against stdlib/Sockets for this? (It seems like it might be easier to make it robust without being confined to the official public interface.) Or should it stay in HTTP.jl for now?

samoconnor (Contributor, author):
Re: list of known host timeouts ...

Doesn't feel right to me.
It might make sense for API packages like GitHub.jl and AWSSQS.jl to be tuned for the behaviour of their particular servers.
I think a sensible default is to have no timeout (assuming that the current scheme of retrying automatically when the connection is found to be dead works in general). Even with a well-tuned timeout, the first connection after the timeout still has to pay the reconnection penalty. I think anyone whose application depends on guaranteed worst-case latency limits will need some higher-level scheme of application-level heartbeats, pre-warming of spare connections, etc.

What I've read about TCP-level keepalive suggests that in practice many CDNs, ISP NATs, firewalls, proxies, etc. either don't forward keepalives or don't count them as activity for timeout purposes.

The general idea of a metadata directory for HTTP endpoints is an interesting one. I don't think building the database into the Julia HTTP library is right, but I can imagine a web service where you do GET http://httpmeta.net/info?url=foobar.com and get back JSON containing a bunch of info about that server's standards compliance, known issues, tuning parameters, etc. Then again, maybe it wouldn't be much use in practice: these days when you go to foobar.com there are many layers of geo-aware CDNs, local ISP caches/proxies, etc. between you and the "real server", so the tuning that works for whatever "foobar.com" connects you to might not work for me.

codecov-io commented Mar 5, 2018:

Codecov Report

No coverage uploaded for pull request base (master@9e41fc3).
The diff coverage is 40.74%.

@@            Coverage Diff            @@
##             master     #215   +/-   ##
=========================================
  Coverage          ?   72.33%
=========================================
  Files             ?       35
  Lines             ?     1836
  Branches          ?        0
=========================================
  Hits              ?     1328
  Misses            ?      508
  Partials          ?        0

Impacted Files          Coverage           Δ
src/Servers.jl          71.03% <100%>      (ø)
src/ConnectionPool.jl   78.39% <38.46%>    (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e41fc3...a5330d5.

yield()

# sum(delays) ~= connect_timeout
delays = ExponentialBackOff(n=connect_timeout + 10,
vtjnash (Member):
Why are you busy-polling this? We have a callback for it (wait_connected)

samoconnor (Contributor, author):
I couldn't see how to simultaneously wait for connection success and for a timeout.
I can call wait_connected in a background task and notify a condition when it is done, but wait(::Condition) has no timeout parameter. Or I could wait for the task to be done, but fetch(::Task) has no timeout parameter either.

Are you thinking along the lines of wait_connected in the main task and a background task that calls close after the timeout to wake up the wait_connected with an error? I used that pattern over here: https://github.com/JuliaWeb/HTTP.jl/blob/master/src/TimeoutRequest.jl#L20
Don't know why it didn't occur to me in this case :)

vtjnash (Member):
I think you answered your own question...yes.

if tcp.status == Base.StatusConnecting
timeout[] = true
ccall(:jl_forceclose_uv, Void, (Ptr{Void},), tcp.handle)
#close(tcp)
samoconnor (Contributor, author):
@vtjnash close didn't work here (0.6.2).
But I assume that's some small detail of how Base.close handles status flags, because closing while opening is not a common operation.

vtjnash (Member):
Yeah, maybe fixed on master? This is right, but should also set the state to StatusClosing
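Applied to the snippet above, the suggestion would look something like this (0.6-era Void syntax as in the quoted code; whether plain close suffices on newer Julia is left open):

if tcp.status == Base.StatusConnecting
    timeout[] = true
    tcp.status = Base.StatusClosing  # mark the stream as closing first
    ccall(:jl_forceclose_uv, Void, (Ptr{Void},), tcp.handle)
end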

Base.wait_connected(tcp)
catch e
if timeout[]
throw(ConnectTimeout(host, port))
vtjnash (Member):
Why do you want to make sure it throws this exception?

samoconnor (Contributor, author):
Because I don't think UVError should ever be thrown from a user-level API unless it means there is a bug. See JuliaLang/julia#15879 etc.

I think application-level code will want to handle "I set a short timeout and it wasn't met" differently from "something bad happened deep in the network stack".

Elsewhere in HTTP.jl we wrap things like UVError and ArgumentError "stream is closed or unusable" in an IOError type:
https://github.com/JuliaWeb/HTTP.jl/blob/master/src/ConnectionRequest.jl#L36

In higher-level code it is important to handle some specific errors by retrying, but to make sure everything else finds its way up to a crash, a stack trace, an issue report, and a fix. e.g. https://github.com/JuliaCloud/AWSCore.jl/blob/master/src/http.jl#L47-L51
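A hedged sketch of that wrapping pattern; the exception type and the getconnection call are illustrative, not the exact code at the links above:

# Illustrative wrapper type: carries the underlying exception so callers
# can decide whether to retry.
struct IOError <: Exception
    e::Any           # e.g. Base.UVError
    message::String
end

try
    getconnection(host, port)  # hypothetical connect call
catch e
    if e isa Base.UVError ||
       (e isa ArgumentError && e.msg == "stream is closed or unusable")
        throw(IOError(e, "connection failed"))  # recoverable: callers may retry
    else
        rethrow()  # unexpected: let it crash loudly
    end
end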

samoconnor (Contributor, author), Mar 5, 2018:

ArgumentError "stream is closed or unusable" seems pretty wrong because it looks like a precondition error, but there is now way to check the precondition before calling the API because the stream can be isopen() before the call but be found to be "closed or unusable" while the call is running.

samoconnor (Contributor, author):
Hi @vtjnash, I just noticed that poll_fd has a timeout_s option. Would that work here?

vtjnash (Member):
I'm not sure how that's relevant here, since you don't have a fd. But yes, that does show another pattern that can be used to implement timeouts (with some more difficulty ... in fact, you've just inspired me to rewrite that file better ... you may see a new PR to base tomorrow sometime 🙂)

samoconnor (Contributor, author):
I was assuming that I could get the fd out of the socket with uv_fileno or something like that.
Is there a gotcha along the lines of Windows socket handles not working with poll_fd?

> new PR

👍🙃

vtjnash (Member):
(a) the socket isn't open, so you can't poll it (b) it's invalid to pass libuv-derived fd handles to libuv

samoconnor (Contributor, author):
thx for taking the time to explain :)


samoconnor changed the title from "WIP: add ConnectionPool connect_timeout and idle_timeout" to "Add ConnectionPool connect_timeout and idle_timeout" (Mar 9, 2018)
samoconnor requested a review from quinnj (Mar 18, 2018)
samoconnor (Contributor, author):
@quinnj would you mind doing a quick review of this and then merging?
It might be handy to have it available whenever someone has time to debug GitHub.jl more.

quinnj (Member) left a comment:
Looks good. Sorry it took me a while to review.

samoconnor (Contributor, author):
No problem. And thanks!
