using `core.async/put!` w/o backpressure will drop messages #124

ztellman · 2015-05-13T19:54:29Z

https://github.com/ptaoussanis/sente/blob/master/src/taoensso/sente.cljx#L201

If 1024 + buffer-size messages are put! without any take! calls on the other end (which is very possible for all sorts of reasons), then put! will throw an exception, and effectively drop the message. I'm not sure what http-kit or Immutant will do if an exception is thrown, but even if they close the connection (which is pretty much the only correct response, because otherwise messages will just silently vanish) that's not really a desirable failure mode.

I spoke to @tobias at Clojure/West, and he indicated that when using Immutant in a servlet container there was no way to exert backpressure, but outside the container there was. Unfortunately, http-kit doesn't seem to have any mechanism for backpressure at all.

Given that people are using already http-kit and Immutant-in-a-servlet to good effect now, it's possible that the only "fix" is to have clear documentation of this failure mode. However, we should make sure that connections are successfully closed when this happens, and I'd argue that creating adapters for servers which don't have this problem would be a reasonable idea.

The text was updated successfully, but these errors were encountered:

ptaoussanis · 2015-05-14T02:32:48Z

Hi Zach,

I'm not too clear on whether you're suggesting that http-kit/Immutant are a problem, or that core.async is?

If 1024 + buffer-size messages are put! without any take! calls on the other end (which is very possible for all sorts of reasons), then put! will throw an exception, and effectively drop the message.

Is it possible you've highlighted the wrong line here? That channel is buffered (sliding by default, but configurable). All Sente channels have sliding buffers by default; put!s will only throw against fixed-sized buffers.

Given that people are using already http-kit and Immutant-in-a-servlet to good effect now, it's possible that the only "fix" is to have clear documentation of this failure mode.

I'm not sure I understand your characterization that these servers are "broken". Like most software, if they become overloaded their behaviour can become unpredictable. In the case of these servers it's quite difficult to overload them (am not sure I've ever seen a bonafide case of a dynamic application bottlenecking at the http server side before the application/db layer).

Anyway, would a large/important application not normally be behind standard load balancing for all sorts of reasons (incl. DDoS protection)?

ztellman · 2015-05-14T04:26:30Z

I'm saying that the interop between core.async and http-kit/immutant doesn't allow backpressure to be propagated to the websocket client. This is due to a fundamental limitation in both of these web servers. This is discussed in detail in my talk at Clojure/West, where I alluded to some of these issues.

Is it possible you've highlighted the wrong line here? That channel is buffered (sliding by default, but configurable). All Sente channels have sliding buffers by default; put!s will only throw against fixed-sized buffers.

Okay, I didn't realize the buffers were sliding by default. That means that exceptions won't be thrown, but that does mean that under load arbitrary number of messages will be lost, which is consistent with the description of this issue.

I'm not sure I understand your characterization that these servers are "broken". Like most software, if they become overloaded their behaviour can become unpredictable.

Very strongly disagree. TCP provides mechanisms for servers to avoid becoming overloaded to the point of breaking/losing data, and these mechanisms are not being used. I can only describe silent loss of data under load as a bug.

In the case of these servers it's quite difficult to overload them (am not sure I've ever seen a bonafide case of a dynamic application bottlenecking at the http server side before the application/db layer).

Let's say that you pass a core.async channel representing websocket messages, which then does something like this:

(go-loop []
  (when-let [msg (<! websocket-chan)]
    (>! message-queue-chan msg)
    (recur)))

This just forward messages to a message queue, which is somewhere across the network. Let's say that the message queue becomes unavailable for a bit. The go loop will hang, it will cease to read from the websocket-chan, the channel's buffer will become full, and messages will be (silently!) dropped.

The hypothetical bottlenecks that cause this behavior are not in the web server, they're anywhere downstream of the web server. The expectation of the downstream system is that if it stops accepting messages, so too should the web server. That's not happening here, and that's incorrect.

ptaoussanis · 2015-05-14T05:37:51Z

Edit: my response was written in a hurry; please excuse typos.

I'm saying that the interop between core.async and http-kit/immutant doesn't allow backpressure to be propagated to the websocket client. This is due to a fundamental limitation in both of these web servers.

This is a limitation, but I just don't think it's an important one or of much practical significance in this context.

Okay, I didn't realize the buffers were sliding by default. That means that exceptions won't be thrown, but that does mean that under load arbitrary number of messages will be lost, which is consistent with the description of this issue.

This is where we're diverging, I think. Sliding buffers are there specifically to deal with cases like this. That's their designed purpose. The servers don't supply back pressure, so core.async provides a well-defined overflow mechanism.

Messages are not arbitrarily lost; they're discarded in a well-defined +configurable way that's appropriate to what we're doing. The purpose of back pressure here would be to keep the servers from becoming overwhelmed with requests they can't fulfil, yes?

Client->server requests that travel over the internet are generally time sensitive in a step-wise way: you want them to be as fast as possible, but after a certain amount of waiting any response is equally worthless.

Waiting 1 second for a response is better than waiting 2 seconds for a response, but waiting 20 seconds is not better than waiting 30 if after 10 seconds the user's lost interest and moved on.

Sliding buffers give us 2 things here:

They protect the server from being overwhelmed, as back pressure would.
They give us a reasonable strategy for dealing with queue overflow by acknowledging that old requests are likely outdated anyway or issued by user's that have become frustrated and anyway moved on.

The internet being unreliable, all client UI that involves client<->server comms must deal gracefully with server requests that fail to return within a prescribed timeout.

That's something that'd be true regardless of the cause of the timeout: the connection may have dropped, the server may have exploded, the server may have become overloaded and is taking forever to reply because it has no overload strategy, the server may be exerting back pressure, or the server may have discarded an old request due to a sliding queue buffer.

From the user's perspective, all causes are equally frustrating ("the website's not working!").

From the server's perspective, the only thing that matters is that we avoid making the problem worse; i.e. that we avoid the particular case of unmanaged overload - which a sliding buffer does in this case just as well as server back pressure.

Very strongly disagree. TCP provides mechanisms for servers to avoid becoming overloaded to the point of breaking/losing data, and these mechanisms are not being used. I can only describe silent loss of data under load as a bug.

This just forward messages to a message queue, which is somewhere across the network. Let's say that the message queue becomes unavailable for a bit. The go loop will hang, it will cease to read from the websocket-chan, the channel's buffer will become full, and messages will be (silently!) dropped.

Again, dropping data in a well-defined (+configurable) way appropriate for the work characteristics is different from "breaking/losing-data". And whatever mechanism you use to protect the server, the result is the same: the client/user will be frustrated by not having her requests quickly fulfilled.

So the same strategies become necessary: use load balancing to try prevent server overload, and use client-side timeouts in the UI to handle server overload (and other issues) as gracefully as possible when they do unavoidably occur.

The expectation of the downstream system is that if it stops accepting messages, so too should the web server. That's not happening here, and that's incorrect.

As discussed above, the web server refusing / pushing back against new requests would be strictly worse than dropping old requests in favour of new ones. A sliding buffer literally defines the kind of behaviour we want here.

And further to a point I made earlier: the http server is rarely the bottleneck in practice; that's almost always the application/db layer. A dropping channel makes it easier for the application to control the amount of work it feels able to do.

Back pressure is a tool; it can be a invaluable when what you want is back pressure (and you often will), but I disagree with your assertion (as I currently understand it) that back pressure in this particular context would be better or even somehow necessary for the system to not be "broken".

Please feel free to correct if I've misunderstood something, but just a friendly note that I'm actually on some urgent work atm so won't likely be able to follow up on this discussion much more right now.

Do appreciate your input Zach, thank you.

Cheers! :-)

ztellman · 2015-05-14T06:34:37Z

Completely understand if other matters take priority, thanks for your quick responses so far. I'll address your points now, but respond whenever you're able.

To paraphrase your response, you don't see dropping data as meaningfully different from pausing data, since both represent a degraded state compared to a healthy, responsive system. This is only true if newer messages make older messages obsolete, which may be true for some applications that use WebSockets, but certainly isn't true for all of them.

For instance, let's say I have a chat client that has two kinds of messages, join-room and send-message. Maybe it's not the end of the world if some of the messages are lost, but it certainly would be confusing if my join-room is silently dropped and all my subsequent messages are sent to the old room.

WebSockets emulates TCP, in that messages are strongly ordered and have reliable delivery. If I send A, B, and C, there is no way that the server will receive and process C without having received A and B first, in that order. We could avoid the above scenario by adding an application-level ACK of the join-room message, but because of the guarantees the WebSocket protocol provides us, we don't have to. I would consider this a central, critical property of the WebSocket spec.

Conversely, what Sente provides is something weaker. Messages won't arrive out-of-order, but any of them might fail to arrive. If we need to be sure that a message was received before sending other messages, we need to add an application-level ACK, which adds both complexity to our code, and latency to a message that we could otherwise just immediately send. It's basically the UDP to WebSocket's TCP, which is not at all obvious from the documentation.

Further to a point I made earlier: in practice the http server is also rarely the bottleneck in practice; that's almost always the application/db layer. A dropping channel let's the application control the amount of work it feels able to do.

The point here is that you're not letting the application make the decision of how to deal with too much data, you're making the decision for them. If dropping data is the appropriate thing to do if there's too much data, the application can put a sliding buffer downstream of your channel. If only some of the messages can be dropped, it can apply that application-specific logic. And if all messages are important, it can use backpressure, just like every other TCP-based network service in the world.

Trying to make that decision at a TCP level neglects the reality (IMO) that it's rarely the TCP layer that's causing the backup.
Back pressure is a tool; it can be a invaluable when what you want is back pressure, but I disagree with your assertion (as I currently understand it) that back pressure in this particular context would be better or even somehow necessary for the system to not be "broken".

This has nothing to do with whether the network or server is "to blame" for the issue, the point is that unless you know that the application can drop data, or what it can send in response to reject the data, the only correct choice is to use backpressure and defer to the author of the application, who may prefer slow and complete to fast and lossy.

Not using backpressure makes Sente broken because it's making assumptions that only hold true for a certain class of applications AND doesn't document this behavior. If there were documentation it wouldn't be broken, but it would be making some questionable design decisions. If it used backpressure, it would work for all classes of applications, and its current behavior could be trivially emulated by any application that desired it.

I hope this clarifies the intent behind this issue. I realize it's both presumptuous and a little self-important to ask that you watch the 40 minute video I linked above, but it probably will explain this better and in more depth than I can here.

ptaoussanis · 2015-05-14T12:01:18Z

To paraphrase your response, you don't see dropping data as meaningfully different from pausing data, since both represent a degraded state compared to a healthy, responsive system.

That's correct.

This is only true if newer messages make older messages obsolete, which may be true for some applications that use WebSockets, but certainly isn't true for all of them.

What I'm saying is that in practice you'll almost always need to assume that messages will sometimes go missing for a multitude of reasons, not just server overload.

I.e. the solution isn't back pressure but writing your application to be resilient to the whole class of such issues, of which server overload is only one element.

This isn't academic or difficult to achieve; it's how most robust client applications are currently built. In Sente's case, any message sent to the server can request a server ack - with timeouts.

In your chat room example, you'll want to request an ack for important side effects like changing a room. That's something you'd need to do anyway even if the server had perfect back pressure and no load since the user may be driving through a tunnel, or there may be a temporary net split, etc.

Does that make sense?

As an analogy - instead of trying to build a server that never dies, Google & co. discovered pretty early on that it makes sense to just assume that servers will routinely die and program around that.

My assertion is that the internet is flaky; therefore you need the client to be resilient to msg delivery issues. And once you are, the particular mechanism that the server uses to make sure it doesn't explode is pretty much irrelevant to the client. In this case, the mechanism is a sliding buffer.

ztellman · 2015-05-14T14:18:07Z

If we're talking about a single TCP connection, then I think "the internet is flaky" is overly reductive. If I send A, B, and C, then the sets of messages that can be lost are {C}, {B,C}, and {A,B,C}. I can't just lose {B}, or {A,C}, or anything else like that. Whatever else fails over the network (and pretty much everything can and does), this is something we can rely on, and isn't an invariant we should easily give up.

However, re-reading your response and documentation, I think I may have overlooked something: when you lose a connection, you transparently reconnect without notifying the client that they're on a new TCP session. This is a nice simplification, but it also means that the connection may drop after/during any message, which means that any message can be lost without that carrying over to the next message, which can be sent on a different connection.

If I've understood this correctly, then pretty much everything I've said above doesn't apply; if any message can be lost at the transport level, then dropping messages on the server side doesn't break any guarantees.

So maybe the thing to do is just close this issue, with my apologies. I will say, though, that the guarantees, or lack thereof, weren't obvious to me based on the documentation. Maybe this is because I'm used to dealing with individual TCP sessions rather than higher-level abstractions, and isn't representative of your typical user, but take it for whatever it's worth.

Thanks for your prompt and detailed responses to my messages.

ptaoussanis · 2015-05-15T04:21:42Z

Definitely no apologies necessary, appreciate all the time you took looking into this. It's often handy to double check the details :-)

I will say, though, that the guarantees, or lack thereof, weren't obvious to me based on the documentation. Maybe this is because I'm used to dealing with individual TCP sessions rather than higher-level abstractions, and isn't representative of your typical user, but take it for whatever it's worth.

Thanks for mentioning this. A little short on time recently, but I'll try make some clarifications when I'm in the docs next (next release probably).

Thanks for your prompt and detailed responses to my messages.

You're very welcome, and likewise. I'll try check out your talk this weekend.

Quick point re: something else you mentioned earlier -

I'd argue that creating adapters for servers which don't have this problem would be a reasonable idea.

I'd of course be happy to see a PR for an Aleph adapter (or other servers) if you or anyone else felt like contributing. It's pretty straightforward; there's info on what's required here. Just juggling too many things atm to look into it myself right now.

I'll leave this issue for you to close if you're satisfied. Cheers :-)

ztellman · 2015-05-18T16:58:03Z

I'll close this, and follow up with an Aleph adapter at some point (soon, hopefully). Thanks again.

nha · 2016-01-13T17:05:39Z

Is there any progress on this ?

Also, I understand that sente supports long-polling or websockets, would it be possible to implement some middle-ground too with those interface (ie. SSE) ?

danielcompton · 2016-01-13T18:56:57Z

@nha I think the reason why SSE isn't implemented is that Websockets is the preferred connection method, but if that's not available in legacy browsers it will fall back to Long Polling. In that context, SSE doesn't add much as most browsers support Websockets already, adding SSE doesn't buy you much.

ptaoussanis · 2016-01-14T02:32:58Z

@nha Sorry Nicolas, any progress on what- not sure I follow?

Re: SSE, @danielcompton is correct- doesn't seem much reason to add any other implementations; is there a specific reason you had in mind?

nha · 2016-01-19T10:20:19Z

@ptaoussanis Ah the progress was more directed to @ztellman for an Aleph adapter to sente
I will implement it if I can find time, as I use aleph/yada

@ptaoussanis @danielcompton The reason I had in mind were :

interoperability. Ie. I could then use a simple curl command to get streaming updates from a sente enabled route/webserver.
in the case of a mostly-read client, it would probably be faster/more efficient (could still use a shim in the browser if SSE is not supported, like https://github.com/Yaffle/EventSource/ )

I may very well be wrong, and it may be out of scope for sente though.

ztellman · 2016-01-19T19:08:34Z

@domkm and I were planning to write something for Sente by month's end. I'll let you know if something prevents us from finishing it up.

ptaoussanis · 2016-01-21T12:06:40Z

Just a request please to help keep things organized: would appreciate if future discussions about a new adapter could go either to #102 or a separate "Aleph adapter" issue, etc. Lots of unrelated text here to get through if someone's just looking for adapter info.

Thanks :-)

settinghead · 2016-03-04T21:09:48Z

Hi. Any update on the sente-aleph adapter?

ztellman · 2016-03-04T21:14:56Z

I've just opened #208.

ztellman closed this as completed May 18, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using `core.async/put!` w/o backpressure will drop messages #124

using `core.async/put!` w/o backpressure will drop messages #124

ztellman commented May 13, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 15, 2015

ztellman commented May 18, 2015

nha commented Jan 13, 2016

danielcompton commented Jan 13, 2016

ptaoussanis commented Jan 14, 2016

nha commented Jan 19, 2016

ztellman commented Jan 19, 2016

ptaoussanis commented Jan 21, 2016

settinghead commented Mar 4, 2016

ztellman commented Mar 4, 2016

using core.async/put! w/o backpressure will drop messages #124

using core.async/put! w/o backpressure will drop messages #124

Comments

ztellman commented May 13, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 14, 2015

ztellman commented May 14, 2015

ptaoussanis commented May 15, 2015

ztellman commented May 18, 2015

nha commented Jan 13, 2016

danielcompton commented Jan 13, 2016

ptaoussanis commented Jan 14, 2016

nha commented Jan 19, 2016

ztellman commented Jan 19, 2016

ptaoussanis commented Jan 21, 2016

settinghead commented Mar 4, 2016

ztellman commented Mar 4, 2016

using `core.async/put!` w/o backpressure will drop messages #124

using `core.async/put!` w/o backpressure will drop messages #124