Stats for congestion control and bandwidth estimation #21
I agree that providing information about the transport is important for the application to adapt to network conditions. The app may want estimated bandwidth and RTT. Currently the implementation in Chromium provides three pieces of information in the C++ that could easily be exposed up to JS: bandwidth_estimate, pacing_rate (the rate at which the congestion controller wants to send right now, which may be higher or lower than the estimated bandwidth), and RTT. They seem to be updated whenever we receive an ack. Every ack might be a bit too frequent, but perhaps an event throttled to some reasonable frequency would make sense (rather than polling).
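The throttled-event idea above could be sketched roughly as follows. Everything here (the CongestionInfo shape and the ThrottledNotifier class) is hypothetical and not part of any draft; it just shows how per-ack updates might be coalesced to a maximum frequency instead of requiring polling.

```typescript
// Hypothetical sketch: forward per-ack congestion info to the app at most
// once per minIntervalMs, instead of firing an event on every ack.
interface CongestionInfo {
  bandwidthEstimate: number; // bits per second
  pacingRate: number;        // bits per second
  rttMs: number;
}

class ThrottledNotifier {
  private lastFiredMs = -Infinity;
  constructor(
    private minIntervalMs: number,
    private listener: (info: CongestionInfo) => void,
  ) {}

  // Called by the transport on every ack; fires at most once per interval.
  onAck(nowMs: number, info: CongestionInfo): boolean {
    if (nowMs - this.lastFiredMs < this.minIntervalMs) return false;
    this.lastFiredMs = nowMs;
    this.listener(info);
    return true;
  }
}
```

A listener registered this way would see fresh bandwidth/RTT values at a bounded rate, which is the "throttled event" behavior suggested above.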
Thank you for looping me in. I've skimmed the current WebTransport draft (at least the DatagramTransport-related parts) and pondered your question for a bit. I think what you're trying to do here is super-interesting. But there's a difference between the way most UDP apps use this information and the way the current draft wants it to work. In Salsify/Sprout/Mosh, the transport gives the application its best guess about how much data can safely be transmitted (with some high probability) in the next n milliseconds, and then the app basically sends that amount of data in reliance on the estimate. If the estimate turns out to have been too high (e.g., the congestion controller changes its mind after receiving more information and becomes more conservative), these apps can react by bailing out midway (and sending a shorter message instead), or by proceeding headfirst, sending a too-big message, and making up for it later, effectively borrowing against a future allocation. (E.g., RFC 3448/5348 "TCP-friendly rate control" -- matching the long-term allocation of RFC-compliant AIMD TCP but on a longer timescale with slower variation than actual AIMD TCP.) In the current draft, the app doesn't have this kind of flexibility -- there is a congestion controller on the other side of the API boundary that blocks or drops datagrams, and there is no specified way for the application to borrow against a future allocation (to achieve slowly-varying rate control) or even to get advance notice that the controller is not going to adhere to a prior estimate. That seems like a real challenge, and it defeats the reason a lot of apps want to use UDP. As far as I know, this kind of arm's-length separation between a congestion-control scheme and a datagram-based application has never been successfully executed by anybody (e.g., DCCP was a prior not-really-successful attempt). I'm not saying it's impossible, but I don't know of a successful example ready for standardization.
I can also see some additional practical challenges -- is the congestion controller really going to have no internal sender-side buffer and be willing to wait for the application each time it's "ready" to send something? This could be an unpleasant wait and could hurt performance (or the congestion controller might even no longer be "ready" by the time the promise gets around to producing a payload). I wonder if you have a corpus in your heads of interesting UDP-based systems and some shared understanding of which ones you want to be implementable with this API and which you don't. Because my thinking is that the underlying congestion control, and the way you draw the API, is going to be key to how broadly useful this turns out to be. Applications end up using datagram interfaces for a bunch of different reasons, including:
The current draft seems pretty heavily directed at use cases 3 and 4, and this issue is flirting in the direction of 2 (but see my comments above). My concern would be that in practice, a lot of UDP-using apps really care about case 1: they don't just want more information from the congestion controller; they really do want different congestion-control behavior, or tighter integration between the app and the congestion controller, or "binding" estimates (instead of just best guesses) from the congestion controller, or a different latency-vs-throughput tradeoff than they get by default, etc. So if you want to support those use-cases (real-time video, probably some first-person-shooters, etc.), the API may need to support some "actuation" and not just information. As a first step, you could imagine letting the app express where it wants to be on the latency-vs-throughput tradeoff space and maybe the short-term fairness vs. long-term fairness tradeoff space, and having the browser choose an appropriate (but still safe) congestion-control behavior as a result. I do think app developers who want to push the envelope (and most apps use UDP because they want to push the envelope somehow) are going to be curious about the threat model underlying, "All stream data is encrypted and congestion-controlled" and what this really means. We're talking about datagrams sent from an origin-controlled JavaScript program, to the origin. Is the congestion control going to be on a per-stream basis? Then if my needs are different from the default congestion control, I'm going to want to open 1024 streams and round-robin my datagrams among them, and then do the congestion control myself. Or on a per-origin basis? Then I'm going to want to have 1024 iframes from different origins, all going back to the same place, and again do the congestion control myself. 
Given that most users are one click away from downloading an Android or iPhone or normal-computer app that has free access to the operating system's datagram interface and can send whenever it wants (and given that most webpages already cause browsers to open tens or hundreds of TCP connections, with the kernel doing congestion control on a per-connection basis), what is the anti-congestion or safety-against-bad-apps property that the spec really wants the browser to enforce?
WebRTC-over-QUIC efforts have discussed this point at some length internally, and with Victor and others on the gQUIC team. We're now looking at using WebTransport for media as part of this effort, and I'm particularly interested in the implications for congestion control. I'd summarize what I want as a more cooperative relationship between congestion control and the application. I'd like to see something that lies somewhere on a spectrum between:
The closer we go to option 2, the more insight I'd want into transport-layer feedback: send and receive timestamps, acks, etc. I'd effectively be writing portions of the transport (e.g. the send algorithm) myself and shipping them as a WASM module. I'm perfectly comfortable doing that, but I'm not sure it would make a great API for the web. Related to the threat model, Victor pointed out to me that unlike raw UDP, QUIC datagrams still have acks, so the transport can still see what's happening on the network. There might be a middle-ground option between "you get the congestion controller we give you" and "no congestion controller", where the transport runs a safety-net congestion controller. I haven't put a lot of thought into this yet, but what I have in mind is something like BBR, except that if I keep latency under control myself, it won't interrupt me for PROBE_RTT and low-gain cycles, and it will let me choose when I want to probe for more bandwidth (enter a high-gain cycle), but it will enforce some reasonable limits. If I send for a whole high-gain cycle unsuccessfully, it might force a subsequent low-gain cycle, and it might enforce some 'cooldown' between high-gain cycles.
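The "safety-net with app-chosen probing" idea might look something like the sketch below. The names (ProbeGovernor, Gain) and the cooldown policy are assumptions for illustration, not BBR's actual state machine: the app asks to enter a high-gain cycle, and the governor enforces a cooldown and forces a drain (low-gain) cycle after an unsuccessful probe.

```typescript
// Illustrative sketch of a transport-side "probe governor": the app
// decides when to probe for bandwidth, but the transport enforces a
// cooldown between probes and a forced drain after a failed probe.
type Gain = "unity" | "high" | "low";

class ProbeGovernor {
  private lastProbeEndMs = -Infinity;
  private mustDrain = false;
  constructor(private cooldownMs: number) {}

  // The app asks to enter a high-gain cycle at time nowMs.
  requestProbe(nowMs: number): Gain {
    if (this.mustDrain) {
      this.mustDrain = false;
      return "low"; // forced drain before another probe is allowed
    }
    if (nowMs - this.lastProbeEndMs < this.cooldownMs) return "unity";
    return "high";
  }

  // The transport reports how the probe went (e.g. loss or queue growth).
  probeFinished(nowMs: number, successful: boolean): void {
    this.lastProbeEndMs = nowMs;
    if (!successful) this.mustDrain = true;
  }
}
```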
The standardized "safety-net" approach would probably be to have the browser enforce an RFC 8084 "Network Transport Circuit Breaker" on whatever the app decides to send. The threat model of trying to restrain an unfair/oversending app still doesn't quite make sense to me if the circuit breaker is per-stream or per-origin, since anybody who wants to circumvent the control is just going to open up a lot of streams and keep switching after the circuit breaker is triggered. You could consider having a single per-page circuit breaker (and, like, severely throttle all WebTransport for 60 seconds when the circuit breaker triggers?), but I wonder if you might find that simultaneously too restrictive (because it's a big penalty for the whole page) and also not restrictive enough (because the streams aren't really congestion-controlled by the browser until there's been a violation of basic norms for a significant amount of time). My view is also that you'd want the API to encourage a pretty arm's-length relationship between the app and browser, and not introduce a sensitivity to the behavior or exposed state variables of a particular congestion-control scheme. Other browsers are going to choose different schemes or a different circuit breaker, and you wouldn't want apps written against this API for Chrome (using BBR or GCC) to end up with dramatically lower performance elsewhere because they have an unwitting latent sensitivity to whatever Chrome does or exposes. (BBRv1 turned out to be an "unfair/oversending app" itself in some cases [1], and BBRv2 is still under development and internal to Google. These things are evolving and the community is not always in agreement about how to evaluate new schemes. So I don't think baking BBR into a web standard, even de facto in that apps would be coded in a way that ends up depending on its behavior or its state variables, would be wise.) 
Which is to say, an API that ends up like your "option 2 plus a long-term per-page circuit breaker/safety-net with a big penalty on breaking it" would seem reasonable to me, so maybe we are in agreement, but that's pretty far from the spec's current language on having all datagrams be congestion-controlled. [1] https://platformlab.stanford.edu/Presentations/2019/retreat-2019/Keith%20Winstein.pdf, slides 15-18
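A minimal sketch of an RFC 8084-style circuit breaker, under the assumption (an illustrative choice, not the RFC's exact procedure) that "tripping" means severely throttling the page for a fixed penalty period: if the measured loss ratio stays above a threshold for several consecutive measurement intervals, sending is blocked until the penalty expires.

```typescript
// Rough circuit-breaker sketch: persistent excessive loss over
// consecutive measurement intervals trips the breaker, which blocks
// sending for a fixed penalty window. All thresholds are illustrative.
class CircuitBreaker {
  private badIntervals = 0;
  private trippedUntilMs = -Infinity;
  constructor(
    private lossThreshold = 0.1,   // 10% loss per interval
    private intervalsToTrip = 3,   // consecutive bad intervals to trip
    private penaltyMs = 60_000,    // throttle for 60 s once tripped
  ) {}

  // Called at the end of each measurement interval with send/loss counts.
  endInterval(nowMs: number, sent: number, lost: number): void {
    const lossRatio = sent > 0 ? lost / sent : 0;
    this.badIntervals =
      lossRatio > this.lossThreshold ? this.badIntervals + 1 : 0;
    if (this.badIntervals >= this.intervalsToTrip) {
      this.trippedUntilMs = nowMs + this.penaltyMs;
      this.badIntervals = 0;
    }
  }

  maySend(nowMs: number): boolean {
    return nowMs >= this.trippedUntilMs;
  }
}
```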
Yeah, I think we're in agreement that no built-in congestion-control plus an RFC 8084 circuit breaker sounds reasonable. I also agree that the specifics of the congestion controller shouldn't be exposed to the app. I used BBR as an example because I'm familiar with it, and I think it's what the RTCQuicTransport origin trial uses, and I've heard it thrown around as the proposed congestion controller for WebTransport, too. I'd like to either see one of:
The circuit-breaker idea does raise the question of whether QUIC datagrams should be congestion-controlled at the transport layer. I recall a leaning in that direction at the last IETF side-meeting on QUIC datagrams, but there were definitely use-cases like VPN raised which make it less clear.
Approach 2) makes sense to me, for the reasons articulated above. I do suspect an uncapped transport * N open transports presents a different sort of risk than simply N open transports, but I think we're now focusing on the issue of abuse rather than fairness, which seems more tractable.
I have read this thread with great interest and I want to provide my two cents. I really like the idea of allowing WebApps to innovate on congestion control algorithms and not have to rely on whatever congestion control algorithms are built into browsers.
I suspect the answer to these questions is "no". This is clearest for the question in the last bullet, since Window.setTimeout to my knowledge has very limited accuracy, and that would need to be the mechanism that triggers the sending of a UDP packet. Because of this, I think it could be very helpful if the API included possibilities to:
With these two primitives, I believe that most rate-based congestion control algorithms - including GCC - could be implemented in JavaScript without the need to handle each UDP packet in real time. One use case I am particularly interested in is P2P systems for large-scale distribution of video data while it is being consumed. Several such commercial systems built on top of WebRTC DataChannels already exist today. Such systems want to be very non-aggressive in the upstream direction of end-users. The reason for this is that an individual end-user does not benefit from contributing upstream to other end-users. So it is desirable to only use upstream bandwidth if it does not impact other traffic. For such a use case, I think it would be useful if the browser implemented an additional congestion control algorithm (such as TCP Reno, BBR, GCC, LEDBAT, whatever) on top of the congestion control algorithm implemented in JavaScript. This would ensure that even down on the level of each individual packet, we would never be more aggressive than this additional algorithm. A similar approach is used by LEDBAT, which makes sure it is never more aggressive than TCP Reno. With such a mechanism we would not need to have an RFC 8084-style circuit breaker in the browser, and ill-behaving WebApps would have a harder time congesting the network than if we relied on a circuit breaker. But obviously it will also rule out some use cases that a circuit breaker would allow for, so an idea could be to allow JavaScript to select between the two mechanisms.
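The "additional congestion control algorithm on top" proposal essentially reduces to taking the minimum of two controllers. A toy sketch follows, with an invented LEDBAT-flavored cap that backs off linearly as queuing delay approaches a target (real LEDBAT is considerably more involved); the point is only that the app's rate can go below the cap but never above it.

```typescript
// Invented LEDBAT-flavored cap: full rate at zero queuing delay,
// backing off linearly to zero as queuing delay reaches the target.
function scavengerCap(
  baseDelayMs: number,
  currentDelayMs: number,
  maxRateBps: number,
  targetQueueMs = 100,
): number {
  const queuingDelay = Math.max(0, currentDelayMs - baseDelayMs);
  const fraction = Math.max(0, 1 - queuingDelay / targetQueueMs);
  return maxRateBps * fraction;
}

// The app's JS controller proposes a rate; the browser-run cap wins
// whenever the app asks for more, so the app is never more aggressive.
function effectiveSendRate(appRateBps: number, capRateBps: number): number {
  return Math.min(appRateBps, capRateBps);
}
```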
Regarding letting the application choose the congestion control algorithm, I've been thinking about it, and there are various extents to which we can go. I'll refer to them as "levels". Level 0: the browser just uses its default congestion control algorithm that it uses for HTTP traffic. This is where the spec is currently. Level 1: we allow the web application to switch between different algorithms. We could just export the full list of algorithms with their names, but I would prefer us to let the application specify only the category of CC ("bulk" for Reno/CUBIC/BBR, "best-effort" for LEDBAT, "real-time" for WebRTC-like algorithms). This should be relatively simple to add to the spec, so we should just do it. Level 2: we notify the web application about CC-level events (packet sent, acked, lost) and let the application set pacing rate and congestion window. This, of course, requires a "limiter" CC algorithm to run on top of whatever the web app runs, and designing one is a research topic (Keith points to RFC 8084, and that's a good start, though I am not sure it's enough). Designing a good API for this is also a research topic, but there's some prior work that might be helpful (e.g. this). Level 3: instead of providing an API to set pacing rate and congestion window, we can let the web app load a WASM blob that is run by the QUIC stack itself instead of the congestion control. This has almost native-level capabilities, but a much higher complexity and worse security properties, so I am not sure it's worth it.
@cwmos The current draft spec seems to assume that:
If these assumptions can't practically be upheld in an implementation (you seem to also be a bit dubious about this in your comment), it seems to me that this is a bigger issue than just bandwidth prediction; the DatagramTransport interface will need to be refactored from what's there now. |
@vasilvv I do want to keep asking: what is the threat model behind the "security" risks that can be cured with a "limiter" CC algorithm? To slightly expand on what I wrote above, given that:
... What is the anti-congestion/fairness/safety-against-bad-apps property that the spec really wants the browser to enforce? I think it might be best to first answer this question (i.e. specify the threat model and desired properties) and then work backwards to figure out what features of this new API need to be governed or limited by mandatory controls running in the browser. Here would be my own suggestion as a straw-man along these lines: "the safety property that the browser enforces is to make sure that no matter how the page uses the WebTransport interface, each page will send outgoing traffic that is, in total across all WebTransport connections from that page, no more aggressive than four classical AIMD connections averaged over a 5-second sliding window. Downstream traffic is out of scope and is uncontrolled by the browser." This gives some wiggle room for "type 1" apps (e.g. innovations in app-specific congestion control) because the control is done over a longish-term sliding window and does not have to match the packet-for-packet cwnd evolution of classical TCP Reno. But it also governs the behavior of the page in the aggregate to make sure that a page cannot be arbitrarily abusive or unfair by opening lots of DatagramTransports. On the separate question of what API to support, my straw-man suggestion would be to give apps a choice between your "level 1" and "level 3."
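The straw-man property ("no more aggressive than four AIMD connections averaged over a 5-second sliding window") could be enforced with a simple sliding-window budget. In this sketch the browser supplies an estimate of what one classical AIMD flow would send per second; the class and its parameters are illustrative only.

```typescript
// Sliding-window budget sketch: total bytes sent by the page within the
// window must not exceed N flow-equivalents times the AIMD rate estimate.
class SlidingWindowBudget {
  private sends: { timeMs: number; bytes: number }[] = [];
  constructor(private windowMs = 5000, private flowEquivalents = 4) {}

  // Returns true (and records the send) if it fits within the budget.
  mayAccept(nowMs: number, bytes: number, aimdRateBytesPerSec: number): boolean {
    // Drop sends that have aged out of the sliding window.
    this.sends = this.sends.filter(s => nowMs - s.timeMs < this.windowMs);
    const used = this.sends.reduce((sum, s) => sum + s.bytes, 0);
    const budget =
      this.flowEquivalents * aimdRateBytesPerSec * (this.windowMs / 1000);
    if (used + bytes > budget) return false;
    this.sends.push({ timeMs: nowMs, bytes });
    return true;
  }
}
```

The longish window is what gives "type 1" apps wiggle room: they need not match AIMD packet-for-packet, only stay inside the aggregate envelope.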
The most "fun" things to defend against are channel monopolization and (D)DoS attacks. The Web security model is intended to ensure that "nothing fatal happens if you run your enemy's JavaScript". I think we can't get away from running a CC model in the browser that the user can't override or disable - in "ultimate freedom mode" (Vasili's mode 3), the user should be given free choice of what packet to send when - but the browser should refuse to put it on the network unless it fits within the CC envelope that the browser's CC has computed. I see Vasili's lower modes more as the browser giving more help with choosing what packet to send.
Hmm, I think we're still talking past one another. Let me try to say it a different way.
For a few reasons:
This is why I proposed the "you get to send as much as four CC-controlled flows, averaged over a 5-second timescale" language. We could amend this to make it a little more restrictive -- e.g., "the page gets to send, in total across all WebTransport connections, no more aggressively than 16 CC-controlled flows. This limit is imposed at all times. In addition, the average upstream traffic over a 5-second timescale must be less than 4 CC-controlled flows."
I definitely love the idea of an "upcall"-based API to congestion control (this is what Mosh does -- it only calculates the payload contents once the transport is willing to send a packet), but I just don't know how practical it is when the "upcall" really is an arm's length API between the browser and some origin-controlled JavaScript and we're talking about doing stuff on short timescales. Maybe it really is practical (meaning, maybe at least CUBIC, GCC, and BBR can be refactored in this way, and there's a reasonable answer for what happens if the CC's opinion of the congestion window has changed in between when it resolved the promise and when the promise actually produced a payload), in which case, great!
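The upcall pattern described above could look like this sketch, where the transport invokes an app-supplied producer only at the moment it actually has room to send, and the producer can bail out (or send less) if the allowance has shrunk in the meantime. All names here are hypothetical.

```typescript
// The app supplies a producer that builds a payload at send time,
// given the bytes the congestion controller currently allows.
type PayloadProducer = (maxBytes: number) => Uint8Array | null;

class UpcallSender {
  constructor(private produce: PayloadProducer) {}

  // Called by the transport when the congestion controller has room.
  // Because the payload is computed now, it reflects the freshest state.
  readyToSend(allowedBytes: number): Uint8Array | null {
    const payload = this.produce(allowedBytes);
    // If the producer overshoots the current allowance, drop the send
    // rather than oversend (it can retry on the next upcall).
    if (payload && payload.length > allowedBytes) return null;
    return payload;
  }
}
```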
In terms of APIs, I think you could look into the Web Audio API for possible inspiration. "Level 1" sounds a bit like what that spec does in terms of offering various "native" audio processing capabilities to web developers by way of various natively implemented node types. And then "level 3" sounds a bit like the fully programmable audio worklet.
With regards to performance and security, I think again the audio API could be quite interesting as a source of inspiration. It seems mostly built around a separation between the "control thread", which is essentially the "webpage" from where the audio capabilities are used, but not where the actual audio processing happens. Then there is a "rendering thread", where the actual audio processing happens, via user-configured but natively implemented nodes and/or fully programmable audio worklets. Both the "control thread" and the "rendering thread" would, I believe, be running in the same low-capability "content process" (where user content runs), and communicate with a backend in another process with access to system resources. It's that backend that would, via IPC, call into the "rendering thread" at each processing interval. So the "rendering thread" is a bit like a web worker, although it runs a specialized loop meant to be usable in the low-latency context of audio processing. The "control thread" is just running a normal HTML event loop, and does receive some events and so on, but those are not involved in the actual audio processing, hence are not subjected to the same kind of performance requirements. See https://webaudio.github.io/web-audio-api/#processing-model
From RTP over QUIC Section 4.1: "Additionally, a QUIC implementation MUST expose the recorded RTT [...]"
RTP over QUIC Section 5.1 describes the statistics necessary for application congestion control (most of which are not provided in WebTransport). |
We have half of those (…). Of the remaining, (…). That leaves:
Correlating departure times with arrival times from the server then seems like an app problem. Is providing these two stats all we need to offload this problem to the app? If so, should we add them? |
There were two reasons why we included different RTT values in the first draft.
1. If the congestion controller uses the acknowledgements and departure and arrival times only to calculate an RTT, we can directly use the RTT that is calculated by QUIC.
2. Since QUIC acknowledgements do not include an arrival timestamp (…). If the arrival time is not available in QUIC through any extension, it could also still be implemented at the application layer, but that would use more bandwidth and may be less precise if the application cannot access the exact timestamp at which packets were sent/received.
I think quic-go implements ECN, but I don't know if the congestion controller acts on it. I don't know about other implementations.
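The application-layer fallback mentioned in point 2 could be sketched as a tiny framing scheme: the sender stamps each datagram with a sequence number and send time, and the receiver reads the stamp alongside its own arrival clock (and could echo both back). The 12-byte header layout here is invented purely for illustration.

```typescript
// Invented wire format: 4-byte big-endian sequence number, 8-byte
// float64 send timestamp (ms), then the payload.
function encodeStamped(seq: number, sendMs: number, payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(12 + payload.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, seq);
  view.setFloat64(4, sendMs);
  out.set(payload, 12);
  return out;
}

// The receiver recovers the stamp and pairs it with its local arrival
// time to derive application-layer one-way delay samples.
function decodeStamped(buf: Uint8Array): { seq: number; sendMs: number; payload: Uint8Array } {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  return {
    seq: view.getUint32(0),
    sendMs: view.getFloat64(4),
    payload: buf.subarray(12),
  };
}
```

As the comment above notes, doing this in the app costs extra bandwidth and loses precision relative to transport-level timestamps, since the app's clocks sit behind send/receive queues.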
Meeting:
IIUC RTP over QUIC uses datagrams, and the promise returned by writeDatagrams gives you effectively nothing. |
@yutakahirano I've filed #400 on this. |
It sounds like we want a stat for (…). Regarding where this stat goes, I have a question for the group:
Are we only talking about media streaming over datagrams? |
Meeting:
Does the application require absolute packet-arrival and packet-departure times? I would think that any algorithm attempting RTP over WT would care about the delta between the two (i.e., packet transfer time) more than the absolute timestamps. If so, how does this differ from RTT/2?
https://www.rfc-editor.org/rfc/rfc8888.pdf gives details of a feedback format that seemed to the authors to support all the requirements of NADA, SCReAM and the Google Congestion Control algorithm as they were understood at the time. The important thing for GCC (which I was a bit familiar with at one point) is that it tries to detect changes in transit delay that indicate queue buildup and tries to act on it before the queue is full; for that, the more information you can have about packet arrival times, the better. |
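The delay-gradient signal GCC acts on can be illustrated with a simplified sketch. Real GCC runs the gradient through an arrival-time filter (Kalman-style) and adaptive thresholds, all of which are omitted here; this just shows why per-packet arrival times matter more than RTT/2.

```typescript
// One-way delay gradient for a pair of consecutive packets: how much
// the inter-arrival spacing grew relative to the inter-departure
// spacing. Persistently positive values suggest a queue is building
// before any loss occurs -- the signal GCC tries to act on early.
function delayGradientMs(
  prevSendMs: number, prevArriveMs: number,
  sendMs: number, arriveMs: number,
): number {
  return (arriveMs - prevArriveMs) - (sendMs - prevSendMs);
}

// Crude detector: flag queue buildup when the mean gradient over a
// batch of samples exceeds a threshold (illustrative, no filtering).
function queueBuilding(gradients: number[], thresholdMs = 1): boolean {
  const avg = gradients.reduce((a, b) => a + b, 0) / gradients.length;
  return avg > thresholdMs;
}
```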
While the browser has the ability to measure packet departure and arrival times, an application will have difficulty measuring this with much accuracy. It's important not to mix in queueing delays (in WHATWG streams or the QUIC stack send/receive queues). There are proposals for QUIC timestamps, receiver timestamps, and ACK frequency that can help provide the estimates with greater accuracy and frequency.
Meeting:
Question: Are values used for congestion control better provided via events than via stats? You don't want to encourage frequent polling of stats. But if the application is looking to calculate a target bitrate based on the info described in RFC 8888 or draft-ietf-avtcore-rtp-over-quic, then an Event might make more sense.
One of the use cases we are interested in is media streaming; when streaming media, the application can often decide to change the amount of data it sends based on how much bandwidth it expects to have available. Since all of the transports we define are congestion-controlled, we already naturally have to make some form of a guess regarding how much data the path can handle (even though it can be as rudimentary as CWND/RTT).
We should provide an API that lets the underlying transport library expose this kind of data to the Web application. My intuitive idea would be estimateBytesAvailable(time), or even estimateBytesAvailable(time, p) for models that accept the target probability of not oversending (e.g. Sprout).

cc @keithw, who is an expert on this topic and might have a much better idea of how this API should look.
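As a rough illustration of how an application might consume such an API, here is a toy stand-in for estimateBytesAvailable together with a frame-sizing helper. The internal model (a linear confidence discount on a bandwidth guess) is invented purely for demonstration; a real model like Sprout would be stochastic.

```typescript
// Toy stand-in for the proposed estimateBytesAvailable(time, p):
// how many bytes can likely be sent in the next horizonMs with
// probability p of not oversending, given a bandwidth estimate.
function estimateBytesAvailable(
  horizonMs: number,
  p: number,
  bandwidthBps: number,
): number {
  // Invented model: shave the point estimate as confidence p rises.
  const safetyFactor = 1 - p / 2;
  return Math.floor((bandwidthBps / 8) * (horizonMs / 1000) * safetyFactor);
}

// A media app would then pick the largest encoded frame that fits.
function pickFrameSize(budgetBytes: number, frameSizes: number[]): number {
  const fitting = frameSizes.filter(s => s <= budgetBytes);
  return fitting.length ? Math.max(...fitting) : Math.min(...frameSizes);
}
```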