Fetch as many Os as possible during discovery before a shorter timeout #1853

yondonfu · 2021-04-27T19:41:43Z

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

The discovery mechanism currently to create a list of Os to select from of size min(poolSize, maxInflight * 2) where poolSize is the total # of Os that B is aware of and maxInflight = 4 which is based on an 8 second HTTP timeout and an assumed segment length of 2 seconds [1].

A few implications of the current behavior of discovery:

B will cancel all in-flight discovery requests once it hits its target list size. If the target list size is hit very quickly, the requests are cancelled very quickly and in some cases can abruptly drop the connection in the middle of a TLS handshake which will a) cause an error to be logged until (although Use customer error logger to catch & handle errors from net/http #1646 will help with this) and b) if this happens a lot services can flag this as suspicious DoS like behavior. It would be nice to avoid such quick cancellations if possible
B refreshes its list once it is at half of its target list size. So, if the target list size is 8, B refreshes the list i.e. sends discovery requests to all known Os after 4 Os have been removed from the list. It would be nice to refresh the list less often
B prioritizes selecting from Os that are very quick to respond. B does not get an opportunity to select from Os that might be a bit slower to respond, but also able to return transcoded results quickly. It would be nice to give an opportunity to Os that are within an acceptable range of B to be selected

[1] The reader will notice that the maxInflight formula is a bit outdated due to the popularity of non-uniform segment lengths on the network today.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Use the OrchestratorPool size as the target # of Os for discovery as is done in this commit
Reduce the discovery timeout to 1 second
There may be some other considerations around the numOrchs field stored in BroadcastSessionsManager which influences how long Os are temporarily suspended for and also when to refresh the session list

This commit is a good starting point.

Note that a larger list size and less frequent refreshes will also mean that it will be more common for B to refresh an individual sessions (i.e. due to an expired auth token) by calling GetOrchestrator on an O right before it sends a segment to it. Might be ok.

Note that these changes will mean it will either be the case that the 1 second timeout has elapsed or all available Os have responded during discovery. The former is more likely. Thus, in most cases we should expect discovery to take 1 second before B's session list is updated. During a stream this is fine since the refresh happens in the background. But, at the start of a stream, the first segment can't be submitted until the refresh completes which means there will always be a 1 second delay at the start of a stream before a segment is submitted. We should observe the behavior of this in practice and if needed we can consider reducing the discovery timeout further as well.

Additional context
Add any other context or screenshots about the feature request here.

We should experiment with this and observe the following data points:

Distribution of work before and after these changes
Frequency of session list refreshes
Impact of a fixed 1 second delay of discovery on the first segment submission of a stream

The text was updated successfully, but these errors were encountered:

kyriediculous · 2021-04-30T12:21:53Z

I pushed some changes to this branch

Decreased the timeout to 1 second
Changed the suspension time to poolSize/numSessions (not sure if this is better but at least this avoids poolSize/numOrchs which would always be 1)
Changed the cut-off point at which we do a refresh to be at less or equal than 20% of the original selected set. I believe this ensures we would always refresh the set before it falls below 1 transcoder (the cutoff point would be 2 at the minimum).

yondonfu · 2021-05-04T18:28:30Z

@kyriediculous Can you open a PR?

yondonfu mentioned this issue Apr 27, 2021

Minimize discovery requests from broadcaster #1850

Closed

yondonfu assigned kyriediculous May 3, 2021

kyriediculous mentioned this issue May 5, 2021

server,discovery: fetch as many Os as possible during discovery before a shorter timeout #1874

Closed

5 tasks

yondonfu closed this as completed Dec 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fetch as many Os as possible during discovery before a shorter timeout #1853

Fetch as many Os as possible during discovery before a shorter timeout #1853

yondonfu commented Apr 27, 2021

kyriediculous commented Apr 30, 2021

yondonfu commented May 4, 2021

Fetch as many Os as possible during discovery before a shorter timeout #1853

Fetch as many Os as possible during discovery before a shorter timeout #1853

Comments

yondonfu commented Apr 27, 2021

kyriediculous commented Apr 30, 2021

yondonfu commented May 4, 2021