Is your feature request related to a problem? Please describe.
The discovery mechanism currently creates a list of Os to select from of size min(poolSize, maxInflight * 2), where poolSize is the total # of Os that B is aware of and maxInflight = 4, which is based on an 8 second HTTP timeout and an assumed segment length of 2 seconds [1].
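The sizing rule above can be sketched as follows. This is an illustrative sketch of the formula as described, not the actual go-livepeer code; the function names are assumptions.

```go
package main

import "fmt"

// Constants from the description above: an 8 second HTTP timeout and an
// assumed segment length of 2 seconds. Names here are illustrative.
const (
	httpTimeoutSec = 8
	segDurationSec = 2
)

// maxInflight derives the assumed number of concurrent in-flight segments.
func maxInflight() int {
	return httpTimeoutSec / segDurationSec // = 4
}

// targetListSize returns min(poolSize, maxInflight * 2), the target size of
// B's selection list as described above.
func targetListSize(poolSize int) int {
	target := maxInflight() * 2
	if poolSize < target {
		return poolSize
	}
	return target
}

func main() {
	fmt.Println(targetListSize(50)) // large pool: capped at maxInflight*2 = 8
	fmt.Println(targetListSize(3))  // small pool: limited by poolSize = 3
}
```

Note that with any pool larger than 8 Os, the target list size is fixed at 8 regardless of how many Os are available, which is the behavior the implications below follow from.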
A few implications of the current behavior of discovery:
B will cancel all in-flight discovery requests once it hits its target list size. If the target list size is hit very quickly, the requests are cancelled very quickly, and in some cases the connection can be dropped abruptly in the middle of a TLS handshake, which will a) cause an error to be logged (although #1646, "Use custom error logger to catch & handle errors from net/http", will help with this) and b) if this happens a lot, services can flag it as suspicious, DoS-like behavior. It would be nice to avoid such quick cancellations if possible.
B refreshes its list once it is at half of its target list size. So, if the target list size is 8, B refreshes the list (i.e. sends discovery requests to all known Os) after 4 Os have been removed from the list. It would be nice to refresh the list less often.
B prioritizes selecting from Os that are very quick to respond. B does not get an opportunity to select from Os that might be a bit slower to respond but are still able to return transcoded results quickly. It would be nice to give Os that are within an acceptable latency range of B an opportunity to be selected.
[1] The reader will notice that the maxInflight formula is a bit outdated due to the popularity of non-uniform segment lengths on the network today.
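The refresh trigger described in the second implication can be sketched as below. This is a minimal sketch of the rule as stated (refresh once the list drops to half of its target size); the function name is an assumption, not the actual go-livepeer identifier.

```go
package main

import "fmt"

// shouldRefresh reports whether B should refresh its session list under the
// current rule: refresh once the list has shrunk to half (or less) of the
// target list size.
func shouldRefresh(currentSize, targetSize int) bool {
	return currentSize <= targetSize/2
}

func main() {
	// With a target list size of 8, a refresh fires after 4 Os are removed.
	fmt.Println(shouldRefresh(5, 8)) // false: still above half
	fmt.Println(shouldRefresh(4, 8)) // true: at half, refresh
}
```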
Describe the solution you'd like
Use the OrchestratorPool size as the target # of Os for discovery, as is done in this commit.
There may be some other considerations around the numOrchs field stored in BroadcastSessionsManager, which influences how long Os are temporarily suspended for and also when to refresh the session list.
Note that a larger list size and less frequent refreshes will also mean that it will be more common for B to refresh an individual session (i.e. due to an expired auth token) by calling GetOrchestrator on an O right before it sends a segment to it. This might be ok.
Note that these changes will mean it will either be the case that the 1 second timeout has elapsed or all available Os have responded during discovery. The former is more likely. Thus, in most cases we should expect discovery to take 1 second before B's session list is updated. During a stream this is fine since the refresh happens in the background. But, at the start of a stream, the first segment can't be submitted until the refresh completes which means there will always be a 1 second delay at the start of a stream before a segment is submitted. We should observe the behavior of this in practice and if needed we can consider reducing the discovery timeout further as well.
Additional context
We should experiment with this and observe the following data points:
Distribution of work before and after these changes
Frequency of session list refreshes
Impact of a fixed 1 second delay of discovery on the first segment submission of a stream
Changed the suspension time to poolSize/numSessions (not sure if this is better but at least this avoids poolSize/numOrchs which would always be 1)
Changed the cut-off point at which we do a refresh to be less than or equal to 20% of the original selected set. I believe this ensures we would always refresh the set before it falls below 1 transcoder (the cutoff point would be 2 at the minimum).
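The two tweaks in this comment can be sketched as below. Function names and the flooring behavior are assumptions based on the description, not the exact code from the referenced change.

```go
package main

import "fmt"

// suspensionPenalty sketches the changed suspension time: poolSize/numSessions
// instead of poolSize/numOrchs (which would always be 1 once numOrchs equals
// the pool size).
func suspensionPenalty(poolSize, numSessions int) int {
	return poolSize / numSessions
}

// refreshCutoff sketches the changed refresh trigger: refresh once the session
// list is at or below 20% of the originally selected set, floored at 2 so the
// set is always refreshed before it can fall below 1 transcoder.
func refreshCutoff(selectedSetSize int) int {
	cutoff := selectedSetSize / 5 // 20% of the original set
	if cutoff < 2 {
		cutoff = 2
	}
	return cutoff
}

func main() {
	fmt.Println(refreshCutoff(20)) // 20% of 20 = 4
	fmt.Println(refreshCutoff(5))  // 20% of 5 = 1, floored to 2
	fmt.Println(suspensionPenalty(20, 4))
}
```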