Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch as many Os as possible during discovery before a shorter timeout #1853

Closed
yondonfu opened this issue Apr 27, 2021 · 2 comments
Closed
Assignees

Comments

@yondonfu
Copy link
Member

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

The discovery mechanism currently to create a list of Os to select from of size min(poolSize, maxInflight * 2) where poolSize is the total # of Os that B is aware of and maxInflight = 4 which is based on an 8 second HTTP timeout and an assumed segment length of 2 seconds [1].

A few implications of the current behavior of discovery:

  • B will cancel all in-flight discovery requests once it hits its target list size. If the target list size is hit very quickly, the requests are cancelled very quickly and in some cases can abruptly drop the connection in the middle of a TLS handshake which will a) cause an error to be logged until (although Use customer error logger to catch & handle errors from net/http #1646 will help with this) and b) if this happens a lot services can flag this as suspicious DoS like behavior. It would be nice to avoid such quick cancellations if possible
  • B refreshes its list once it is at half of its target list size. So, if the target list size is 8, B refreshes the list i.e. sends discovery requests to all known Os after 4 Os have been removed from the list. It would be nice to refresh the list less often
  • B prioritizes selecting from Os that are very quick to respond. B does not get an opportunity to select from Os that might be a bit slower to respond, but also able to return transcoded results quickly. It would be nice to give an opportunity to Os that are within an acceptable range of B to be selected

[1] The reader will notice that the maxInflight formula is a bit outdated due to the popularity of non-uniform segment lengths on the network today.

Describe the solution you'd like
A clear and concise description of what you want to happen.

This commit is a good starting point.

Note that a larger list size and less frequent refreshes will also mean that it will be more common for B to refresh an individual sessions (i.e. due to an expired auth token) by calling GetOrchestrator on an O right before it sends a segment to it. Might be ok.

Note that these changes will mean it will either be the case that the 1 second timeout has elapsed or all available Os have responded during discovery. The former is more likely. Thus, in most cases we should expect discovery to take 1 second before B's session list is updated. During a stream this is fine since the refresh happens in the background. But, at the start of a stream, the first segment can't be submitted until the refresh completes which means there will always be a 1 second delay at the start of a stream before a segment is submitted. We should observe the behavior of this in practice and if needed we can consider reducing the discovery timeout further as well.

Additional context
Add any other context or screenshots about the feature request here.

We should experiment with this and observe the following data points:

  • Distribution of work before and after these changes
  • Frequency of session list refreshes
  • Impact of a fixed 1 second delay of discovery on the first segment submission of a stream
@kyriediculous
Copy link
Contributor

I pushed some changes to this branch

  • Decreased the timeout to 1 second
  • Changed the suspension time to poolSize/numSessions (not sure if this is better but at least this avoids poolSize/numOrchs which would always be 1)
  • Changed the cut-off point at which we do a refresh to be at less or equal than 20% of the original selected set. I believe this ensures we would always refresh the set before it falls below 1 transcoder (the cutoff point would be 2 at the minimum).

@yondonfu
Copy link
Member Author

yondonfu commented May 4, 2021

@kyriediculous Can you open a PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants