Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-6515 Add dial timeout and stats #188

Merged
merged 8 commits into from
Apr 12, 2024

Conversation

benjirewis
Copy link
Member

@benjirewis benjirewis commented Apr 3, 2024

https://viam.atlassian.net/browse/RSDK-6515

Adds a timeout field to DialOptions that defaults to 10s. Adds a few basic metrics to dialing collected through debug logs.

mDNS does not seem to work well with the flutter SDK, so I disabled it in testing. Note that mDNS is also disabled in the RC app, too, as its apparent advantage in "limited connectivity" situations is not really relevant with a phone, as phones are usually operating with some kind of network connection.

Here are some statistics I collected with the new instrumentation that led me to choose a 10s default dial timeout (about 10x "ICE connected" times across different subnets). These statistics are collected across 100 runs in both "Situation"s.

Relevant Log Metric Same Subnet Different Subnet
link Dial function 0.003s 0.006s
link ICE connected 0.976s 1.170s
link ICE completed 1.144s 16.574s
link Average call update time 0.161s 0.150s
link Max call update time 0.171s 0.175s
link Number of call updates 8.000 8.000
link Auth 0.176s 0.196s
link Signaling server connection 0.177s 0.198s
link Peer connection creation 0.106s 0.069s

cc @dgottlieb , @maximpertsov , @ale7714

Copy link
Member Author

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if there are other metrics we're interested in grabbing in this code. @cheukt and I spoke a bit offline about somehow reporting these metrics back to the signaling server, but I think debug logs are fine for now until we know the shape of the data we're trying to collect.

if (disableWebRtc) {
return _dialDirectGrpc(address, opts, sessionCallback);
chan = _dialDirectGrpc(address, opts, sessionCallback).timeout(opts.timeout);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sense that this might not be the idiomatic way to provide a timeout to a Future, @clintpurser / @njooma . I went down quite a rabbit-hole of using Completers to try to guard wherever we call dial in the SDK, but what I've done here does seem to limit the amount of time we're able to await the Future<ClientChannelBase>... Hitting the timeout creates a "context deadline exceeded" error that seems to break app, though, so would love to talk offline about how to catch that and gracefully report a timeout in connection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idiomatically i think this is the correct way to add a timeout to a future, you can still await it normally this way and it will throw a TimeException upon timing out. Alternatively you can add a method into the .timeout() call to handle how you prefer, i believe using this would resolve your deadline exceeded exception, but we most likely want to catch that somewhere as well. Happy to dig in with you if you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool; I think an onTimeout function will likely be the way to go. Looking into it now, but may need some advice from other Flutterers to get a sense of how to not break the mobile/example apps. Thanks for the input!

@benjirewis benjirewis marked this pull request as ready for review April 3, 2024 17:36
@benjirewis benjirewis requested a review from a team as a code owner April 3, 2024 17:36
Copy link
Member

@purplenicole730 purplenicole730 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to wait until the others weigh in since I don't have the necessary background information, but looks good!

Copy link
Member

@clintpurser clintpurser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some suggested changes, but nothing blocking

lib/src/rpc/dial.dart Outdated Show resolved Hide resolved
@@ -129,16 +134,23 @@ Future<ClientChannelBase> dial(String address, DialOptions? options, String Func
} catch (e) {
_logger.d('Error dialing with mDNS; falling back to other methods', error: e);
}
mdnsSW.stop();
_logger.d('STATS: mDNS discovery took ${mdnsSW.elapsed}');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would love to capture this data somehow so we could historically track.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah NetCode will be working on that eventually (gathering this data and tracking it to get a sense of how our connection establishment performance looks over time). For now, I think we're ok just debug-logging a number of measurements in this dial code and in other connection establishment paths.

if (disableWebRtc) {
return _dialDirectGrpc(address, opts, sessionCallback);
chan = _dialDirectGrpc(address, opts, sessionCallback).timeout(opts.timeout);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idiomatically i think this is the correct way to add a timeout to a future, you can still await it normally this way and it will throw a TimeException upon timing out. Alternatively you can add a method into the .timeout() call to handle how you prefer, i believe using this would resolve your deadline exceeded exception, but we most likely want to catch that somewhere as well. Happy to dig in with you if you want.

lib/src/rpc/dial.dart Outdated Show resolved Hide resolved
lib/src/rpc/dial.dart Outdated Show resolved Hide resolved
lib/src/rpc/dial.dart Outdated Show resolved Hide resolved
lib/src/rpc/dial.dart Show resolved Hide resolved
@cheukt cheukt self-requested a review April 11, 2024 22:13
@cheukt
Copy link
Member

cheukt commented Apr 11, 2024

can you add a column to the table that specifies which log you were looking at to get those times? and can flutter use structured logging? would be nice to have the specific data separated out from the message

@benjirewis
Copy link
Member Author

can you add a column to the table that specifies which log you were looking at to get those times? and can flutter use structured logging? would be nice to have the specific data separated out from the message

Yep can add that column! Updating the logging a little bit now. I'm not sure about structured logging without an external library like structured_logger, but looking into it.

Copy link
Member

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! can you add a ticket for flutter sdk to maybe add structured logging

@benjirewis benjirewis merged commit 2a86005 into viamrobotics:main Apr 12, 2024
2 checks passed
@benjirewis benjirewis deleted the instrumentation branch April 12, 2024 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants