Consider adding activities for potentially long running network operations #93832
Tagging subscribers to this area: @dotnet/ncl
Issue Details
While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.
This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (Redis, various database drivers, etc.).
|
I think instrumenting DNS/Sockets with more information would be fine, but I want to point out that it could become misleading / not match user expectations. Related: #63159 (comment) You will rarely see a "perfect" trace like this
When a request comes in it may kick off a new connection attempt, but then get picked up by a different connection that became available before the new connection is established.
Or even
From the HTTP perspective, you could avoid some confusion by inserting a different activity instead
That highlights the issue that the HTTP request is decoupled from the establishment of new connections. |
This seems ideal for the HTTP request case. In that case, would we also want to disable the DNS and socket activities for HTTP? I'd be interested to see if we can get a similar trace for connection establishment in other clients and whether a similar problem exists there (presumably it does if there's smart connection pooling going on). |
This feels like something that should be done with Events on activities. Adding too many extra activities for these sub-operations will be expensive and noisy. |
I agree in general, but not for expensive networking operations.
Let's explore this then. Do you have something more concrete we could investigate and visualize in tools? |
I think Sam is proposing that you visualize this list of span events that is included in the OTLP proto: I assume when you look at it right now it would be empty, but if you want to experiment with putting some data in there you can use the Activity.AddEvent API.
This sounds an awful lot like Activity.IsAllDataRequested property (doc guidance). In general Activity events aren't designed to scale to a high verbosity level but a handful of messages with key timings should work fine. |
Let's take @MihaZupan's scenario: would we add begin/end events for these operations and visualize those as named events on the overall activity timeline? E.g., HttpClient does the following:
Activity.Current?.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current?.AddEvent(new("resolved connection"));
Is that the idea? |
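A minimal sketch of the event-based idea, combined with the Activity.IsAllDataRequested guard mentioned earlier in the thread. The source name `Example.Http.Sketch`, the `RequestEventsSketch` class, and the `GetConnectionAsync` stand-in are hypothetical and exist only for illustration; this is not how HttpClient is implemented.

```csharp
#nullable enable
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class RequestEventsSketch
{
    // Hypothetical source name, used only for this illustration.
    public static readonly ActivitySource Source = new("Example.Http.Sketch");

    // Stand-in for acquiring a pooled connection; the real internals differ.
    private static Task GetConnectionAsync() => Task.Delay(10);

    public static async Task<Activity?> SendAsync()
    {
        // StartActivity returns null when no listener samples this source.
        using Activity? activity = Source.StartActivity("HTTP GET", ActivityKind.Client);

        // Record events only when a listener asked for full data.
        if (activity?.IsAllDataRequested == true)
            activity.AddEvent(new ActivityEvent("wait for connection"));

        await GetConnectionAsync();

        if (activity?.IsAllDataRequested == true)
            activity.AddEvent(new ActivityEvent("resolved connection"));

        return activity;
    }

    // Registers a listener that records everything from the sketch source.
    public static ActivityListener ListenToSketchSource()
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Each event carries its own timestamp, so a viewer that renders span events can place "wait for connection" and "resolved connection" on the request's timeline without any extra spans.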
@davidfowl is there a specific customer ask behind this proposal (eg, from Aspire)? |
We were debugging connection slowness with Aspire, and more details would have helped speed that up. |
I would suggest that for the events you don't want begin/end items, just the ones that have been achieved, so you'd get something like:
If any of them are skipped it doesn't so much matter. But if there is a DNS item in there you can see how that relates to the overall time for the span. |
I like the idea of using events and also doing it in HttpClient so that we have a better view of the http connection time (based on #93832 (comment)) |
Triage: this doesn't look critical compared to other asks, tentatively pushing to future. @samsp-msft @davidfowl if you have concerns with that, let's sync! |
It's not critical, no, but it seems cheap to add. Do you have an idea of the cost? |
This should be a relatively low cost way to instrument http client in a way that can really help developers understand what is going on in the http stack in a complex production environment. Without this kind of telemetry, they don't know if it's DNS that is taking time, if a new connection is needed etc. |
I fail to understand how switching from sub-activities to events answers the concerns from #93832 (comment). If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meantime (closing the HTTP span), the potentially long-running connection attempt A would have no
Is this really what we want? Or am I misunderstanding something? Note: the |
We have the |
I thought we agreed that doing it at the http connection layer makes that problem go away. We would step back from trying to do it at the dns layer and focus on http only.
I don't see why it would be better. If I had to pick one, though, I would pick this one. |
For the purpose of this, we shouldn't be tracking connections as activities, as they are not specific to a single request: they are potentially long-running and, in most cases, outside the control of the user.
Metrics tell you the aggregate story, but not what happened for an individual request. In the above image, the bottom blue bar is the http client activity for making a request to the API service. The yellow bar is the API service handling that request. There is a 129ms gap between them - currently the developer has no idea what that is due to. The goal of this should be to supply events so the beginning of the outgoing trace includes the info as to why. |
I talked with @antonfirsov and have a bit more of an idea of the problem. Because of the connection pool, requests and connections are pretty independent; it's only when the request is sent on an established connection that the relationship is created, and then it's only for the duration of that request. |
Looks like this should be feasible with ActivityLinks, however the best way to implement it would be by utilizing #97680:
|
Can I suggest that, if this is implemented with Links, a separate ActivitySource is used? This will ensure that a user from the outside will be able to opt in to creating those Activity objects and not have to do post-processing in the collector or in a custom sampler. For events, there is an issue around the size of the individual Activity/Span if it gets really noisy (from a backend perspective). I suppose overall the ask is: can we make this opt-in rather than opt-out? For links, the best way to do this is a separate ActivitySource. |
That makes sense. Use something like |
That's the idea, but I would imagine it would be a check on the listeners for the connections ActivitySource instead of the Activity null check. If it's a link from the HTTP Activity to the connection, though, I'd imagine you don't actually need the The other question I had was about the structure of the connection trace and, specifically, which Activity you are planning on linking to in that tree. The edge case I'm thinking about is that you could add listeners to the connections ActivitySource after the application has started, and since those connections are long-lived, there may not have been an Activity created. You're also having to store the trace and span information in memory for every connection, so that may add up in memory? |
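A rough sketch of the opt-in idea discussed above: connection activities come from their own ActivitySource, and the request activity links to the connection only when a listener has subscribed to that source. The source names and the `ConnectionLinkSketch` class are hypothetical. As a simplification, the connection is established before the request activity starts, so the link can be passed at creation time; in a real connection pool the connection is typically only known later.

```csharp
#nullable enable
using System;
using System.Diagnostics;
using System.Linq;

public static class ConnectionLinkSketch
{
    // Hypothetical source names; two sources let users opt in to each separately.
    public static readonly ActivitySource RequestSource = new("Example.Http.Requests");
    public static readonly ActivitySource ConnectionSource = new("Example.Http.Connections");

    // Subscribes to exactly the named sources and records everything from them.
    public static ActivityListener Listen(params string[] sourceNames)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => sourceNames.Contains(source.Name),
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }

    // Returns the context of the connection-setup activity, or null when
    // nobody listens to the connection source (StartActivity returns null).
    public static ActivityContext? EstablishConnection()
    {
        using Activity? connection =
            ConnectionSource.StartActivity("connection setup", ActivityKind.Internal);
        connection?.AddEvent(new ActivityEvent("dns resolved"));
        connection?.AddEvent(new ActivityEvent("socket connected"));
        return connection?.Context;
    }

    // Starts the request activity, linking to the connection when one was traced.
    public static Activity? SendRequest(ActivityContext? connectionContext)
    {
        ActivityLink[]? links = connectionContext is { } context
            ? new[] { new ActivityLink(context) }
            : null;

        using Activity? request = RequestSource.StartActivity(
            "HTTP GET", ActivityKind.Client, parentContext: default, tags: null, links: links);
        return request;
    }
}
```

With only `Example.Http.Requests` subscribed, no connection activity is created and the request carries no link, so the extra cost is paid only by users who opt in.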
The activity is being created and started in runtime/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs Lines 112 to 117 in 9068070
There would be a root
I have yet to understand and figure out what's best here, but I think the conditions for creating a connection activity would be similar to the conditions for creating the request activity just with other ActivitySource & DiagnosticListener: runtime/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs Lines 59 to 70 in 9068070
This is an existing problem with connection metrics already. I think we have no other choice but to accept it; I don't see a solution with the existing distributed tracing APIs.
I'm not sure if connection tracing costs would be significant compared to request tracing costs under high load, we need to benchmark to see. It will be an opt-in feature anyways. |
🥳 |
Does this need any extra work from consumers to be emitted? Is a new property being added to the OpenTelemetry Instrumentation packages to add these links, or will they just start showing up automatically? Asking as I'm interested in getting this into our traces when it becomes available. |
Edit by @antonfirsov: The plan is to implement this feature request by introducing a separate Activity for each pooled connection, and link those to the HTTP Request Activity via ActivityLink. See #93832 (comment).
cc @noahfalk @samsp-msft