Consider adding activities for potentially long running network operations #93832

davidfowl · 2023-10-22T20:27:59Z

Edit by @antonfirsov: The plan is to implement this feature request by introducing a separate Activity for each pooled connection, and link those to the HTTP Request Activity via ActivityLink. See #93832 (comment).

While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.

This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

cc @noahfalk @samsp-msft

ghost · 2023-10-22T20:28:02Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.

This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

Author:	davidfowl
Assignees:	-
Labels:	`area-System.Net`
Milestone:	-

MihaZupan · 2023-10-23T15:26:52Z

I think instrumenting DNS/Sockets with more information would be fine, but I want to point out that it could become misleading / not match user expectations. Related: #63159 (comment)

You will rarely see a "perfect" trace like this

◄────────────────────HTTP Request────────────────────►
  ◄──DNS──►
           ◄──Socket──►
                       ◄──TLS──►
                                  ◄──Dependency──►

When a request comes in it may kick off a new connection attempt, but then get picked up by a different connection that became available before the new connection is established.
As a result, you could see traces like

◄─────────HTTP Request───────────►
    ◄──────DNS─────►
      ◄──Backend dependency─►
                    ◄────Socket────►
                                    ◄─────TLS────►

Or even

◄─────────HTTP Request───────────►
      ◄──Backend dependency─►
                                       ◄──────DNS─────►
                                                       ◄────Socket────► ...

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

◄────────────────────────HTTP Request───────────────────────────►
  ◄──Waiting for an available connection──►
                                             ◄──Dependency──►

That highlights the issue that the HTTP request is decoupled from the establishment of new connections.

davidfowl · 2023-10-23T15:44:31Z

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

This seems ideal for the HTTP request case. It seems then it's possible that we'd also want to disable the dns and socket activities for the HTTP case? I'd be interested to see if we can get a similar trace for connection establishment in other clients and see if a similar problem exists (presumably it does if there's smart connection pooling going on).

samsp-msft · 2023-10-24T17:04:54Z

This feels like something that should be done with Events on activities. Putting too many extra activities for this sub-operations will be expensive and noisy.
I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data. That could be extended to be somewhat similar to ILogger, with methods to add event data that will be conditional on the requested verbosity level.
The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

davidfowl · 2023-10-24T18:32:38Z

Putting too many extra activities for this sub-operations will be expensive and noisy.

I agree in general but not for expensive networking operations.

The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

noahfalk · 2023-10-25T04:51:06Z

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

I think Sam is proposing that you visualize this list of span events that is included in the OTLP proto:

https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L232

I assume when you look at it right now it would be empty, but if you want to experiment with putting some data in there you can use the Activity.AddEvent API.

I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data

This sounds an awful lot like Activity.IsAllDataRequested property (doc guidance). In general Activity events aren't designed to scale to a high verbosity level but a handful of messages with key timings should work fine.

davidfowl · 2023-10-25T06:13:21Z

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:

Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));

Is that the idea?

antonfirsov · 2024-01-18T21:48:48Z

@davidfowl is there a specific customer ask behind this proposal (eg, from Aspire)?

davidfowl · 2024-01-18T21:55:34Z

We were debugging connection slowness with aspire and more details would have helped speed that up.

samsp-msft · 2024-01-18T22:30:07Z

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:
Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));
Is that the idea?

I would suggest that for the events you don't want begin/end items, just the ones that have been achieved, so you'd get something like:

DNS address resolved
Socket opened
TLS Negotiated
Request sent

If any of them are skipped it doesn't so much matter. But if there is a DNS item in there you can see how that relates to the overall time for the span.

davidfowl · 2024-01-18T23:13:10Z

I like the idea of using events and also doing it in HttpClient so that we have a better view of the http connection time (based on #93832 (comment))

antonfirsov · 2024-04-09T14:56:42Z

Triage: this doesn't look critical compared to other asks, tentatively pushing to future. @samsp-msft @davidfowl if you have concerns with that, let's sync!

davidfowl · 2024-04-09T15:22:40Z

It's not critical no, but it seems cheap to add. Do you have an idea of the cost?

samsp-msft · 2024-04-10T17:46:20Z

This should be a relatively low cost way to instrument http client in a way that can really help developers understand what is going on in the http stack in a complex production environment. Without this kind of telemetry, they don't know if it's DNS that is taking time, if a new connection is needed etc.
A handful of events that can act as markers/signposts in the trace will go a long way to resolving that issue.

antonfirsov · 2024-04-11T18:06:19Z

I fail to understand how does switching from sub-activities to events answer the concerns from #93832 (comment). If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

◄────────HTTP Request───────►
      ▲ ◄────Backend──────►
   ◄──|────────────────────────────────Connection A ────────────────────────►
      |                         ▲                  ▲                      ▲
      |                         ╚⭍ DNS resolved    ╚⭍ socket connected    ╚⭍ TLS done
      |
      ╚ served by connection B

Is this really what we want? Or am I'm misunderstanding something?

Note: the Waiting for an available connection activity proposed at the end of #93832 (comment) has no knowledge of DNS/socket/TLS. It's about the request sitting in the queue until a connection becomes available.

antonfirsov · 2024-04-11T18:12:55Z

they don't know if it's DNS that is taking time

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add more metrics instead?

davidfowl · 2024-04-15T19:05:36Z

I fail to understand how does switching from sub-activities to events answer the concerns from #93832 (comment). If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

I thought we agreed that doing it at the http connection layer makes that problem go away. We would step back from trying to do it at the dns layer and focus on http only.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add #88383 instead?

I don't why it would be better. If I had to pick one though I would pick this one.

samsp-msft · 2024-04-15T20:19:12Z

For the purpose of this - we shouldn't be tracking connections as activities as they are not specific to a single request - they are potentially long running, and for the most case outside the control of the user.
The get requests from the client are what the developer has control over, and so wants to understand what is happening. That's why I am suggesting putting events on that activity that correspond to how that request was processed. If it needs to establish a connection, and we can do the association between that connection and the request, then lets tag the request with the events that did occur with processing the request. If it turns out that dns was slow, or establishing the connection was slow, then those will show as late markers on the activity. It it takes time to get a connection from the pool, that will also show. The beauty of firing events is that they are optional - it doesn't have to try and line up parent/child relationships for them.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add #88383 instead?

Metrics tell you the aggregate story, but not what happened for an individual request.

In the above image, the bottom blue bar is the http client activity for making a request to the API service. The yellow bar is the API service handling that request. There is a 129ms gap between them - currently the developer has no idea what that is due to. The goal of this should be to supply events so the beginning of the outgoing trace includes the info as to why.

samsp-msft · 2024-04-15T20:58:09Z

I talked with @antonfirsov and have a bit more of an idea of the problem. Because of the connection pool the requests and connections are pretty independent, its only when the request is sent on an established connection that the relationship is created, and then its only for the duration of that request.
One possibility that I am warming up to is to create separate activities/spans for the connections, and then use links between the activities/spands(rather than a parent child) to establish the relationships. The connection could have events for each of its stages of operation, including DNS, TLS handshake etc. Some of those events could be for specific requests, and then tag the activity for the request as part of the event. Similarly the activity for the request could be given back the id for the activity/span for the connection, which it would add as link.
The connection would then not have a parent/child relationship with any of the requests, but would have events that link to the spans.
The request would have a link to the connection, and then events on that connection would show when that request was handled.

antonfirsov · 2024-04-16T22:08:20Z

Looks like this should be feasible with ActivityLinks, however the best way to implement it would be by utilizing #97680:

Without Activity.AddLink(link) we need to know the connection ActivityContext in advance which forces us to create an additional activity request being processed by connection when a connection becomes available. Such an activity would bring overhead with little or no practical value.
@samsp-msft 's Consider adding activities for potentially long running network operations #93832 (comment) proposes to link the connection being assigned to the request. This is an unambiguous link between a request and connection, however there would be no links to failing connections since they are never dispatched to requests. This feels like loss of important diagnostic data to me. To overcome this, we may consider also linking the connection an incoming request spans (if any). If a connection fails it would be visible on its originating request telemetry data, however note that the failure of a connection doesn't necessarily imply that the originating request failed. If we decide to add this link as well, we definitely need Support adding links to Activity after it is created #97680.

martinjt · 2024-04-17T07:52:55Z

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

This will ensure that a user from the outside will be able to opt-in to create those Activity object and not have to do post-processing in the collector, or in a custom sampler.

For events, there is an issue around the size of the individual Activity/Span if it get really noisy (from a backend perspective).

I suppose overall, the ask is, can we make this opt-in vs opt-out, and for links the best way to do this is a separate ActivitySource.

samsp-msft · 2024-04-17T17:19:43Z

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

That makes sense. Use something like System.Net.Http.Connections for the connection ActivitySource. If there is no connection Activity, it won't get linked on the request, and vice versa, the connection won't link back for the request.

martinjt · 2024-04-17T18:01:59Z

That's the idea, but it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check I would imagine.

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created. You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

antonfirsov · 2024-04-18T13:01:18Z

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The activity is being created and started in DiagnosticHandler before forwarding the request to the innermost platform handler. We have no knowledge of the concept of connections at that point.

runtime/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs

Lines 112 to 117 in 9068070

    
           Activity? activity = CreateActivity(request); 
        
           // Start activity anyway if it was created. 
        
           if (activity is not null) 
        
           { 
        
               activity.Start();

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

There would be a root Activity for every connection which can be linked to the requests using that connection when using SocketsHttpHandler connection tracing enabled. DNS+TCP+TLS diagnostic info would be added either by spinning up sub-activities from the connection activity or by emitting ActivityEvents to the connection activity.

it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check

I have yet to understand and figure out what's best here, but I think the conditions for creating a connection activity would be similar to the conditions for creating the request activity just with other ActivitySource & DiagnosticListener:

runtime/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs

Lines 59 to 70 in 9068070

    
           if (s_activitySource.HasListeners()) 
        
           { 
        
               activity = s_activitySource.CreateActivity(DiagnosticsHandlerLoggingStrings.ActivityName, ActivityKind.Client); 
        
           } 
        
           if (activity is null) 
        
           { 
        
               if (Activity.Current is not null || s_diagnosticListener.IsEnabled(DiagnosticsHandlerLoggingStrings.ActivityName, requestMessage)) 
        
               { 
        
                   activity = new Activity(DiagnosticsHandlerLoggingStrings.ActivityName); 
        
               } 
        
           }

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created.

This is an existing problem with connection metrics already. I think we have no other choice but to accept it, I don't se a solution with the existing distributed tracing APIs.

You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

I'm not sure if connection tracing costs would be significant compared to request tracing costs under high load, we need to benchmark to see. It will be an opt-in feature anyways.

davidfowl · 2024-07-14T01:31:30Z

🥳

julealgon · 2024-07-15T13:43:14Z

Does this need any extra work from consumers to be emitted? Is a new property being added to the OpenTelemetry Instrumentation packages to add these links, or will they just start showing up automatically?

Asking as I'm interested in getting this into our traces when it becomes available.

davidfowl added the area-System.Net label Oct 22, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Oct 22, 2023

noahfalk added the enhancement Product code improvement that does NOT require public API changes/additions label Oct 25, 2023

antonfirsov mentioned this issue Mar 8, 2024

Move HTTP client metrics out of .NET conventions open-telemetry/semantic-conventions#801

Merged

3 tasks

antonfirsov removed the untriaged New issue has not been triaged by the area owner label Apr 9, 2024

antonfirsov added this to the Future milestone Apr 9, 2024

antonfirsov modified the milestones: Future, 9.0.0 Apr 16, 2024

lmolkova mentioned this issue Apr 16, 2024

[API Proposal]: Allow adding links after activity creation #101146

Closed

antonfirsov mentioned this issue Apr 16, 2024

Support adding links to Activity after it is created #97680

Closed

antonfirsov self-assigned this Apr 22, 2024

This was referenced May 1, 2024

Child Activity outliving its parent corrupts Activity.Current #101775

Open

Connection Activities in SocketsHttpHandler #101814

Closed

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label May 2, 2024

antonfirsov mentioned this issue Jun 24, 2024

Activities for Http Connections, Dns, Sockets and SslStream #103922

Merged

antonfirsov closed this as completed in #103922 Jul 13, 2024

github-actions bot locked and limited conversation to collaborators Aug 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider adding activities for potentially long running network operations #93832

Consider adding activities for potentially long running network operations #93832

davidfowl commented Oct 22, 2023 •

edited by antonfirsov

Loading

ghost commented Oct 22, 2023

MihaZupan commented Oct 23, 2023 •

edited

Loading

davidfowl commented Oct 23, 2023

samsp-msft commented Oct 24, 2023

davidfowl commented Oct 24, 2023

noahfalk commented Oct 25, 2023

davidfowl commented Oct 25, 2023

antonfirsov commented Jan 18, 2024

davidfowl commented Jan 18, 2024

samsp-msft commented Jan 18, 2024

davidfowl commented Jan 18, 2024 •

edited

Loading

antonfirsov commented Apr 9, 2024

davidfowl commented Apr 9, 2024

samsp-msft commented Apr 10, 2024

antonfirsov commented Apr 11, 2024 •

edited

Loading

antonfirsov commented Apr 11, 2024 •

edited

Loading

davidfowl commented Apr 15, 2024

samsp-msft commented Apr 15, 2024

samsp-msft commented Apr 15, 2024

antonfirsov commented Apr 16, 2024 •

edited

Loading

martinjt commented Apr 17, 2024

samsp-msft commented Apr 17, 2024

martinjt commented Apr 17, 2024

antonfirsov commented Apr 18, 2024

davidfowl commented Jul 14, 2024

julealgon commented Jul 15, 2024

Consider adding activities for potentially long running network operations #93832

Consider adding activities for potentially long running network operations #93832

Comments

davidfowl commented Oct 22, 2023 • edited by antonfirsov Loading

ghost commented Oct 22, 2023

MihaZupan commented Oct 23, 2023 • edited Loading

davidfowl commented Oct 23, 2023

samsp-msft commented Oct 24, 2023

davidfowl commented Oct 24, 2023

noahfalk commented Oct 25, 2023

davidfowl commented Oct 25, 2023

antonfirsov commented Jan 18, 2024

davidfowl commented Jan 18, 2024

samsp-msft commented Jan 18, 2024

davidfowl commented Jan 18, 2024 • edited Loading

antonfirsov commented Apr 9, 2024

davidfowl commented Apr 9, 2024

samsp-msft commented Apr 10, 2024

antonfirsov commented Apr 11, 2024 • edited Loading

antonfirsov commented Apr 11, 2024 • edited Loading

davidfowl commented Apr 15, 2024

samsp-msft commented Apr 15, 2024

samsp-msft commented Apr 15, 2024

antonfirsov commented Apr 16, 2024 • edited Loading

martinjt commented Apr 17, 2024

samsp-msft commented Apr 17, 2024

martinjt commented Apr 17, 2024

antonfirsov commented Apr 18, 2024

davidfowl commented Jul 14, 2024

julealgon commented Jul 15, 2024

davidfowl commented Oct 22, 2023 •

edited by antonfirsov

Loading

MihaZupan commented Oct 23, 2023 •

edited

Loading

davidfowl commented Jan 18, 2024 •

edited

Loading

antonfirsov commented Apr 11, 2024 •

edited

Loading

antonfirsov commented Apr 11, 2024 •

edited

Loading

antonfirsov commented Apr 16, 2024 •

edited

Loading