Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding activities for potentially long running network operations #93832

Closed
davidfowl opened this issue Oct 22, 2023 · 26 comments · Fixed by #103922
Closed

Consider adding activities for potentially long running network operations #93832

davidfowl opened this issue Oct 22, 2023 · 26 comments · Fixed by #103922
Assignees
Labels
area-System.Net enhancement Product code improvement that does NOT require public API changes/additions in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@davidfowl
Copy link
Member

davidfowl commented Oct 22, 2023

Edit by @antonfirsov: The plan is to implement this feature request by introducing a separate Activity for each pooled connection, and link those to the HTTP Request Activity via ActivityLink. See #93832 (comment).


While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.

This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

cc @noahfalk @samsp-msft

@ghost
Copy link

ghost commented Oct 22, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.

This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

Author: davidfowl
Assignees: -
Labels:

area-System.Net

Milestone: -

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 22, 2023
@MihaZupan
Copy link
Member

MihaZupan commented Oct 23, 2023

I think instrumenting DNS/Sockets with more information would be fine, but I want to point out that it could become misleading / not match user expectations. Related: #63159 (comment)

You will rarely see a "perfect" trace like this

◄────────────────────HTTP Request────────────────────►
  ◄──DNS──►
           ◄──Socket──►
                       ◄──TLS──►
                                  ◄──Dependency──►

When a request comes in it may kick off a new connection attempt, but then get picked up by a different connection that became available before the new connection is established.
As a result, you could see traces like

◄─────────HTTP Request───────────►
    ◄──────DNS─────►
      ◄──Backend dependency─►
                    ◄────Socket────►
                                    ◄─────TLS────►

Or even

◄─────────HTTP Request───────────►
      ◄──Backend dependency─►
                                       ◄──────DNS─────►
                                                       ◄────Socket────► ...

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

◄────────────────────────HTTP Request───────────────────────────►
  ◄──Waiting for an available connection──►
                                             ◄──Dependency──►

That highlights the issue that the HTTP request is decoupled from the establishment of new connections.

@davidfowl
Copy link
Member Author

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

This seems ideal for the HTTP request case. It seems then it's possible that we'd also want to disable the dns and socket activities for the HTTP case? I'd be interested to see if we can get a similar trace for connection establishment in other clients and see if a similar problem exists (presumably it does if there's smart connection pooling going on).

@samsp-msft
Copy link
Member

This feels like something that should be done with Events on activities. Putting too many extra activities for this sub-operations will be expensive and noisy.
I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data. That could be extended to be somewhat similar to ILogger, with methods to add event data that will be conditional on the requested verbosity level.
The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

@davidfowl
Copy link
Member Author

Putting too many extra activities for this sub-operations will be expensive and noisy.

I agree in general but not for expensive networking operations.

The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

@noahfalk noahfalk added the enhancement Product code improvement that does NOT require public API changes/additions label Oct 25, 2023
@noahfalk
Copy link
Member

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

I think Sam is proposing that you visualize this list of span events that is included in the OTLP proto:

https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L232

I assume when you look at it right now it would be empty, but if you want to experiment with putting some data in there you can use the Activity.AddEvent API.

I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data

This sounds an awful lot like Activity.IsAllDataRequested property (doc guidance). In general Activity events aren't designed to scale to a high verbosity level but a handful of messages with key timings should work fine.

@davidfowl
Copy link
Member Author

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:

Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));

Is that the idea?

@antonfirsov
Copy link
Member

@davidfowl is there a specific customer ask behind this proposal (eg, from Aspire)?

@davidfowl
Copy link
Member Author

We were debugging connection slowness with aspire and more details would have helped speed that up.

@samsp-msft
Copy link
Member

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:

Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));

Is that the idea?

I would suggest that for the events you don't want begin/end items, just the ones that have been achieved, so you'd get something like:

  • DNS address resolved
  • Socket opened
  • TLS Negotiated
  • Request sent

If any of them are skipped it doesn't so much matter. But if there is a DNS item in there you can see how that relates to the overall time for the span.

@davidfowl
Copy link
Member Author

davidfowl commented Jan 18, 2024

I like the idea of using events and also doing it in HttpClient so that we have a better view of the http connection time (based on #93832 (comment))

@antonfirsov
Copy link
Member

Triage: this doesn't look critical compared to other asks, tentatively pushing to future. @samsp-msft @davidfowl if you have concerns with that, let's sync!

@antonfirsov antonfirsov removed the untriaged New issue has not been triaged by the area owner label Apr 9, 2024
@antonfirsov antonfirsov added this to the Future milestone Apr 9, 2024
@davidfowl
Copy link
Member Author

It's not critical no, but it seems cheap to add. Do you have an idea of the cost?

@samsp-msft
Copy link
Member

This should be a relatively low cost way to instrument http client in a way that can really help developers understand what is going on in the http stack in a complex production environment. Without this kind of telemetry, they don't know if it's DNS that is taking time, if a new connection is needed etc.
A handful of events that can act as markers/signposts in the trace will go a long way to resolving that issue.

@antonfirsov
Copy link
Member

antonfirsov commented Apr 11, 2024

I fail to understand how does switching from sub-activities to events answer the concerns from #93832 (comment). If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

◄────────HTTP Request───────►
      ▲ ◄────Backend──────►
   ◄──|────────────────────────────────Connection A ────────────────────────►
      |                         ▲                  ▲                      ▲
      |                         ╚⭍ DNS resolved    ╚⭍ socket connected    ╚⭍ TLS done
      |
      ╚ served by connection B 

Is this really what we want? Or am I'm misunderstanding something?

Note: the Waiting for an available connection activity proposed at the end of #93832 (comment) has no knowledge of DNS/socket/TLS. It's about the request sitting in the queue until a connection becomes available.

@antonfirsov
Copy link
Member

antonfirsov commented Apr 11, 2024

they don't know if it's DNS that is taking time

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add more metrics instead?

@davidfowl
Copy link
Member Author

I fail to understand how does switching from sub-activities to events answer the concerns from #93832 (comment). If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

I thought we agreed that doing it at the http connection layer makes that problem go away. We would step back from trying to do it at the dns layer and focus on http only.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add #88383 instead?

I don't why it would be better. If I had to pick one though I would pick this one.

@samsp-msft
Copy link
Member

For the purpose of this - we shouldn't be tracking connections as activities as they are not specific to a single request - they are potentially long running, and for the most case outside the control of the user.
The get requests from the client are what the developer has control over, and so wants to understand what is happening. That's why I am suggesting putting events on that activity that correspond to how that request was processed. If it needs to establish a connection, and we can do the association between that connection and the request, then lets tag the request with the events that did occur with processing the request. If it turns out that dns was slow, or establishing the connection was slow, then those will show as late markers on the activity. It it takes time to get a connection from the pool, that will also show. The beauty of firing events is that they are optional - it doesn't have to try and line up parent/child relationships for them.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add #88383 instead?

Metrics tell you the aggregate story, but not what happened for an individual request.

image

In the above image, the bottom blue bar is the http client activity for making a request to the API service. The yellow bar is the API service handling that request. There is a 129ms gap between them - currently the developer has no idea what that is due to. The goal of this should be to supply events so the beginning of the outgoing trace includes the info as to why.

@samsp-msft
Copy link
Member

I talked with @antonfirsov and have a bit more of an idea of the problem. Because of the connection pool the requests and connections are pretty independent, its only when the request is sent on an established connection that the relationship is created, and then its only for the duration of that request.
One possibility that I am warming up to is to create separate activities/spans for the connections, and then use links between the activities/spands(rather than a parent child) to establish the relationships. The connection could have events for each of its stages of operation, including DNS, TLS handshake etc. Some of those events could be for specific requests, and then tag the activity for the request as part of the event. Similarly the activity for the request could be given back the id for the activity/span for the connection, which it would add as link.
The connection would then not have a parent/child relationship with any of the requests, but would have events that link to the spans.
The request would have a link to the connection, and then events on that connection would show when that request was handled.

@antonfirsov
Copy link
Member

antonfirsov commented Apr 16, 2024

Looks like this should be feasible with ActivityLinks, however the best way to implement it would be by utilizing #97680:

  • Without Activity.AddLink(link) we need to know the connection ActivityContext in advance which forces us to create an additional activity request being processed by connection when a connection becomes available. Such an activity would bring overhead with little or no practical value.
  • @samsp-msft 's Consider adding activities for potentially long running network operations #93832 (comment) proposes to link the connection being assigned to the request. This is an unambiguous link between a request and connection, however there would be no links to failing connections since they are never dispatched to requests. This feels like loss of important diagnostic data to me. To overcome this, we may consider also linking the connection an incoming request spans (if any). If a connection fails it would be visible on its originating request telemetry data, however note that the failure of a connection doesn't necessarily imply that the originating request failed. If we decide to add this link as well, we definitely need Support adding links to Activity after it is created #97680.

@martinjt
Copy link

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

This will ensure that a user from the outside will be able to opt-in to create those Activity object and not have to do post-processing in the collector, or in a custom sampler.

For events, there is an issue around the size of the individual Activity/Span if it get really noisy (from a backend perspective).

I suppose overall, the ask is, can we make this opt-in vs opt-out, and for links the best way to do this is a separate ActivitySource.

@samsp-msft
Copy link
Member

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

That makes sense. Use something like System.Net.Http.Connections for the connection ActivitySource. If there is no connection Activity, it won't get linked on the request, and vice versa, the connection won't link back for the request.

@martinjt
Copy link

That's the idea, but it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check I would imagine.

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created. You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

@antonfirsov
Copy link
Member

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The activity is being created and started in DiagnosticHandler before forwarding the request to the innermost platform handler. We have no knowledge of the concept of connections at that point.

Activity? activity = CreateActivity(request);
// Start activity anyway if it was created.
if (activity is not null)
{
activity.Start();

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

There would be a root Activity for every connection which can be linked to the requests using that connection when using SocketsHttpHandler connection tracing enabled. DNS+TCP+TLS diagnostic info would be added either by spinning up sub-activities from the connection activity or by emitting ActivityEvents to the connection activity.

it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check

I have yet to understand and figure out what's best here, but I think the conditions for creating a connection activity would be similar to the conditions for creating the request activity just with other ActivitySource & DiagnosticListener:

if (s_activitySource.HasListeners())
{
activity = s_activitySource.CreateActivity(DiagnosticsHandlerLoggingStrings.ActivityName, ActivityKind.Client);
}
if (activity is null)
{
if (Activity.Current is not null || s_diagnosticListener.IsEnabled(DiagnosticsHandlerLoggingStrings.ActivityName, requestMessage))
{
activity = new Activity(DiagnosticsHandlerLoggingStrings.ActivityName);
}
}

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created.

This is an existing problem with connection metrics already. I think we have no other choice but to accept it, I don't se a solution with the existing distributed tracing APIs.

You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

I'm not sure if connection tracing costs would be significant compared to request tracing costs under high load, we need to benchmark to see. It will be an opt-in feature anyways.

@davidfowl
Copy link
Member Author

🥳

@julealgon
Copy link

Does this need any extra work from consumers to be emitted? Is a new property being added to the OpenTelemetry Instrumentation packages to add these links, or will they just start showing up automatically?

Asking as I'm interested in getting this into our traces when it becomes available.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net enhancement Product code improvement that does NOT require public API changes/additions in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
7 participants