Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric Mismatch between Readme and Actual Behaviour #18743

Closed
forestsword opened this issue Feb 17, 2023 · 9 comments
Closed

Metric Mismatch between Readme and Actual Behaviour #18743

forestsword opened this issue Feb 17, 2023 · 9 comments
Labels
bug Something isn't working processor/servicegraph Service graph processor

Comments

@forestsword
Copy link

Component(s)

processor/servicegraph

What happened?

Description

In the metrics description of the README, there should be two histograms, traces_service_graph_request_server_seconds and traces_service_graph_request_client_seconds. Instead there is only one called traces_service_graph_request_duration_seconds:

reqDurationSecondsSum map[string]float64
reqDurationSecondsCount map[string]uint64
reqDurationBounds []float64
reqDurationSecondsBucketCounts map[string][]uint64

It looks like the documented metrics are a holdover from the tempo servicegraph generator which does create those metrics. Tempo requires the ones documented: https://github.com/grafana/tempo/blob/8901bf5e09c316e78343ab536d6c3c704bf3de64/modules/generator/processor/servicegraphs/servicegraphs.go#L48-L49

I'm able to work-around this by, perhaps incorrectly, renaming the duration metric to the server metric which tempo expects.

I'm not sure what the plans are but I would expect to be able to use this with the tempo grafana datasource as is, since it's called out as supporting this by default. Will this be corrected into the future? At the very least it would be helpful to call out the differences or what one needs to do to have this work correctly in tempo.

Thanks!

Steps to Reproduce

Run as documented.

Expected Result

To have both metrics: traces_service_graph_request_server_seconds traces_service_graph_request_client_seconds

Actual Result

It produces traces_service_graph_request_duration_seconds

Collector version

v0.71.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@forestsword forestsword added bug Something isn't working needs triage New item requiring triage labels Feb 17, 2023
@github-actions github-actions bot added the processor/servicegraph Service graph processor label Feb 17, 2023
@github-actions
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Copy link
Member

@mapno, @kovrus, would you like to handle this one?

@kovrus
Copy link
Member

kovrus commented Mar 2, 2023

Looking at code https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/servicegraphprocessor/processor.go#L217-L255 traces_service_graph_request_duration_seconds used to observe measurements for both server and client spans, but now the attributes are prefixed with client_ or server_. It should probably be documented.

@mapno @jpkrohling would it make sense instead of prefixing all attributes to have an attribute, e.g. kind that can be either server or client? If yes, maybe this breaking change can be done in the servicegraph connector, since it is not yet used.

@jpkrohling
Copy link
Member

I would attempt to keep the connector compatible with the processor counterpart so that people can move from the processor to the connector, and back if they need (like, for performance reasons).

@mapno
Copy link
Contributor

mapno commented Mar 3, 2023

traces_service_graph_request_duration_seconds used to observe measurements for both server and client spans, but now the attributes are prefixed with client_ or server_. It should probably be documented.

The processor measures both client and server spans. The prefix now is made to differentiate labels when both spans have the same label, so one doesn't overwrite the other. I agree that it should documented if it's not, but it's not related to this issue.

maybe this breaking change can be done in the servicegraph connector

Connector and processor are basically the same codebase: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/servicegraphconnector/factory.go#L17-L19

The idea of having traces_service_graph_request_server_seconds and traces_service_graph_request_client_seconds is measuring latency as seen from the client and the server. Honestly, I don't know why I merged it into one in here, but staid separate in Tempo. Most likely an oversight.

IMO, we should split traces_service_graph_request_duration_seconds into two and measure latency from both sides. We can keep emitting traces_service_graph_request_duration_seconds for a while to make it backwards compatible. Maybe under a flag to reduce the number of active series.

@kovrus
Copy link
Member

kovrus commented Mar 6, 2023

IMO, we should split traces_service_graph_request_duration_seconds into two and measure latency from both sides. We can keep emitting traces_service_graph_request_duration_seconds for a while to make it backwards compatible. Maybe under a flag to reduce the number of active series.

Alternatively, we can keep traces_service_graph_request_duration_seconds but introduce an additional attribute/dimension that would take the server or client value and get rid of those prefixes.

@github-actions
Copy link
Contributor

github-actions bot commented May 8, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 8, 2023
@kovrus kovrus removed the Stale label May 15, 2023
@jpkrohling
Copy link
Member

I believe this has been addressed, right @mapno?

@mapno
Copy link
Contributor

mapno commented Jun 27, 2023

Yes! Addressed in #21098. This one should be closed as well #16578

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working processor/servicegraph Service graph processor
Projects
None yet
Development

No branches or pull requests

5 participants