Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define telemetry stability guarantees #1301

Closed
tedsuo opened this issue Dec 22, 2020 · 10 comments
Closed

Define telemetry stability guarantees #1301

tedsuo opened this issue Dec 22, 2020 · 10 comments
Assignees
Labels
area:semantic-conventions Related to semantic conventions priority:p2 Medium priority level release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:metrics Related to the specification/metrics directory spec:trace Related to the specification/trace directory

Comments

@tedsuo
Copy link
Contributor

tedsuo commented Dec 22, 2020

It is critical that OpenTelemetry produces telemetry which remains stable. Changes to telemetry produced by OpenTelemetry instrumentation should avoid breaking analysis tools, such as dashboards and alerts. However, it is not clear at this time what type of instrumentation changes (for example, adding additional spans and labels) would actually cause a breaking change.

Related questions:

  • What types of changes are known to definitely break the analysis tools which currently target OpenTelemetry?
  • Better is always different. If we must make a breaking change in order to offer better instrumentation, what should the procedure be?
  • How should we convey changes in telemetry to operators and application owners?
  • How should auto-installers handle cases where application owners do not wish to install the latest version of an instrumentation library, in order to manage breaking changes?

Until this issue is resolved, instrumentation packages must not be marked as stable. API and SDK packages may still be marked as stable. The lack of telemetry stability should be clearly communicated in the documentation for every OpenTelemetry client, to avoid confusion.

@andrewhsu andrewhsu added area:semantic-conventions Related to semantic conventions priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory spec:trace Related to the specification/trace directory labels Jan 5, 2021
@tigrannajaryan
Copy link
Member

tigrannajaryan commented Jan 5, 2021

What constitutes a breaking change

For metrics here is an incomplete list of changes that may be breaking for readers of the telemetry (e.g. for dashboards in the backends):

  • Renaming of the metric or of a metric label.
  • Elimination of a label (dimension) of a metric.
  • Introduction of a label to a metric.
  • Splitting of a metric into multiple metric across a label dimension.
  • Combining of multiple metrics into one metric with an additional label dimension.
  • Change of value data type, aggregation or temporality.
  • Change of the unit.
  • Change of the scale of the values.

For spans:

  • Change of the span name.
  • Renaming of an attribute.
  • Elimination of an attribute.
  • Change of the data type of the attribute value.
  • Possibly, addition of an attribute.
  • Change of an event name or change of an attribute as described above.

For logs:

  • Change of the log record name.
  • Renaming of an attribute.
  • Elimination of an attribute.
  • Change of the data type of the attribute value.
  • Possibly, addition of an attribute.
  • Change of the body (similar to attribute changes).

[EDIT] We will also need to define relevant changes for Resources.

For future discussions I suggest we call the shape and structure of all of the data that is referenced above the schema of telemetry.

How we handle the changes

Unlike API, I believe changes described above are more likely to occur during the lifelime of an instrumentation library. I do not think we should aim to lock the telemetry schema and disallow changes listed above. Such locking would place a huge limitation on how instrumentation can evolve and would make it nearly impossible to fix mistakes in the semantic conventions, in the schema or in the implementation of the instrumentation (which will inevitably happen sooner or later).

[EDIT] I extracted full proposal here #1324 to avoid derailing this issue.

In the context of this issue here is a shorter summary:

  • For 1.0 GA we do not provide stability guarantees for telemetry schema emitted by instrumentation.
  • We introduce telemetry schema concept which allows instrumentation to change emitted data structure and allows recipients to treat in a backwards compatible manner.
  • Telemetry schema will become available as a concept in one of 1.x releases. Until then OpenTelemetry users should be advised that breaking changes to instrumentation may happen (however, we will try to minimize such changes until the telemetry schemas are introduced).

@tedsuo tedsuo added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Jan 21, 2021
@tedsuo
Copy link
Contributor Author

tedsuo commented Jan 21, 2021

Thanks @tigrannajaryan. After more conversations I agree. We should issue a v1.0 without these data guarantees. We can't even begin to address this until everything else in tracing and metrics is complete. I've removed the required-for-ga label.

@andrewhsu andrewhsu added priority:p2 Medium priority level and removed priority:p1 Highest priority level labels Jan 22, 2021
@Oberon00
Copy link
Member

Should we give this priority for the next GA release? It might have slipped through so far because it has the spec:metrics label on it.

@tigrannajaryan
Copy link
Member

@Oberon00 I believe we should. I am going to work on #1324 in the next few weeks. I need to do some experimentation but I hope it will be the answer to this issue as well.

@tigrannajaryan
Copy link
Member

Assigning this to myself since I plan to work on a very related issue.

@tigrannajaryan
Copy link
Member

Related reference from k8s about metric stability: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metric-lifecycle

IMO, k8s is not fully solving the problem, but rather makes it manageable, creates clarity in the deprecation process and slows it down appropriately to give affected parties time to react. While certainly an improvement over just making breaking changes I think we need to go in a different, more promising direction: #1324

@NathanielRN
Copy link
Contributor

@tigrannajaryan Is there an issue to add schemas to the specification now that open-telemetry/oteps#152 is merged? Otherwise, what is the next step for it?

I see we already have this folder: https://github.com/open-telemetry/opentelemetry-specification/tree/main/schemas so does that mean the OTEP is adopted?

@tigrannajaryan
Copy link
Member

Yes, OTEP is adopted and relevant bits are added to the spec, e.g. here or here.

@tedsuo
Copy link
Contributor Author

tedsuo commented Jul 27, 2021

@tigrannajaryan in that case, can this issue be closed?

@tigrannajaryan
Copy link
Member

Yes, I think we can close this. We will create new issues for any follow up tasks/problems as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions priority:p2 Medium priority level release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:metrics Related to the specification/metrics directory spec:trace Related to the specification/trace directory
Projects
None yet
Development

No branches or pull requests

5 participants