Introduce Mandatory Unique Identifier For Telemetry Sources #194
3. This may introduce a breaking change, with `service.name` no longer being mandatory in that broad sense. This would need further investigation. Also, this approach might lead to additional sets of attributes which will be used by different telemetry sources for unique identification (devices, cronjobs, bots, ...)
4. This will introduce a breaking change because `service.name` will be replaced with `telemetry.source.name`. This could be mitigated by a fallback mechanism, e.g. if `telemetry.source.name` is not provided check `service.name`.
Related proposal for solving this: #161
As stated above, there are multiple approaches to obtain that common unique identifier. Depending on the approach, there are different ways to accomplish it:
1. Introduce `telemetry.sdk.instance_id` (or similar) and make it mandatory. Make `service.name` only mandatory for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`. Optionally, remove `service.instance_id` from `service.*`
One goal we should have here is that this is not some machine-generated ID, but a human-readable name that lets users simply filter telemetry generated for their "idea" of an observable unit. E.g. if I'm running a checkout service, this name should be used across ALL instances of components I'm using related to that checkout service. Similarly, if I'm running a "Coffee Rewards Mobile Application", this id should be the same across all rollouts and instances of that application I'm observing.
I want to make sure we don't lose that, and I think a name like `instance_id` doesn't convey what we are really asking users to provide.
I am also not a big fan of having only the machine-generated ID. In the end you want a combination of both: e.g. if you have 10 instances of your "Checkout Service" and one of them is in an error state, you want to identify it uniquely.
I think this tries to deviate from Otel's current philosophy of identification, which is:
- Telemetry sources are defined in semantic conventions by specifying a list of attributes that describe them.
- For every source the semantic conventions are specifically defined to say which attributes are used for identification purposes in a particular scope.
For example:
- we have Service which is identified by (service.namespace,service.name,service.instance.id) tuple globally.
- we have Kubernetes Node which is identified by (k8s.node.uid) within its cluster.
- we have Kubernetes Namespace which is identified by (k8s.namespace.name) within its cluster.
- we have OS Process which is identified by its (process.pid) within its host.
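As a rough illustration of this scoped identification, the sketch below maps each source kind to its identifying attributes and the scope in which they are unique. The `IDENTIFYING_ATTRIBUTES` table and the `identity` helper are hypothetical names invented for this example and are not part of any Otel SDK; only the attribute names come from the list above.

```python
# Hypothetical table: source kind -> (identifying attribute names, scope of uniqueness).
IDENTIFYING_ATTRIBUTES = {
    "service": (("service.namespace", "service.name", "service.instance.id"), "global"),
    "k8s.node": (("k8s.node.uid",), "cluster"),
    "k8s.namespace": (("k8s.namespace.name",), "cluster"),
    "process": (("process.pid",), "host"),
}


def identity(kind, attributes):
    """Return (scope, identifying values) for a source, or raise if one is missing."""
    names, scope = IDENTIFYING_ATTRIBUTES[kind]
    missing = [name for name in names if name not in attributes]
    if missing:
        raise ValueError(f"{kind} is missing identifying attributes: {missing}")
    return scope, tuple(attributes[name] for name in names)


# Non-identifying attributes are ignored when building the identity.
os_process = {"process.pid": 4242, "process.executable.name": "checkout"}
print(identity("process", os_process))  # ('host', (4242,))
```

Note that the result is only meaningful together with its scope: a `process.pid` says nothing unless the host is also known.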
From what I see this tries to introduce the concept of universal and globally unique ID for all telemetry sources and mandates one ID per source. I fail to see how this is possible at all. A couple problems I see:
- How do you guarantee global uniqueness? Are IDs randomly generated? Do we rely on lack of collisions of IDs because the generators are good and ID is wide enough to make collision probability negligible? If not randomly generated how do you ensure global uniqueness?
- Which of the associated entities is the source of a particular telemetry when you have a stack of technologies? For example if I emit CPU usage of an application using Otel GO SDK, running as an OS Process inside a Container on a Kubernetes Pod, what is my source? Is it the Application? Go SDK? OS Process? Container? Pod? I can attribute CPU usage to any of these equally well and even if I choose one I still likely want to record the fact that these 5 different kinds of sources are associated with that metric. Do we allow `telemetry.sdk.instance_id` to be an array of values?
--
While I generally agree that it is a good goal to make telemetry sources identifiable I fail to see how the premise of a single globally unique id per telemetry source can work.
I think the best we were able to do so far was to allow individual source types to solve the identification problem within their scope of operation and decide what sets of attributes they want to define in the form of semantic conventions and designate as their identifiers.
I would welcome a solution that is more uniform than the current approach but I do not see it in any of the proposed variations in this OTEP.
> From what I see this tries to introduce the concept of universal and globally unique ID for all telemetry sources and mandates one ID per source. I fail to see how this is possible at all. A couple problems I see:
I might need to change my wording here: the main argument is around unique identification for SDK-based telemetry sources (backend services, frontend services or anything coming in the future using an OTel SDK to emit telemetry).
> we have Service which is identified by (service.namespace,service.name,service.instance.id) tuple globally.
In open-telemetry/opentelemetry-specification#1034 @Oberon00 was arguing that `service.instance.id` is not well-defined and should be replaced with `telemetry.sdk.instance_id`:
> IMHO this attribute is poorly defined right now as it may or may not be the same across service restarts, which IMHO can make quite a difference. It would be easiest if it MUST be different for each restart, that way it could be used as primary key for all resources (not only service.*) sent by the same telemetry instance. On the other hand, maybe such an attribute would better be named telemetry.sdk.instance.id.
Applying this would make `service.namespace`, `service.name`, `telemetry.sdk.instance_id` the unique identifier.
> Which of the associated entities is the source of a particular telemetry when you have a stack of technologies? For example if I emit CPU usage of an application using Otel GO SDK, running as an OS Process inside a Container on a Kubernetes Pod, what is my source? Is it the Application? Go SDK? OS Process? Container? Pod?
From my point of view, the telemetry source is the emitter of the metric (e.g. the OTel GO SDK). This does not stop you from additionally associating the metric with the process, the container or the pod.
But if things go wrong and you get -5,000,000% CPU usage reported, you want to figure out who is emitting that metric and fix it.
> While I generally agree that it is a good goal to make telemetry sources identifiable I fail to see how the premise of a single globally unique id per telemetry source can work.
I think approach (1) here is not explained correctly, see above: this telemetry.sdk.id is for telemetry coming from an OTel SDK.
> I would welcome a solution that is more uniform than the current approach but I do not see it in any of the proposed variations in this OTEP.
You're right, this is not proposed in this OTEP. However, I am wondering if this is possible: you wrote that for every source the semantic conventions are specifically defined to say which attributes are used for identification purposes in a particular scope. If I understand this correctly, this would mean that there is always a group of attributes that could be merged (in the SDK, in the collector, in the backend) into a unique identifier (like ecommerce-checkout-)? (I am not suggesting that this should be mandated.)
Thanks. Many of my objections come from the fact that the OTEP appears to be talking about all telemetry sources. If this is about Otel SDKs then that's a different story. I think in fact it is very useful for each Otel SDK to have a unique runtime instance id and emit it. We have `telemetry.sdk.name`, `telemetry.sdk.version`, etc. I think in addition to that we also need a globally unique `telemetry.sdk.instance.id`
which can be autogenerated or supplied via an env var to the SDK. This will be a necessity if we want to add a remote configuration capability to the Otel SDKs, such that it uses the same management protocol as the Otel Collector.
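A minimal sketch of the "autogenerated or supplied via an env var" behaviour described here, assuming a random UUID is an acceptable generated ID. The variable name `OTEL_SDK_INSTANCE_ID` and the helper `resolve_sdk_instance_id` are hypothetical; no such variable is defined by the specification.

```python
import os
import uuid

# Hypothetical variable name, used only for this sketch.
INSTANCE_ID_ENV_VAR = "OTEL_SDK_INSTANCE_ID"


def resolve_sdk_instance_id(environ=None):
    """Use the operator-supplied ID if set, otherwise generate a random UUID."""
    environ = os.environ if environ is None else environ
    supplied = environ.get(INSTANCE_ID_ENV_VAR, "").strip()
    if supplied:
        return supplied
    # uuid4 is random: collisions are improbable, not impossible.
    return str(uuid.uuid4())


# Resolve once at SDK start and emit it alongside the other telemetry.sdk.* attributes.
resource_attributes = {"telemetry.sdk.instance.id": resolve_sdk_instance_id({})}
```

Generating the value once at start means it changes on every restart unless the operator pins it through the environment.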
> If I understand this correctly, this would mean that there is always a group of attributes that could be merged (in the SDK, in the collector, in the backend) into a unique identifier (like ecommerce-checkout-)
Yes.
Important: it is not always global uniqueness, for some sources it is merely uniqueness within a particular scope. I would love to have a stronger global uniqueness but that is more difficult to achieve and so far we refrained from making it a requirement. For example for Kubernetes Pods we describe which attributes uniquely identify it within a Kubernetes cluster.
We may not have been fully diligent in this, so some of the sources may lack this identification ingredient, but I believe this was the general sentiment for semantic conventions that describe telemetry sources.
## Open questions
* What approach provides the most benefit and the fewest breaking changes to the current specification?
Can you propose one approach within this OTEP and list the other approaches as Alternatives considered?
It'll be hard for folks to "approve" this without an approach chosen.
This is a great rundown of options, tradeoffs, and issues. I think if you pick the option you find best, you'll see people comment pros/cons and find consensus in comments anyway. If you don't take a position, you're unlikely to see that feedback.
Good point, let me take the one that most people hate, so they will bring their arguments.
Seriously: I'll find some time to rewrite the proposal in such a way! Thanks!
Having a way to uniquely identify a telemetry source is helpful in many ways, such as processing and storing data from that source, visualizing it in a backend UI, or debugging issues with that source and its data.
As of now `service.name` (and related attributes `service.namespace` and `service.instance_id`) are the implicit standard for that due to `service.name` being enforced as mandatory by the [Resource SDK specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#sdk-provided-resource-attributes) and [Resource Semantic Conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#semantic-attributes-with-sdk-provided-default-value).
Generally speaking, this is not true for OpenTelemetry as a whole. It is only true for telemetry emitted by Otel SDKs. There are other sources of telemetry which are not Otel SDKs. A good example is the Otel Collector. It emits telemetry on behalf of many interesting sources which are not services, for example Processes or K8s pods.
You're right. I have to update this.
## Future possibilities
While the discussion right now is between backend and frontend services, in the future additional telemetry sources like different kinds of devices could be introduced and run into a similar situation that `service` is not the appropriate term.
I think this fails to take into account already existing other types of sources which are neither backend services nor frontend services. For example: K8s nodes, k8s pods, OS processes, FaaS (Lambda). These do not necessarily fall clearly into the frontend or backend bucket (e.g. I can have an OS process both on the frontend and on the backend).
1. Introduce `telemetry.sdk.instance_id` (or similar) and make it mandatory. Make `service.name` only mandatory for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`. Optionally, remove `service.instance_id` from `service.*`
2. Introduce a broad definition of the term _Service_ in the glossary. Unique identification could be achieved by (1) or making `service.name`, `service.namespace`, `service.instance_id` mandatory for all telemetry sources.
I would like to better understand why this doesn't work. Is it merely a presentation issue in the backends/UIs? What prevents the frontend or client-side applications from emitting an additional attribute, and backends from looking for this attribute and presenting that particular Service in a different way?
The argument against this option is that frontend developers (and others) do not think of their applications as a "service", so the Client Telemetry SIG proposed `app.name` as an alternative to avoid confusing the end users of the SDK. This means an SDK for frontend applications (in Java, WebJS, Swift) would send no `service.name` but `app.name`, like in option (3).
@tigrannajaryan In a comment above, you mentioned that OTel's philosophy of identification is to specify a list of attributes that describe the source. One example of this is a Service that is identified by the set of service.* attributes.
We argue that client-side telemetry is different enough that it should be identified as separate from backend services. Therefore, we proposed introducing a different set of attributes (app.*) to identify client-side telemetry. I think that aligns with that principle, while using `service` attributes for both backend and client telemetry would not allow identifying one from the other.
The core issue perhaps is the definition of service. I would interpret it as a backend service, or a service within a private infrastructure, as opposed to running on client devices. I think there is an argument that it could have a bigger scope and include client apps. I think that this would be counterintuitive and possibly confusing to client-side developers. Also, there will be additional attributes coming from client-side resources that would not make sense in the service namespace (e.g. service.bundle).
> We argue that client-side telemetry is different enough that it should be identified as separate from backend services. Therefore, we proposed introducing a different set of attributes (app.*) to identify client-side telemetry. I think that aligns with that principle, while using service attributes for both backend and client telemetry would not allow identifying one from the other.
This sounds reasonable to me. From what I understand, the problem that prevents this from happening is that we mandated "service.name" to be always present (I missed the moment when that change was made in the spec and I think it was not the right decision). The rationale for this requirement appears to be that some backends require it. I think the solution shouldn't be that the SDKs also require "service.name". Perhaps instead the solution should be that backend-specific exporters set some default value for "service.name" if it is missing, purely as a means to satisfy the particular backends. Backend-specific exporters can also make more complicated decisions like using one of "service.name" or "app.name" depending on which one is set. This would make it possible again to put other sources, like client-side apps, on equal footing with the Service in the SDKs.
> The core issue perhaps is the definition of service. I would interpret it as a backend service, or a service within a private infrastructure, as opposed to running on client devices. I think there is an argument that it could have a bigger scope and include client apps. I think that this would be counterintuitive and possibly confusing to client-side developers. Also, there will be additional attributes coming from client-side resources that would not make sense in the service namespace (e.g. service.bundle).
I don't mind this, provided that we can clearly explain why the client-side apps need to be specified differently from Services. I would prefer that we make a reasonable effort and try to fit client-side apps into the definition of the Service, but if we find that it creates too much semantic mismatch in the naming of attributes and in the definitions of the concepts then I think client-side apps should be allowed to use their own set of attributes.
I updated the file with a lot of changes provided by everyone, thank you!
To address all requirements outlined in those approaches, we are proposing the following combined approach for uniquely identifying an SDK-based telemetry source:
* Introduce a `telemetry.sdk.source.id` attribute, which MUST either be autogenerated by the SDK at application start or be supplied via an environment variable to the SDK. This will be the unique identifier for an SDK-based telemetry source.
Why `telemetry.sdk.source.id` and not `telemetry.source.id`? How is "sdk" relevant to the description of the workload that is producing telemetry?
It's one of my open questions:
> Is the namespace `telemetry.sdk.source.*` suitable? Alternative names could be used: [...] `telemetry.source.*` as suggested by open-telemetry/opentelemetry-specification#2192. The difference is that [`telemetry.sdk.source`] does state explicitly that only SDK-based telemetry sources are covered. This is not necessarily bad, since other telemetry sources could decide to use it as well. [...]
I would prefer to drop the word "source" and have `telemetry.sdk.id` (or `telemetry.sdk.instance.id`) to identify the SDK. This will match other `telemetry.sdk.*` attributes. This also fits the requirements of possible future remote SDK management capabilities, where the SDK is actually the entity that is being managed.
To identify the Service we will (continue to) use `service.*` attributes; to identify other types of sources which are not Services we can introduce new sets of attributes (e.g. for client-side apps as was discussed earlier).
If we call this attribute `telemetry.source.id` it is not clear what it identifies. Does it identify the Service? If yes, then do we remove `service.instance.id`? If we remove it, then do we use a mix of `telemetry.source.id` + `service.*` attributes to identify the Service? This does not look very consistent to me.
Generally, for any particular kind of entity that we want to identify I think we should have a set of attributes `<kind>.*` that are defined in semantic conventions specifically for that particular kind. This has been the practice so far and `telemetry.source.id` would be a conceptual deviation from the practice which I don't see why we need.
@tigrannajaryan personally I am not completely sold on `telemetry.source.id` and I think it should be a different OTEP (i.e. assigning an opaque unique ID to a source vs. identifying the source via a collection of domain-specific attributes, like k8s.pod etc.) from the question of how to generalize `service.name` to different types of workloads.
I also find it more intuitive to have `service.name` and `app.name` vs. a single (and ambiguous) `telemetry.source.name`. The issue is how that affects existing backends like Jaeger where `service.name` is a core concept and `app.name` means nothing at all. Should the exporters to such backends translate `app.name` (OTEL) -> `service.name` (Jaeger)? This translation becomes easier / more consistent if we use a single `telemetry.source.name` instead of domain-specific attributes, but I agree that it diverges from most of the existing conventions.
> The issue is how that affects existing backends like Jaeger where `service.name` is a core concept and `app.name` means nothing at all. Should the exporters to such backends translate `app.name` (OTEL) -> `service.name` (Jaeger)?
I think this is the right approach. Each exporter should be responsible for specialized needs of the particular backend they target.
I don't think we should try to reduce Otel conceptually to the lowest common denominator of features supported by all backends we want to support. We should probably accept the fact that some backends may have different requirements and if a particular Otel capability is not exactly representable in a particular backend that's fine as long as there is a reasonable mapping. So the logic in Jaeger exporter can be for example:
    if otel service.name is defined:
        set jaeger service.name to otel service.name
    else if otel app.name is defined:
        set jaeger service.name to otel app.name
    else:
        set jaeger service.name to "unknown-service:" + otel process.executable.name // mimic current behavior
That's fine if this was only Jaeger, but from the comments in different threads it seems a lot of backends would prefer a single attribute. Maybe a fair question is: are there (how many) backends that would be completely fine with not having a single attribute identifying the source by low cardinality name?
That's a good question, we probably want vendors to chime in.
Just one data point: the AWS Xray exporter in Collector for example probes a series of different attributes (including but not limited to service.name) to arrive at a "name", see: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/af3d9b1dc9f1da2e4f71944a32e388368fb0f5e6/exporter/awsxrayexporter/internal/translator/segment.go#L131
I would expect each vendor to either have an equivalent logic that makes sense for their backend or perhaps we extract this as a common logic somewhere for all vendors to use.
Note that vendors likely need to be ready for this anyway since Collector doesn't necessarily set service.name for all data it collects/produces on its own.
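If that probing were extracted into shared logic, one possible shape is sketched below. This is an assumption-laden illustration, not the AWS X-Ray exporter's actual code: the helper `source_name` and the probe order in `NAME_ATTRIBUTES` are hypothetical, and the fallback simply mirrors the Jaeger pseudocode earlier in this thread.

```python
# Probe order is an assumption for illustration; each backend may choose its own.
NAME_ATTRIBUTES = ("service.name", "app.name")


def source_name(attributes, candidates=NAME_ATTRIBUTES):
    """Return the first non-empty candidate attribute as the source name."""
    for key in candidates:
        value = attributes.get(key)
        if value:
            return value
    # Fall back the way the Jaeger pseudocode does when nothing is set.
    executable = attributes.get("process.executable.name", "")
    return "unknown-service:" + executable


print(source_name({"app.name": "coffee-rewards"}))  # coffee-rewards
```

Passing `candidates` explicitly would let a vendor extend or reorder the list without forking the helper.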
> Just one data point: the AWS Xray exporter in Collector for example probes a series of different attributes (including but not limited to service.name) to arrive at a "name"
I think this is a very good example of why there should be clarity on how to identify a telemetry source, and why I opened my issues initially and now this OTEP:
As of now it is really hard to say "this telemetry source is called "?
> I would expect each vendor to either have an equivalent logic that makes sense for their backend or perhaps we extract this as a common logic somewhere for all vendors to use.
That common logic is what I am looking for. My initial assumption was that `service.namespace`, `service.name` and `service.instance.id` are doing the job, but as we know now this has the following issues:
- not everybody thinks of their workload as a "service"
- it's not clear how `service.instance.id` is filled (see Guidance for filling (or consider removing?) service.instance.id opentelemetry-specification#1034)
The intent of `telemetry.(sdk?).source.(id|name|namespace)` was to avoid the kind of lengthy checks we see in the AWS Xray exporter (@martinkuba brought up this argument here as well: open-telemetry/opentelemetry-specification#2192)
An alternative that was suggested at some point is a `<whatever>.type` field which is then `app` or `service`
or ... and can help the implementation to jump to the correct attributes immediately.
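A small sketch of how such a type field could replace probing with a direct lookup. The attribute name `telemetry.source.type` is a hypothetical stand-in for `<whatever>.type`, and `NAME_KEY_BY_TYPE` and `name_for` are invented for this example.

```python
# Hypothetical attribute name standing in for "<whatever>.type".
TYPE_ATTRIBUTE = "telemetry.source.type"

# For each declared type, the attribute that carries the source's name.
NAME_KEY_BY_TYPE = {"service": "service.name", "app": "app.name"}


def name_for(attributes):
    """Look up the name attribute directly from the declared source type."""
    source_type = attributes.get(TYPE_ATTRIBUTE)
    key = NAME_KEY_BY_TYPE.get(source_type)
    if key is None:
        return None  # unknown or missing type; the caller decides on a fallback
    return attributes.get(key)
```

The trade-off is that every emitter has to set the type field consistently, whereas probing works on existing data unchanged.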
I feel like this discussion is really talking about 2 separate things, and we should probably separate them into 2 separate oteps.
If this otep is only concerned with 1), we should create a separate OTEP to cover how to define 2), and provide backends ways to distinguish between different types of source, because the analysis of telemetry from user/client data sources (RUM) will, in many cases, be very different than the analysis of backend service telemetry.
When I created this OTEP I had the feeling that those 2 things are intertwined, but after the arguments everyone brought up, I agree that while they are related, they each require their own OTEP. I can rewrite the proposal doc to only be concerned with (1). @jkwatson: I am also happy to help create one for (2).
This OTEP aims to introduce a mandatory unique identifier for telemetry sources, which has implicitly been `service.name` until now and has led to multiple discussions: open-telemetry/opentelemetry-specification#2111, open-telemetry/opentelemetry-specification#2115, open-telemetry/opentelemetry-specification#2192, open-telemetry/opentelemetry-specification#1034
It's important to give end users certainty about how to identify their telemetry sources, and to make this explicit for future spec changes.
cc: @jkwatson, @tigrannajaryan, @Oberon00, @martinkuba, @yurishkuro, @jsuereth, @jonatan-ivanov, @carlosalberto