diff --git a/specification/protocol/otlp.md b/specification/protocol/otlp.md index 5eb25f9f8c2..2c5828cca50 100644 --- a/specification/protocol/otlp.md +++ b/specification/protocol/otlp.md @@ -2,7 +2,7 @@ **Status**: [Mixed](../document-status.md) -OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, +The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends. @@ -50,7 +50,7 @@ nodes such as collectors and telemetry backends. OTLP is a general-purpose telemetry data delivery protocol designed in the scope -of OpenTelemetry project. +of the OpenTelemetry project. ## Signals Maturity Level @@ -73,20 +73,20 @@ This specification defines how OTLP is implemented over [Protocol Buffers schema](https://developers.google.com/protocol-buffers/docs/overview) that is used for the payloads. -OTLP is a request/response style protocols: the clients send requests, the -server replies with corresponding responses. This document defines one requests +OTLP is a request/response style protocol: the clients send requests, and the +server replies with corresponding responses. This document defines one request and response type: `Export`. All server components MUST support the following transport compression options: -* No compression, denotated by `none`. +* No compression, denoted by `none`. * Gzip compression, denoted by `gzip`. 
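As a rough illustration of the two transport compression options listed above, here is a minimal Go sketch using only the standard library. The function names `compress` and `decompress` are hypothetical and not part of the specification; only the option names `none` and `gzip` come from the text.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress encodes payload with the named option ("none" or "gzip").
// Hypothetical helper for illustration only.
func compress(option string, payload []byte) ([]byte, error) {
	switch option {
	case "none":
		return payload, nil
	case "gzip":
		var buf bytes.Buffer
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write(payload); err != nil {
			return nil, err
		}
		// Close flushes the remaining compressed data and the gzip footer.
		if err := zw.Close(); err != nil {
			return nil, err
		}
		return buf.Bytes(), nil
	default:
		return nil, fmt.Errorf("unsupported compression option %q", option)
	}
}

// decompress reverses compress for the same option name.
func decompress(option string, data []byte) ([]byte, error) {
	switch option {
	case "none":
		return data, nil
	case "gzip":
		zr, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		return io.ReadAll(zr)
	default:
		return nil, fmt.Errorf("unsupported compression option %q", option)
	}
}

func main() {
	for _, opt := range []string{"none", "gzip"} {
		enc, err := compress(opt, []byte("example payload"))
		if err != nil {
			panic(err)
		}
		dec, err := decompress(opt, enc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %q\n", opt, dec)
	}
}
```

A server that follows the requirement above would accept both option names; any other name is rejected here only because this sketch does not implement it.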
### OTLP/gRPC **Status**: [Stable](../document-status.md) -After establishing the underlying gRPC transport the client starts sending +After establishing the underlying gRPC transport, the client starts sending telemetry data using unary requests using [Export*ServiceRequest](https://github.com/open-telemetry/opentelemetry-proto) messages ([ExportLogsServiceRequest](https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/collector/logs/v1/logs_service.proto) for logs, @@ -97,8 +97,8 @@ to receive a response to each request: ![Request-Response](img/otlp-request-response.png) -_Note: this protocol is concerned with reliability of delivery between one pair -of client/server nodes and aims to ensure that no data is lost in-transit +_Note: this protocol is concerned with the reliability of delivery between one +pair of client/server nodes and aims to ensure that no data is lost in transit between the client and the server. Many telemetry collection systems have intermediary nodes that the data must travel across until reaching the final destination (e.g. application -> agent -> collector -> backend). End-to-end @@ -116,8 +116,8 @@ that is not yet acknowledged by the server. ![Unary](img/otlp-sequential.png) Sequential operation is recommended when simplicity of implementation is -desirable and when the client and the server are connected via very low-latency -network, such as for example when the client is an instrumented application and +desirable, and when the client and the server are connected via very low-latency +network, such as when the client is an instrumented application and the server is an OpenTelemetry Collector running as a local daemon (agent). The implementations that need to achieve high throughput SHOULD support @@ -132,18 +132,18 @@ The number of concurrent requests SHOULD be configurable. The maximum achievable throughput is `max_concurrent_requests * max_request_size / (network_latency + server_response_time)`. 
-For example if the request can contain at most 100 spans, network roundtrip -latency is 200ms and server response time is 300 ms, then the maximum achievable +For example, if the request can contain at most 100 spans, network roundtrip +latency is 200ms, and server response time is 300 ms, then the maximum achievable throughput with one concurrent request is `100 spans / (200ms+300ms)` or 200 spans per second. It is easy to see that in high latency networks or when the -server response time is high to achieve good throughput the requests need to be +server response time is high to achieve good throughput, the requests need to be very big or a lot concurrent requests must be done. If the client is shutting down (e.g. when the containing process wants to exit) the client will optionally wait until all pending acknowledgements are received -or until an implementation specific timeout expires. This ensures reliable +or until an implementation-specific timeout expires. This ensures the reliable delivery of telemetry data. The client implementation SHOULD expose an option to -turn on and off the waiting during shutdown. +turn on and off the waiting during a shutdown. If the client is unable to deliver a certain request (e.g. a timer expired while waiting for acknowledgements) the client SHOULD record the fact that the data @@ -189,13 +189,13 @@ the number of spans/data points/log records it rejected. The server SHOULD populate the `error_message` field with a human-readable error message in English. The message should explain why the -server rejected parts of the data, and might offer guidance on how users +server rejected parts of the data and might offer guidance on how users can address the issues. The protocol does not attempt to define the structure of the error message. -Servers MAY also make use of the `partial_success` field to convey -warnings/suggestions to clients even when the request was fully accepted. 
-In such cases, the `rejected_` field MUST have a value of `0` and +Servers MAY also use the `partial_success` field to convey +warnings/suggestions to clients even when the server fully accepts the request. +In such cases, the `rejected_` field MUST have a value of `0`, and the `error_message` field MUST be non-empty. The client MUST NOT retry the request when it receives a partial success @@ -203,17 +203,19 @@ response where the `partial_success` is populated. ##### Failures -When an error is returned by the server it falls into 2 broad categories: +When the server returns an error, it falls into one of two broad categories: retryable and not-retryable: -- Retryable errors indicate that processing of telemetry data failed and the - client SHOULD record the error and may retry exporting the same data. This can - happen when the server is temporarily unable to process the data. +- Retryable errors indicate that telemetry data processing failed, and the + client SHOULD record the error and may retry exporting the same data. + For example, this can happen when the server is temporarily unable to + process the data. -- Not-retryable errors indicate that processing of telemetry data failed and the - client MUST NOT retry sending the same telemetry data. The telemetry data MUST - be dropped. This can happen, for example, when the request contains bad data - and cannot be deserialized or otherwise processed by the server. The client +- Not-retryable errors indicate that telemetry data processing failed, and the + client MUST NOT retry sending the same telemetry data. The client MUST drop + the telemetry data. + For example, this can happen when the request contains bad data + and cannot be deserialized or processed by the server. The client SHOULD maintain a counter of such dropped data. The server MUST indicate retryable errors using code @@ -235,17 +237,17 @@ containing 0 value of RetryDelay. 
Here is a sample Go code to illustrate: return st.Err() ``` -To indicate not-retryable errors the server is recommended to use code +To indicate not-retryable errors, the server is recommended to use code [InvalidArgument](https://godoc.org/google.golang.org/grpc/codes) and MAY supply additional [details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) using [BadRequest](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L119). -Other gRPC status code may be used if it is more appropriate. Here is a sample -Go code to illustrate: +If more appropriate, another gRPC status code may be used. Here is a +snippet of sample Go code to illustrate: ```go - // Do this on server side. + // Do this on the server side. st, err := status.New(codes.InvalidArgument, "Invalid Argument"). WithDetails(&errdetails.BadRequest{}) if err != nil { @@ -283,14 +285,18 @@ When retrying, the client SHOULD implement an exponential backoff strategy. An exception to this is the Throttling case explained below, which provides explicit instructions about retrying interval. -The client SHOULD interpret `RESOURCE_EXHAUSTED` code as retryable only if the server signals that the recovery from resource exhaustion is possible. This is signalled by the server by returning [a status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) -containing -[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40). In this case the behavior of the server and the client is exactly as described in [OTLP/gRPC Throttling](#otlpgrpc-throttling) section. -If no such status is returned then the `RESOURCE_EXHAUSTED` code SHOULD be treated as non-retryable. +The client SHOULD interpret `RESOURCE_EXHAUSTED` code as retryable only if the +server signals that the recovery from resource exhaustion is possible. 
+The server signals this by returning +[a status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) containing +[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40). +In this case, the behavior of the server and the client is exactly as described in the +[OTLP/gRPC Throttling](#otlpgrpc-throttling) section. If no such status is returned, +then the `RESOURCE_EXHAUSTED` code SHOULD be treated as non-retryable. #### OTLP/gRPC Throttling -OTLP allows backpressure signalling. +OTLP allows backpressure signaling. If the server is unable to keep up with the pace of data it receives from the client then it SHOULD signal that fact to the client. The client MUST then @@ -302,10 +308,10 @@ and MAY supply additional [details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) using [RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40). -Here is a sample Go code to illustrate: +Here is a snippet of sample Go code to illustrate: ```go - // Do this on server side. + // Do this on the server side. st, err := status.New(codes.Unavailable, "Server is unavailable"). WithDetails(&errdetails.RetryInfo{RetryDelay: &duration.Duration{Seconds: 30}}) if err != nil { @@ -316,7 +322,7 @@ Here is a sample Go code to illustrate: ... - // Do this on client side. + // Do this on the client side. 
st := status.Convert(err) for _, detail := range st.Details() { switch t := detail.(type) { @@ -328,13 +334,13 @@ Here is a sample Go code to illustrate: } ``` -When the client receives this signal it SHOULD follow the recommendations +When the client receives this signal, it SHOULD follow the recommendations outlined in documentation for [RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40): ``` // Describes when the clients can retry a failed request. Clients could ignore -// the recommendation here or retry when this information is missing from error +// the recommendation here or retry when this information is missing from the error // responses. // // It's always recommended that clients should use exponential backoff when @@ -342,16 +348,16 @@ outlined in documentation for // // Clients should wait until `retry_delay` amount of time has passed since // receiving the error response before retrying. If retrying requests also -// fail, clients should use an exponential backoff scheme to gradually increase -// the delay between retries based on `retry_delay`, until either a maximum -// number of retires have been reached or a maximum retry delay cap has been +// fail, clients should use an exponential backoff scheme to increase gradually +// the delay between retries based on `retry_delay` until either a maximum +// number of retries has been reached, or a maximum retry delay cap has been // reached. ``` The value of `retry_delay` is determined by the server and is implementation dependant. The server SHOULD choose a `retry_delay` value that is big enough to -give the server time to recover, yet is not too big to cause the client to drop -data while it is throttled. +give the server time to recover yet is not too big to cause the client to drop +data while being throttled. #### OTLP/gRPC Service and Protobuf Definitions @@ -390,8 +396,8 @@ connection cannot be established. 
Binary Protobuf encoded payloads use proto3 [encoding standard](https://developers.google.com/protocol-buffers/docs/encoding). -The client and the server MUST set "Content-Type: application/x-protobuf" request and -response headers when sending binary Protobuf encoded payload. +The client and the server MUST set "Content-Type: application/x-protobuf" +request and response headers when sending binary Protobuf encoded payload. #### JSON Protobuf Encoding @@ -403,30 +409,29 @@ for mapping between Protobuf and JSON, with the following deviations from that m - The `trace_id` and `span_id` byte arrays are represented as [case-insensitive hex-encoded strings](https://tools.ietf.org/html/rfc4648#section-8); - they are not base64-encoded like as it is defined in the standard + they are not base64-encoded as is defined in the standard [Protobuf JSON Mapping](https://developers.google.com/protocol-buffers/docs/proto3#json). - The hex encoding is used for `trace_id` and `span_id` fields in all OTLP - Protobuf messages, e.g. the `Span`, `Link`, `LogRecord`, etc. messages. + Hex encoding is used for `trace_id` and `span_id` fields in all OTLP + Protobuf messages, e.g., the `Span`, `Link`, `LogRecord`, etc. messages. For example, the `trace_id` field in a Span can be represented like this: { "trace_id": "5B8EFFF798038103D269B633813FC60C", ... } - Values of enum fields MUST be encoded as integer values. Unlike the standard [Protobuf JSON Mapping](https://developers.google.com/protocol-buffers/docs/proto3#json), - which allows values of enum fields to be encoded as either integer values or as enum - name strings, only integer enum values are allowed in OTLP JSON Protobuf Encoding; - the enum name strings MUST NOT be used. - For example, the `kind` field with a value of SPAN_KIND_SERVER in a Span can be - represented like this: - { "kind": 2, ... 
} - -- OTLP/JSON receivers MUST ignore message fields with unknown names and MUST unmarshal the - message as if the unknown field was not present in the payload. - This aligns with the behavior of the Binary Protobuf unmarshaler and ensures that adding - new fields to OTLP messages does not break existing receivers. + which allows values of enum fields to be encoded as either integer values or + as enum name strings, only integer enum values are allowed in OTLP JSON + Protobuf Encoding; the enum name strings MUST NOT be used. For example, the + `kind` field with a value of SPAN_KIND_SERVER in a Span can be represented + like this: { "kind": 2, ... } + +- OTLP/JSON receivers MUST ignore message fields with unknown names and MUST + unmarshal the message as if the unknown field was not present in the payload. + This aligns with the behavior of the Binary Protobuf unmarshaler and ensures + that adding new fields to OTLP messages does not break existing receivers. - The keys of JSON objects are field names converted to lowerCamelCase. Original - field names are not valid to use a keys of JSON objects. - For example this is a valid JSON representation of a Resource: + field names are not valid to use as keys for JSON objects. + For example, this is a valid JSON representation of a Resource: `{ "attributes": {...}, "droppedAttributesCount": 123 }`, and this is NOT a valid representation: `{ "attributes": {...}, "dropped_attributes_count": 123 }`. @@ -465,8 +470,8 @@ sides. #### OTLP/HTTP Response -The response body MUST be the appropriate serialized Protobuf message (see below for -the specific message to use in the [Full Success](#full-success-1), +The response body MUST be the appropriate serialized Protobuf message (see +below for the specific message to use in the [Full Success](#full-success-1), [Partial Success](#partial-success-1) and [Failure](#failures-1) cases). 
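As a rough illustration of the JSON Protobuf Encoding rules listed earlier (hex-encoded IDs, integer enum values, lowerCamelCase keys, unknown fields ignored), here is a minimal Go sketch. The `spanJSON` type and the `decodeSpan` function are hypothetical and cover only two fields; they are not generated from the OTLP Protobuf definitions. The key `traceId` assumes that the lowerCamelCase rule applies to the `trace_id` field.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// spanJSON is a hypothetical, two-field view of a Span in OTLP/JSON.
type spanJSON struct {
	TraceID string `json:"traceId"` // hex string, not base64
	Kind    int    `json:"kind"`    // enum carried as an integer
}

// decodeSpan parses the two fields and returns the raw trace ID bytes.
// encoding/json skips unknown keys by default, which matches the
// requirement that receivers ignore fields with unknown names.
func decodeSpan(data []byte) (traceID []byte, kind int, err error) {
	var s spanJSON
	if err = json.Unmarshal(data, &s); err != nil {
		// An enum name string such as "SPAN_KIND_SERVER" fails here,
		// because Kind only accepts a JSON number.
		return nil, 0, err
	}
	// hex.DecodeString accepts both upper-case and lower-case digits.
	traceID, err = hex.DecodeString(s.TraceID)
	if err != nil {
		return nil, 0, err
	}
	return traceID, s.Kind, nil
}

func main() {
	payload := []byte(`{"traceId":"5B8EFFF798038103D269B633813FC60C","kind":2,"futureField":true}`)
	id, kind, err := decodeSpan(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("trace ID bytes: %d, kind: %d\n", len(id), kind)
}
```

Using a plain `int` for `kind` is a deliberate shortcut: it makes the decoder reject enum name strings without any extra validation code.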
The server MUST set "Content-Type: application/x-protobuf" header if the @@ -514,13 +519,13 @@ the number of spans/data points/log records it rejected. The server SHOULD populate the `error_message` field with a human-readable error message in English. The message should explain why the -server rejected parts of the data, and might offer guidance on how users +server rejected parts of the data and might offer guidance on how users can address the issues. The protocol does not attempt to define the structure of the error message. -Servers MAY also make use of the `partial_success` field to convey -warnings/suggestions to clients even when the request was fully accepted. -In such cases, the `rejected_` field MUST have a value of `0` and +Servers MAY also use the `partial_success` field to convey +warnings/suggestions to clients even when the server fully accepts the request. +In such cases, the `rejected_` field MUST have a value of `0`, and the `error_message` field MUST be non-empty. The client MUST NOT retry the request when it receives a partial success @@ -528,11 +533,11 @@ response where the `partial_success` is populated. ##### Failures -If the processing of the request fails the server MUST respond with appropriate -`HTTP 4xx` or `HTTP 5xx` status code. See sections below for more details about +If the processing of the request fails, the server MUST respond with an appropriate +`HTTP 4xx` or `HTTP 5xx` status code. See the sections below for more details about specific failure cases and HTTP status codes that should be used. -Response body for all `HTTP 4xx` and `HTTP 5xx` responses MUST be a +The response body for all `HTTP 4xx` and `HTTP 5xx` responses MUST be a Protobuf-encoded [Status](https://godoc.org/google.golang.org/genproto/googleapis/rpc/status#Status) message that describes the problem. @@ -550,7 +555,8 @@ below about what this field can contain in each specific failure case. 
The server SHOULD use HTTP response status codes to indicate retryable and not-retryable errors for a particular erroneous situation. The client SHOULD honour HTTP response status codes as retryable or not-retryable. -The requests that receive a response status code listed in following table SHOULD be retried. +The requests that receive a response status code listed in the following table SHOULD +be retried. All other `4xx` or `5xx` response status codes MUST NOT be retried. |HTTP response status code| @@ -563,7 +569,7 @@ All other `4xx` or `5xx` response status codes MUST NOT be retried. ##### Bad Data If the processing of the request fails because the request contains data that -cannot be decoded or is otherwise invalid and such failure is permanent then the +cannot be decoded or is otherwise invalid and such failure is permanent, then the server MUST respond with `HTTP 400 Bad Request`. The `Status.details` field in the response SHOULD contain a [BadRequest](https://github.com/googleapis/googleapis/blob/d14bf59a446c14ef16e9931ebfc8e63ab549bf07/google/rpc/error_details.proto#L166) @@ -575,29 +581,29 @@ response. ##### OTLP/HTTP Throttling If the server receives more requests than the client is allowed or the server is -overloaded the server SHOULD respond with `HTTP 429 Too Many Requests` or +overloaded, the server SHOULD respond with `HTTP 429 Too Many Requests` or `HTTP 503 Service Unavailable` and MAY include ["Retry-After"](https://tools.ietf.org/html/rfc7231#section-7.1.3) header with a recommended time interval in seconds to wait before retrying. -The client SHOULD honour the waiting interval specified in "Retry-After" header -if it is present. If the client receives `HTTP 429` or `HTTP 503` response and -"Retry-After" header is not present in the response then the client SHOULD -implement an exponential backoff strategy between retries. +The client SHOULD honour the waiting interval specified in the "Retry-After" +header if it is present. 
If the client receives an `HTTP 429` or an `HTTP 503` +response and the "Retry-After" header is not present in the response, then the +client SHOULD implement an exponential backoff strategy between retries. ##### All Other Responses All other HTTP responses that are not explicitly listed in this document should -be treated according to HTTP specification. +be treated according to HTTP specifications. -If the server disconnects without returning a response the client SHOULD retry +If the server disconnects without returning a response, the client SHOULD retry and send the same request. The client SHOULD implement an exponential backoff strategy between retries to avoid overwhelming the server. #### OTLP/HTTP Connection -If the client is unable to connect to the server the client SHOULD retry the -connection using exponential backoff strategy between retries. The interval +If the client cannot connect to the server, the client SHOULD retry the +connection using an exponential backoff strategy between retries. The interval between retries must have a random jitter. The client SHOULD keep the connection alive between requests. @@ -613,8 +619,8 @@ on the "Content-Type" request header. #### OTLP/HTTP Concurrent Requests -To achieve higher total throughput the client MAY send requests using several -parallel HTTP connections. In that case the maximum number of parallel +To achieve higher total throughput, the client MAY send requests using several +parallel HTTP connections. In that case, the maximum number of parallel connections SHOULD be configurable. #### OTLP/HTTP Default Port @@ -625,14 +631,13 @@ The default network port for OTLP/HTTP is 4318. ### Multi-Destination Exporting -When the telemetry data from one client must be sent to more than one -destination server there is an additional complication that must be accounted -for. 
When one of the servers acknowledges the data and the other server does not -(yet) acknowledges the client needs to make a decision about how to move -forward. +An additional complication must be accounted for when one client must send +telemetry data to more than one destination server. When one of the servers +acknowledges the data and the other server does not (yet), the client needs +to decide how to move forward. -In such situation the client SHOULD implement queuing, acknowledgement -handling and retrying logic per destination. This ensures that servers do not +In such a situation, the client SHOULD implement queuing, acknowledgment +handling, and retrying logic per destination. This ensures that servers do not block each other. The queues SHOULD reference shared, immutable data to be sent, thus minimizing the memory overhead caused by having multiple queues. @@ -659,27 +664,27 @@ deliberate choice and is considered to be the right tradeoff for telemetry data. OTLP will evolve and change over time. Future versions of OTLP must be designed and implemented in a way that ensures that clients and servers that implement different versions of OTLP can interoperate and exchange telemetry data. Old -clients must be able to talk to new servers and vice versa. If new versions of -OTLP introduce new functionality that cannot be understood and supported by -nodes implementing the old versions of OTLP the protocol must regress to the -lowest common denominator from functional perspective. +clients must be able to talk to new servers and vice versa. Suppose new versions +of OTLP introduce new functionality that cannot be understood and supported by +nodes implementing the old versions of OTLP. In that case, the protocol must +regress to the lowest common denominator from a functional perspective. 
-When possible the interoperability MUST be ensured between all versions of +When possible, interoperability MUST be ensured between all versions of OTLP that are not declared obsolete. OTLP does not use explicit protocol version numbering. OTLP's interoperability of clients and servers of different versions is based on the following concepts: 1. OTLP (current and future versions) defines a set of capabilities, some of - which are mandatory, others are optional. Clients and servers must implement - mandatory capabilities and can choose implement only a subset of optional + which are mandatory, while others are optional. Clients and servers must implement + mandatory capabilities and can choose to implement only a subset of optional capabilities. -2. For minor changes to the protocol future versions and extension of OTLP are - encouraged to use the ability of Protobufs to evolve message schema in - backwards compatible manner. Newer versions of OTLP may add new fields to +2. For minor changes to the protocol, future versions and extensions of OTLP are + encouraged to use Protobuf's ability to evolve the message schema in + a backward-compatible manner. Newer versions of OTLP may add new fields to messages that will be ignored by clients and servers that do not understand - these fields. In many cases careful design of such schema changes and correct + these fields. In many cases, careful design of such schema changes and correct choice of default values for new fields is enough to ensure interoperability of different versions without nodes explicitly detecting that their peer node has different capabilities. @@ -687,11 +692,11 @@ of clients and servers of different versions is based on the following concepts: 3. More significant changes must be explicitly defined as new optional capabilities in future OTEPs. Such capabilities SHOULD be discovered by client and server implementations after establishing the underlying - transport. 
The exact discovery mechanism SHOULD be described in future OTEPs + transport. The exact discovery mechanism SHOULD be described in future OTEPs, which define the new capabilities and typically can be implemented by making a discovery request/response message exchange from the client to server. The mandatory capabilities defined by this specification are implied and do not - require a discovery. The implementation which supports a new, optional + require discovery. The implementation which supports a new, optional capability MUST adjust its behavior to match the expectation of a peer that does not support a particular capability.
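The fallback rule in item 3 can be pictured as taking the intersection of the optional capabilities that each peer supports. The Go sketch below is purely illustrative: the specification leaves the discovery mechanism to future OTEPs, so the function `effectiveCapabilities` and the capability names used here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// effectiveCapabilities returns the optional capabilities that both the
// local node and its peer support. Mandatory capabilities are implied and
// therefore not listed. Hypothetical helper, not defined by OTLP.
func effectiveCapabilities(local, peer []string) []string {
	peerSet := make(map[string]bool, len(peer))
	for _, c := range peer {
		peerSet[c] = true
	}
	shared := []string{}
	for _, c := range local {
		if peerSet[c] {
			shared = append(shared, c)
			delete(peerSet, c) // avoid duplicates if local repeats a name
		}
	}
	sort.Strings(shared)
	return shared
}

func main() {
	// Capability names are made up for the example.
	local := []string{"capability-b", "capability-a", "capability-c"}
	peer := []string{"capability-a", "capability-c", "capability-d"}
	fmt.Println(effectiveCapabilities(local, peer))

	// A peer that reports nothing leaves only the mandatory behavior.
	fmt.Println(len(effectiveCapabilities(local, nil)))
}
```

A node would then enable only the returned capabilities when talking to that peer, which is one way to match the expectation of a peer that lacks a given capability.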