diff --git a/specification/metrics/api-user.md b/specification/metrics/api-user.md index e9ea38f8d2d..6fb48a32d1b 100644 --- a/specification/metrics/api-user.md +++ b/specification/metrics/api-user.md @@ -24,20 +24,16 @@ -Note: This specification for the v0.3 OpenTelemetry milestone does not -include specification related to the Observer instrument, as described -in the [overview](api.md). Observer instruments were detailed -in [OTEP -72-metric-observer](https://github.com/open-telemetry/oteps/blob/master/text/0072-metric-observer.md) -and will be added to this document following the v0.3 milestone. -Gauge instruments will be removed from this specification folowing the -v0.3 milestone too, as discussed in [OTEP -80-remove-metric-gauge](https://github.com/open-telemetry/oteps/blob/master/text/0080-remove-metric-gauge.md). - ## Overview Metric instruments are the entry point for application and framework developers to instrument their code using counters, gauges, and measures. -Metrics are created by calling methods on a `Meter` which is in turn created by a global `MeterProvider`. +Metrics are created by calling methods on a `Meter` which is in turn created by a `MeterProvider`. + +### Obtaining a MeterProvider + +A MeterProvider instance can be obtained by initializing and configuring an OpenTelemetry SDK. The application or library chooses whether it will use a global instance of the MeterProvider interface, or whether it will use dependency injection to allow the caller or `main()` function to configure the provider. The same pattern is used for other aspects of the OpenTelemetry API, for configuring the Tracing SDK and Propagators. + +Use of a global instance may be seen as an anti-pattern in many situations, but in most cases it is the correct pattern for telemetry data, in order to combine telemetry data from inter-dependent libraries _without use of dependency injection_. OpenTelemetry language APIs SHOULD offer a global instance for this reason.
Languages that offer a global instance MUST ensure that `Meter` instances allocated through the global `MeterProvider` and instruments allocated through those `Meter` instances have their initialization deferred until the global SDK is initialized. ### Obtaining a Meter @@ -66,6 +62,7 @@ external systems. Metric instrument names conform to the following syntax: Metric instrument names belong to a namespace, which is the `name` of the associated `Meter`, allowing the same metric name to be used in multiple libraries of code, unambiguously, within the same application. +`Meter` implementations MUST NOT register the same instrument twice; implementations MUST return errors in this case. Metric instrument names SHOULD be semantically meaningful, even when viewed outside of the context of the originating Meter name. For example, when instrumenting @@ -76,20 +73,20 @@ as it would inform the viewer of the semantic meaning of the latency being track be tracked elsewhere in the specifications.) Metric instruments are defined using a `Meter` instance, using a variety -of `New` methods specific to the kind of metric and type of input (integer -or floating point). The Meter will return an error when a metric name is -already registered with a different kind for the same name. Metric systems -are expected to automatically prefix exported metrics by the namespace, if +of `New` methods detailed below. Exporters SHOULD NOT automatically +prefix the instrument name by the `Meter` name, so that alternate instrumentation libraries can be configured to report identical instrument names. Metric exporters +are expected to automatically sanitize metric names, if necessary, in a manner consistent with the target system. For example, a Prometheus exporter SHOULD use the namespace followed by `_` as the [application prefix](https://prometheus.io/docs/practices/naming/#metric-names).
### Format of a metric event -Regardless of the instrument kind or method of input, metric events -include the instrument, a numerical value, and an optional -set of labels. The instrument, discussed in detail below, contains -the metric name and various optional settings. +As [stated in the general API +specification](api.md#metric-event-format), metric events consist of +the timestamp, the instrument definition (name, kind, description, +unit), a numerical value, an optional label set, and a resource label +set. Labels are key:value pairs associated with events describing various dimensions or categories that describe the event. A "label key" refers to the key @@ -97,39 +94,39 @@ component while "label value" refers to the correlated value component of a label. Label refers to the pair of label key and value. Labels are passed in to the metric event at construction time. -Metric events always have an associated component name, the name -passed when constructing the corresponding `Meter`. Metric events are -associated with the current (implicit or explicit) OpenTelemetry -context, including distributed correlation context and span context. +Metric events always have an associated reporting library name and +optional version, which are passed when constructing the corresponding +`Meter`. Synchronous metric events are additionally associated with +the OpenTelemetry [Context](../context/api.md), including +distributed correlation context and span context. ### New constructors -The `Meter` interface allows creating of a registered metric -instrument using methods specific to each kind of metric. There are -six constructors representing the three kinds of instrument taking -either floating point or integer inputs, see the detailed design below. +The `Meter` interface allows creating registered metric instruments +using a specific constructor for each kind of instrument.
There are +at least six constructors representing the six kinds of instrument, +and possibly more as dictated by the language. For example, if +specializations are provided for integer and floating point numbers, +the OpenTelemetry API would support 12 constructors. Binding instruments to a single `Meter` instance has two benefits: 1. Instruments can be exported from the zero state, prior to first use, with no explicit `Register` call -2. The name provided by the `Meter` satisfies a namespace requirement - -The recommended practice is to define structures to contain the -instruments in use and keep references only to the instruments that -are specifically needed. +2. The library-name and version are implicitly included in the metric event. We recognize that many existing metric systems support allocating metric instruments statically and providing the `Meter` interface at -the time of use. In this example, typical of statsd clients, existing +the time of use. In one example, typical of statsd clients, existing code may not be structured with a convenient place to store new metric instruments. Where this becomes a burden, it is recommended to use -the global meter provider to construct a static `Meter`, to -construct metric instruments. - -The situation is similar for users of Prometheus clients, where -instruments are allocated statically and there is an implicit global. -Such code may not have access to the appropriate `Meter` where -instruments are defined. Where this becomes a burden, it is +the global `MeterProvider` to construct a static `Meter`, and to +construct and use globally-scoped metric instruments. + +The situation is similar for users of existing Prometheus clients, where +instruments can be allocated to the global `Registerer`. +Such code may not have access to an appropriate `MeterProvider` or `Meter` +instance at the location where instruments are defined.
+Where this becomes a burden, it is recommended to use the global meter provider to construct a static named `Meter`, to construct metric instruments. @@ -140,86 +137,111 @@ is no method to delete them. #### Metric instrument constructor example code In this Golang example, a struct holding four instruments is built -using the provided, non-global `Meter` instance. +using the provided, non-global `Meter` instance. An example `server` +type is shown, which holds a reference to the `instruments` struct. There +are three synchronous instruments and one asynchronous instrument. ```golang -type instruments struct { - counter1 metric.Int64Counter - counter2 metric.Float64Counter - gauge3 metric.Int64Gauge - measure4 metric.Float64Measure -} +// server runs a service. It is initialized with the `Meter` used for +// metric instrumentation. +type server struct { + meter metric.Meter + instruments *instruments -func newInstruments(metric.Meter meter) *instruments { - return &instruments{ - counter1: meter.NewCounter("counter1", ...), // Optional parameters - counter2: meter.NewCounter("counter2", ...), // are discussed below. - gauge3: meter.NewGauge("gauge3", ...), - measure4: meter.NewMeasure("measure4", ...), - } + // suppose a server organizes a number of things: + things []*Thing } -``` - -Code will be structured to call `newInstruments` somewhere in a -constructor and keep the `instruments` reference for use at runtime. -Here's an example of building a server with configured instruments and -a single metric operation. - -```golang -type server struct { - meter metric.Meter - instruments *instruments - // ... other fields +// instruments are the set of instruments used by this server. +type instruments struct { + counter1 metric.Int64Counter + counter2 metric.Float64Counter + recorder3 metric.Float64ValueRecorder + observer4 metric.Int64SumObserver } -func newServer(meter metric.Meter) *server { - return &server{ - meter: meter, - instruments: newInstruments(meter), - // ...
other fields - } +// setInstruments configures a server for the passed `MeterProvider` instance, +// initializing the instruments it uses. +func (s *server) setInstruments(provider metric.MeterProvider) { + // Must() causes constructor errors to panic, which could only happen + // if another Meter named "server-library" has already registered the + // metric names below. + meter := provider.Meter("server-library") + must := metric.Must(meter) + s.meter = meter + s.instruments = &instruments{ + counter1: must.NewInt64Counter("counter1"), // Optional parameters + counter2: must.NewFloat64Counter("counter2"), // are discussed below. + recorder3: must.NewFloat64ValueRecorder("recorder3"), + observer4: must.NewInt64SumObserver("observer4", + metric.NewInt64ObserverCallback(s.observeSumNumber4)), + } } -// ... +// newServer returns a server with fully initialized metric instruments. +func newServer(provider metric.MeterProvider) *server { + s := &server{ + // ... other fields + } + s.setInstruments(provider) + return s +} -func (s *server) operate(ctx context.Context) { - // ... other work +// operate processes one request. It uses the synchronous instruments to +// support monitoring request performance. +func (s *server) operate(ctx context.Context, req *request) { + thing := s.things[req.thingNumber] + // ... other work + + s.instruments.counter1.Add( + ctx, + 1, + key.String("thing_type", thing.Type()), + key.String("label1", "..."), + ) +} - s.instruments.counter1.Add(ctx, 1, - key.String("label1", "..."), - key.String("label2", "..."), +// observeSumNumber4 is an asynchronous instrument callback for the +// "observer4" instrument, which captures the current value of a sum +// for each Thing handled by this server.
+func (s *server) observeSumNumber4(result metric.Int64ObserverResult) { + for _, thing := range s.things { + value := thing.measureSomething() + s.instruments.observer4.Observe( + result, + value, + key.String("thing_type", thing.Type()), + ) + } } ``` -### Metric calling conventions +The example above was structured to avoid using the global +`MeterProvider` instance, for the purposes of demonstration. With the +use of the global instance, the example `server` type is simplified by +removing the `meter` and `instruments` fields, placing them in static +variables. It is up to the application author whether to use the +global instance or not. + +### Synchronous calling conventions -The metrics API provides three semantically equivalent ways to capture measurements: +The metrics API provides three semantically equivalent ways to capture +measurements using synchronous instruments: -- calling bound metric instruments -- calling unbound metric instruments with labels -- batch recording without a metric instrument +- calling bound metric instruments, which have a pre-associated set of labels +- calling unbound metric instruments, passing the associated set of labels directly +- batch recording measurements for multiple instruments using a single set of labels All three methods generate equivalent metric events, but offer varying degrees of performance and convenience. -This section applies to calling conventions for counter, gauge, and -measure instruments. - -As described above, metric events consist of an instrument, a set of labels, -and a numerical value, plus associated context. The performance of a metric +As described above, metric events consist of an instrument definition, a set of labels, +and a numerical value, plus associated context and resources. The performance of a metric API depends on the work done to enter a new measurement.
One approach to reduce cost is to aggregate intermediate results in the SDK, so that subsequent events happening in the same collection period, for the same set of labels, combine into the same working memory. -In this document, the term "aggregation" is used to describe the -process of coalescing metric events for a complete set of labels, -whereas "grouping" is used to describe further coalescing aggregate -metric data into a reduced number of key dimensions. SDKs may be -designed to perform aggregation and/or grouping in the process, with -various trade-offs in terms of complexity and performance. - #### Bound instrument calling convention In situations where performance is a requirement and a metric instrument is @@ -230,11 +252,15 @@ re-used with specific labels. If an instrument will be used with the same labels more than once, obtaining a bound instrument corresponding to the labels ensures the highest performance available. -To bind an instrument, use the `Bind(labels)` method to return an interface -that supports the `Add()`, `Set()`, or `Record()` method of the instrument in -question. +To bind an instrument, use the `Bind(labels...)` method to return an +interface that supports the corresponding synchronous API (i.e., +`Add()` or `Record()`). Bound instruments are invoked without labels; +the corresponding metric event is associated with the labels that were +bound to the instrument. Bound instruments may consume SDK resources +indefinitely until the user calls `Unbind()` to release the bound +instrument. -Bound instruments may consume SDK resources indefinitely. +For example, to repeatedly update a counter with the same labels: ```golang func (s *server) processStream(ctx context.Context) { @@ -269,7 +295,10 @@ For example, to update a single counter: func (s *server) method(ctx context.Context) { // ... other work - s.instruments.counter1.Add(ctx, 1, ...) 
+ s.instruments.counter1.Add(ctx, 1, + key.String("labelA", "..."), + key.String("labelB", "..."), + ) } ``` @@ -299,7 +328,7 @@ a sequence of direct calls, with the addition of atomicity. Because values are entered in a single call, the SDK is potentially able to implement an atomic update, from the exporter's point of view. Calls to `RecordBatch` may potentially -reduce costs because the SDK can enqueue a single bulk update, or take +reduce cost because the SDK can enqueue a single bulk update, or take a lock only once, for example. ##### Missing label keys @@ -309,6 +338,20 @@ an exporter, and where there are keys that are missing, the SDK is required to consider these values _explicitly unspecified_, a distinct value type of the exported data model. +##### Option: Labels using a built-in type + +Some programming languages have an existing facility that supports +passing dictionaries of unique key:value mappings. The OpenTelemetry +specification allows languages to take this optional approach. If the +use of a dictionary for key:value mappings is both idiomatic and not +considered an expensive option in the language, this is an acceptable +alternative to passing labels as an ordered list. + +When labels are passed as a dictionary, not as an ordered list, the +mapping should be unique. The OpenTelemetry rule for handling +duplicates only applies when labels are passed as an ordered list +(which is that the last value in the sequence takes precedence). + ##### Option: Ordered labels As a language-level decision, APIs may support label key ordering. In this @@ -441,3 +484,164 @@ func (s *server) doThing(ctx context.Context) { // ... } ``` + +## Metric instrument selection + +To guide the user in selecting the right kind of metric instrument for +an application, we'll consider several questions about the kind of +numbers being reported. Here are some ways to help choose. Examples +are provided in the following section.
+ +### Counters and Measures compared + +Counters and Measures are both recommended for reporting measurements +taken during synchronous activity, driven by events in the program. +These measurements include an associated distributed context, the +effective span context (if any), the correlation context, and +user-provided LabelSet values. + +Start with an application for metrics data in mind. It is useful to +consider whether you are more likely to be interested in the sum of +values or any other aggregate value (e.g., average, histogram), as +processed by the instrument. Counters are useful when only the sum is +interesting. Measures are useful when the sum and any other kind of +summary information about the individual values are of interest. + +If only the sum is of interest, use a Counter instrument. + +If you are interested in any other kind of summary value or statistic, +such as mean, median and other quantiles, or minimum and maximum +value, use a Measure instrument. Measure instruments are used to +report any kind of measurement that is not typically expressed as a +rate or as a total sum. + +### Observer instruments + +Observer instruments are recommended for reporting measurements about +the state of the program periodically. These expose current +information about the program itself, not related to individual events +taking place in the program. Observer instruments are reported +outside of a context, thus do not have an effective span context or +correlation context. + +Observer instruments are meant to be used when measured values report +on the current state of the program, as opposed to an event or a +change of state in the program. + +## Examples + +### Reporting total bytes read + +You wish to monitor the total number of bytes read from a messaging +server that supports several protocols. The number of bytes read +should be labeled with the protocol name and aggregated in the +process. + +This is a typical application for the Counter instrument. 
Use one Counter for +capturing the number of bytes read. When handling a request, compute a LabelSet +containing the name of the protocol and potentially other useful labels, then +call `Add()` with the same labels and the number of bytes read. + +To lower the cost of this reporting, you can `Bind()` the instrument with each +of the supported protocols ahead of time. + +### Reporting total bytes read and bytes per request + +You wish to monitor the total number of bytes read as well as the +number of bytes read per request, to have observability into total +traffic as well as typical request size. As with the example above, +these metric events should be labeled with a protocol name. + +This is a typical application for the Measure instrument. Use one +Measure for capturing the number of bytes per request. A sum +aggregation applied to this data yields the total bytes read; other +aggregations allow you to export the minimum and maximum number of +bytes read, as well as the average value, and quantile estimates. + +In this case, the guidance is to create a single instrument. Do not +create a Counter instrument to export a sum when you want to export +other summary statistics using a Measure instrument. + +### Reporting system call duration + +You wish to monitor the duration of a specific system call being made +frequently in your application, with a label to indicate a file name +associated with the operation. + +This is a typical application for the Measure instrument. Use a timer +to measure the duration of each call and `Record()` the measurement +with a label for the file name. + +### Reporting request size + +You wish to monitor a trend in request sizes, which means you are +interested in characterizing individual events, as opposed to a sum. +Label these with relevant information that may help explain variance +in request sizes, such as the type of the request. + +This is a typical application for a Measure instrument.
The standard +aggregation for Measure instruments will compute a measurement sum and +the event count, which determines the mean request size, as well as +the minimum and maximum sizes. + +### Reporting a per-request finishing account balance + +There's a number that rises and falls such as a bank account balance. +You wish to monitor the average account balance at the end of +requests, broken down by transaction type (e.g., withdrawal, deposit). + +Use a Measure instrument to report the current account balance at the +end of each request. Use a label for the transaction type. + +### Reporting process-wide CPU usage + +You are interested in reporting the CPU usage of the process as a +whole, which is computed via a (relatively expensive) system call +which returns two values, process-lifetime user and system +cpu-seconds. It is not necessary to update this measurement +frequently, because it is meant to be used only for accounting +purposes. + +A single Observer instrument is recommended for this case, with a +label value to distinguish user from system CPU time. The Observer +callback will be called once per collection interval, which lowers the +cost of collecting this information. + +CPU usage is something that we naturally sum, which raises several +questions. + +- Why not use a Counter instrument? In order to use a Counter instrument, we would need to convert total usage figures into deltas. Calculating deltas from the previous measurement is easy to do, but Counter instruments are not meant to be used from callbacks. +- Why not report deltas in the Observer callback? Observer instruments are meant to be used to observe current values. Nothing prevents reporting deltas with an Observer, but the standard aggregation for Observer instruments is to sum the current value across distinct labels. 
The standard behavior is useful for determining the current rate of CPU usage, but special configuration would be required for an Observer instrument to use Counter aggregation. + +### Reporting per-shard memory holdings + +Suppose you have a widely-used library that acts as a client to a +sharded service. For each shard it maintains some client-side state, +holding a variable amount of memory per shard. + +Observe the current allocation per shard using an Observer instrument with a +shard label. These can be aggregated across hosts to compute cluster-wide +memory holdings by shard, for example, using the standard aggregation for +Observers, which sums the current value across distinct labels. + +### Reporting number of active requests + +Suppose your server maintains the count of active requests, which +rises and falls as new requests begin and end processing. + +Observe the number of active requests periodically with an Observer +instrument. Labels can be used to indicate which application-specific +properties are associated with these events. + +### Reporting bytes read and written correlated by end user + +An application uses storage servers to read and write from some +underlying media. These requests are made in the context of the end +user that made the request into the frontend system, with Correlation +Context passed from the frontend to the storage servers carrying these +properties. + +Use Counter instruments to report the number of bytes read and written +by the storage server. Configure the SDK to use a Correlation Context +label key (e.g., named "app.user") to aggregate events by all metric +instruments.