From b327ca90d586291f31ce02f84113a6a145496009 Mon Sep 17 00:00:00 2001
From: Josh W Lewis
Date: Wed, 24 Apr 2024 10:38:36 -0500
Subject: [PATCH] Update RFC with more info and address feedback

---
 text/0000-build-observability.md | 235 +++++++++++++++++++++++++------
 1 file changed, 192 insertions(+), 43 deletions(-)

diff --git a/text/0000-build-observability.md b/text/0000-build-observability.md
index 98eb6da42..fbf9e8f93 100644
--- a/text/0000-build-observability.md
+++ b/text/0000-build-observability.md
@@ -12,10 +12,10 @@
 # Summary
 [summary]: #summary
 
-This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to
-grant platform operators and buildpack operators more insight into buildpack
+This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to
+grant platform operators and buildpack operators more insight into buildpack
 performance and behavior. This RFC describes new opt-in functionality
-for both pack and the buildpack spec such that OpenTelemetry data may be
+for both pack and the buildpack spec such that OpenTelemetry data may be
 exported to the build file system.
 
 # Definitions
@@ -29,17 +29,20 @@ exported to the build file system.
 # Motivation
 [motivation]: #motivation
 
-Buildpack authors and platform operators desire insight into usage and
-performance of builds and buildpacks on their platform. Questions like
-"How long does each buildpack compile phase take?", "Which buildpacks
-commonly fail to compile?", "How often is a certain buildpack used?",
-"Which versions of Go are being installed?", and "How long does it take to
-download node_modules?" are important questions for authors and operators that
-are currently difficult to answer.
+Buildpack authors and platform operators desire insight into usage, error
+scenarios, and performance of builds and buildpacks on their platform. The
+following questions are all important for these folks, but difficult to answer:
+
+- "Which buildpacks commonly fail to compile?"
+- "How often does a particular error scenario occur?"
+- "How long does each buildpack compile phase take?"
+- "How often is a certain buildpack used?"
+- "Which versions of Go are being installed?"
+- "How long does it take to download node_modules?"
 
 Instrumenting lifecycle and buildpacks with opt-in OpenTelemetry tracing will
-allow platform operators to better understand performance and behavior of their
-builds and buildpacks and as a result, provide better service and build
+allow platform operators to better understand performance and behavior of their
+builds and buildpacks and as a result, provide better service and build
 experiences.
 
 To protect privacy and prevent unnecessary collection of data, this
@@ -50,8 +53,8 @@ functionality should be optional and anonymous.
 
 This RFC aims to provide a solution for two types of OpenTelemetry traces:
 
-1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were
-available, which buildpacks were detected, how long the detect, build, or
+1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were
+available, which buildpacks were detected, how long the detect, build, or
 export phase took, and so on. This telemetry data may be exported by lifecycle.
 2) Buildpack tracing: Telemetry data specific to a buildpack like how long it
 took to download a language binary, which language version was selected, and so
@@ -61,20 +64,21 @@ Though the sources and contents of the telemetry data differ, both types may be
 emitted to the build file system in OpenTelemetry's [File Exporter
 Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/).
 
+In this solution, each lifecycle phase would write a `.jsonl` file with
+tracing data for that phase. For example, `lifecycle detector --telemetry`
+would write to `/cnb/telemetry/lifecycle-detect.jsonl`. Additionally, each
+buildpack may also write tracing data to its own `.jsonl` files (at
+`/cnb/telemetry/{BUILDPACK_ID}.jsonl`).
 
-For example, `lifecycle detector --telemetry` might save a file like this:
-
-```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
-```
-
-And a buildpack's compile phase might save a file like this:
+These `.jsonl` files may be read by platform operators for consumption,
+transformation, enrichment, and/or export to an OpenTelemetry backend. Given
+that builds may crash or fail at any point, these files must be written to
+often and regularly to prevent data loss.
 
-```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"buildpack.version","value":{"stringValue":"1.0.0"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"install-nodejs","startTimeUnixNano":"1581452772000001321","endTimeUnixNano":"1581452773000004789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000002123","name":"restored-from-cache"}],"attributes":[{"key":"nodejs.version","value":{"stringValue":"20.0.0"}}]}]}]}]}
-{ // additional spans... // }
-```
+Platform operators will likely want to view or analyze this data. These
+telemetry files are in OTLP compatible format, so may be exported to one or
+more OpenTelemetry backends like Honeycomb, Prometheus, and [many
+others](https://opentelemetry.io/ecosystem/vendors/).
 
 
 # How it Works
@@ -84,12 +88,57 @@ And a buildpack's compile phase might save a file like this:
 
 If `lifecycle` is provided the telemetry opt-in flag (such as `--telemetry`),
 `lifecycle` phases (such as `detect`, `build`, `export`) may emit an
-OpenTelemetry File Export with tracing data to a known location, such as
+OpenTelemetry File Export with tracing data to a known location, such as
 `/cnb/telemetry/lifecycle-detect.jsonl` with contents like this:
 
 ```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
+{
+  "resourceSpans": [
+    {
+      "resource": {
+        "attributes": [
+          {
+            "key": "lifecycle.version",
+            "value": {
+              "stringValue": "0.17.1"
+            }
+          }
+        ]
+      },
+      "scopeSpans": [
+        {
+          "scope": {},
+          "spans": [
+            {
+              "traceId": "",
+              "spanId": "",
+              "parentSpanId": "",
+              "name": "buildpack-detect",
+              "startTimeUnixNano": "1581452772000000321",
+              "endTimeUnixNano": "1581452773000000789",
+              "droppedAttributesCount": 2,
+              "events": [
+                {
+                  "timeUnixNano": "1581452773000000123",
+                  "name": "detect-pass"
+                }
+              ],
+              "attributes": [
+                {
+                  "key": "buildpack-id",
+                  "value": {
+                    "stringValue": "heroku/nodejs-engine"
+                  }
+                }
+              ],
+              "droppedEventsCount": 1
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
 ```
 
 
@@ -97,26 +146,101 @@ OpenTelemetry File Export with tracing data to a known location, such as
 
 During a buildpack's `detect` or `build` execution, a buildpack may emit an
 OpenTelemetry File Export with tracing data to `/cnb/telemetry/#{buildpack-id}.jsonl`
-with contents like this:
+with contents like this:
 
 ```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
+{
+  "resourceSpans": [
+    {
+      "resource": {
+        "attributes": [
+          {
+            "key": "lifecycle.version",
+            "value": {
+              "stringValue": "0.17.1"
+            }
+          }
+        ]
+      },
+      "scopeSpans": [
+        {
+          "scope": {},
+          "spans": [
+            {
+              "traceId": "",
+              "spanId": "",
+              "parentSpanId": "",
+              "name": "buildpack-detect",
+              "startTimeUnixNano": "1581452772000000321",
+              "endTimeUnixNano": "1581452773000000789",
+              "droppedAttributesCount": 2,
+              "events": [
+                {
+                  "timeUnixNano": "1581452773000000123",
+                  "name": "detect-pass"
+                }
+              ],
+              "attributes": [
+                {
+                  "key": "buildpack-id",
+                  "value": {
+                    "stringValue": "heroku/nodejs-engine"
+                  }
+                }
+              ],
+              "droppedEventsCount": 1
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
 ```
 
 ### Lifetime
 
-The telemetry files may be written at any point during the build. They should
-exist as a part of the build file system for the duration of the build.
-Telemetry files will not be included in the final image.
+Telemetry files may be written at any point during the build, so that they
+are persisted in cases of failures to detect, failures to build, process
+terminations, or crashes. The `jsonl` format allows telemetry libraries to
+safely append additional JSON objects to the end of a telemetry file, so
+telemetry data can be flushed to the file frequently. Telemetry files should
+not be truncated or deleted so that telemetry processing by a platform can
+happen during or after a build. Telemetry files should not be included in the
+build result, as they are not relevant, and would likely negatively impact
+image size and reproducibility.
 
 ### Access
 
-The telemetry files should remain readable so that they may be analyzed by
-the user and/or platform. However, they should be write protected in some way to prevent
-malicious buildpacks from injecting tracing data into other buildpack's
-telemetry file.
+The telemetry files should be readable so that they may be analyzed by
+the user and/or platform. However, they should be write-protected
+to prevent malicious buildpacks from injecting tracing data into other
+buildpack or lifecycle telemetry files.
+
+
+### Consumption
+
+This RFC leaves the consumption of telemetry files to the platform operator.
+Platform operators choosing to use these metrics need to read them either during
+or after the build. This can be done using existing OpenTelemetry libraries.
+Platform operators may choose to optionally enrich or modify the tracing data
+as they see fit (with data like `instance_id` or `build_id`). Platform
+operators will likely want to export this data to an OpenTelemetry backend for
+persistence and analysis, and again, this may be done with existing
+OpenTelemetry libraries.
 
+### Viewing and Analyzing
+
+Once the lifecycle and buildpack traces are exported to an OpenTelemetry
+backend, platform operators should be able to (depending on the features of the
+backend):
+
+- View the complete trace for a build
+- View or query attributes attached to spans (e.g. `buildpack_id`,
+  `nodejs_version`)
+- View or query span durations
+- View or query error types and/or messages
+- and more
 
 # Migration
 [migration]: #migration
@@ -142,6 +266,13 @@ design:
 usernames, IP addresses, etc.), so the telemetry data emitted by
 `lifecycle` will also be free of user-identifiaible data.
 
+### File Export Format Status
+
+While the [File Exporter
+Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/) is
+an official format, and matches the OTLP format nearly exactly (and thus seems
+unlikely to change), it is listed as experimental status.
+
 
 # Alternatives
 [alternatives]: #alternatives
@@ -154,6 +285,28 @@ provide statistical information in aggregate. Since `lifecycle` and `pack` only
 run one build at a time, there is no way to aggregate information about
 multiple builds in `pack` or `lifecycle`.
 
+### OTLP
+
+The [OpenTelemetry Protocol](https://opentelemetry.io/docs/specs/otlp/) is a
+network delivery protocol for OpenTelemetry data. Instead of emitting files as
+this RFC describes, lifecycle and buildpacks could instead connect to an
+OpenTelemetry collector provided by the platform operator. This pattern is
+well supported and well known.
+
+However, there are drawbacks:
+
+- In local `pack build` scenarios, it's unlikely that users would have an
+  OpenTelemetry collector running. This RFC solution does not require a
+  collector.
+- lifecycle and buildpacks would need to know where the OpenTelemetry collector
+  is and how to authenticate with it. Lifecycle and buildpacks that wish to
+  emit telemetry may not want to deal with the mountain of configuration to
+  support various collectors.
+- Platform operators may have complex network topology that may make supporting
+  this feature challenging (e.g. a firewall between lifecycle and the collector
+  may still be perceived as a lifecycle malfunction).
+
+There is an [RFC for this alternative](https://github.com/buildpacks/rfcs/pull/300).
 
 # Prior Art
 [prior-art]: #prior-art
@@ -177,10 +330,6 @@ Discuss prior art, both the good and bad.
 shouldn't be a part of the build result image.
 
 
-- What parts of the design do you expect to be resolved before this gets merged?
-- What parts of the design do you expect to be resolved through implementation of the feature?
-- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
-
 # Spec. Changes (OPTIONAL)
 [spec-changes]: #spec-changes