Update RFC with more info and address feedback
joshwlewis committed Apr 24, 2024
1 parent a463510 commit b327ca9
Showing 1 changed file with 192 additions and 43 deletions.
235 changes: 192 additions & 43 deletions text/0000-build-observability.md
Expand Up @@ -12,10 +12,10 @@
# Summary
[summary]: #summary

This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to
grant platform operators and buildpack operators more insight into buildpack
performance and behavior. This RFC describes new opt-in functionality
for both pack and the buildpack spec such that OpenTelemetry data may be
exported to the build file system.

# Definitions
Expand All @@ -29,17 +29,20 @@ exported to the build file system.
# Motivation
[motivation]: #motivation

Buildpack authors and platform operators desire insight into usage and
performance of builds and buildpacks on their platform. Questions like
"How long does each buildpack compile phase take?", "Which buildpacks
commonly fail to compile?", "How often is a certain buildpack used?",
"Which versions of Go are being installed?", and "How long does it take to
download node_modules?" are important questions for authors and operators that
are currently difficult to answer.
Buildpack authors and platform operators desire insight into usage, error
scenarios, and performance of builds and buildpacks on their platform. The
following questions are all important for these folks, but difficult to answer:

- "Which buildpacks commonly fail to compile?"
- "How often does a particular error scenario occur?"
- "How long does each buildpack compile phase take?"
- "How often is a certain buildpack used?"
- "Which versions of Go are being installed?"
- "How long does it take to download node_modules?"

Instrumenting lifecycle and buildpacks with opt-in OpenTelemetry tracing will
allow platform operators to better understand performance and behavior of their
builds and buildpacks and as a result, provide better service and build
experiences.

To protect privacy and prevent unnecessary collection of data, this
Expand All @@ -50,8 +53,8 @@ functionality should be optional and anonymous.

This RFC aims to provide a solution for two types of OpenTelemetry traces:

1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were
available, which buildpacks were detected, how long the detect, build, or
export phase took, and so on. This telemetry data may be exported by lifecycle.
2) Buildpack tracing: Telemetry data specific to a buildpack like how long it
took to download a language binary, which language version was selected, and so
Expand All @@ -61,20 +64,21 @@ Though the sources and contents of the telemetry data differ, both types may
be emitted to the build file system in OpenTelemetry's [File Exporter
Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/).

In this solution, each lifecycle phase would write a `.jsonl` file with
tracing data for that phase. For example, `lifecycle detector --telemetry`
would write to `/cnb/telemetry/lifecycle-detect.jsonl`. Additionally each
buildpack may also write tracing data to its own `.jsonl` files (at
`/cnb/telemetry/{BUILDPACK_ID}.jsonl`).

For example, `lifecycle detector --telemetry` might save a file like this:

```json
{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
{ // additional spans... // }
```

And a buildpack's compile phase might save a file like this:
These `.jsonl` files may be read by platform operators for consumption,
transformation, enrichment, and/or export to an OpenTelemetry backend. Given
that builds may crash or fail at any point, these files must be written to
frequently to prevent data loss.

```json
{"resourceSpans":[{"resource":{"attributes":[{"key":"buildpack.version","value":{"stringValue":"1.0.0"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"install-nodejs","startTimeUnixNano":"1581452772000001321","endTimeUnixNano":"1581452773000004789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000002123","name":"restored-from-cache"}],"attributes":[{"key":"nodejs.version","value":{"stringValue":"20.0.0"}}]}]}]}]}
{ // additional spans... // }
```
Platform operators will likely want to view or analyze this data. These
telemetry files are in an OTLP-compatible format, so they may be exported to one or
more OpenTelemetry backends like Honeycomb, Prometheus, and [many
others](https://opentelemetry.io/ecosystem/vendors/).


# How it Works
Expand All @@ -84,39 +88,159 @@ And a buildpack's compile phase might save a file like this:

If `lifecycle` is provided the telemetry opt-in flag (such as `--telemetry`),
`lifecycle` phases (such as `detect`, `build`, `export`) may emit an
OpenTelemetry File Export with tracing data to a known location, such as
`/cnb/telemetry/lifecycle-detect.jsonl` with contents like this:

```json
{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
{ // additional spans... // }
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "lifecycle.version",
            "value": {
              "stringValue": "0.17.1"
            }
          }
        ]
      },
      "scopeSpans": [
        {
          "scope": {},
          "spans": [
            {
              "traceId": "",
              "spanId": "",
              "parentSpanId": "",
              "name": "buildpack-detect",
              "startTimeUnixNano": "1581452772000000321",
              "endTimeUnixNano": "1581452773000000789",
              "droppedAttributesCount": 2,
              "events": [
                {
                  "timeUnixNano": "1581452773000000123",
                  "name": "detect-pass"
                }
              ],
              "attributes": [
                {
                  "key": "buildpack-id",
                  "value": {
                    "stringValue": "heroku/nodejs-engine"
                  }
                }
              ],
              "droppedEventsCount": 1
            }
          ]
        }
      ]
    }
  ]
}
```


### Buildpack telemetry files

During a buildpack's `detect` or `build` execution, a buildpack may emit
an OpenTelemetry File Export with tracing data to `/cnb/telemetry/#{buildpack-id}.jsonl`
with contents like this:

```json
{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
{ // additional spans... // }
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "lifecycle.version",
            "value": {
              "stringValue": "0.17.1"
            }
          }
        ]
      },
      "scopeSpans": [
        {
          "scope": {},
          "spans": [
            {
              "traceId": "",
              "spanId": "",
              "parentSpanId": "",
              "name": "buildpack-detect",
              "startTimeUnixNano": "1581452772000000321",
              "endTimeUnixNano": "1581452773000000789",
              "droppedAttributesCount": 2,
              "events": [
                {
                  "timeUnixNano": "1581452773000000123",
                  "name": "detect-pass"
                }
              ],
              "attributes": [
                {
                  "key": "buildpack-id",
                  "value": {
                    "stringValue": "heroku/nodejs-engine"
                  }
                }
              ],
              "droppedEventsCount": 1
            }
          ]
        }
      ]
    }
  ]
}
```

### Lifetime

The telemetry files may be written at any point during the build. They should
exist as a part of the build file system for the duration of the build.
Telemetry files will not be included in the final image.
Telemetry files may be written at any point during the build, so that they
are persisted in cases of failures to detect, failures to build, process
terminations, or crashes. The `jsonl` format allows telemetry libraries to
safely append additional json objects to the end of a telemetry file, so
telemetry data can be flushed to the file frequently. Telemetry files should
not be truncated or deleted so that telemetry processing by a platform can
happen during or after a build. Telemetry files should not be included in the
build result, as they are not relevant, and would likely negatively impact
image size and reproducibility.
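
The append-and-flush behavior described above could be sketched roughly as
follows. This is only an illustration, not part of the proposal: the helper
names `append_telemetry` and `demo` are hypothetical, and a real
implementation would more likely rely on an OpenTelemetry file exporter than
on hand-written JSON.

```python
import json
import os
import tempfile


def append_telemetry(path: str, record: dict) -> None:
    """Append one JSON object as a single line and push it to disk."""
    line = json.dumps(record, separators=(",", ":"))
    # Mode "a" only adds to the end of the file, so earlier lines are
    # never truncated or rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        # Ask the OS to persist the data so it survives a later crash.
        os.fsync(f.fileno())


def demo() -> list:
    """Write two records to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, modeled on the paths used in this RFC.
        path = os.path.join(tmp, "lifecycle-detect.jsonl")
        span = {"name": "buildpack-detect"}
        append_telemetry(path, {"resourceSpans": [{"scopeSpans": [{"spans": [span]}]}]})
        append_telemetry(path, {"resourceSpans": []})
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Because every record is a complete line, a reader that processes the file
during the build can parse whatever lines are already present.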

### Access

The telemetry files should remain readable so that they may be analyzed by
the user and/or platform. However, they should be write protected in some way to prevent
malicious buildpacks from injecting tracing data into other buildpack's
telemetry file.
The telemetry files should be readable so that they may be analyzed by
the user and/or platform. However, they should be write protected
to prevent malicious buildpacks from injecting tracing data into other
buildpack or lifecycle telemetry files.


### Consumption

This RFC leaves the consumption of telemetry files to the platform operator.
Platform operators choosing to use these metrics need to read them either during
or after the build. This can be done using existing OpenTelemetry libraries.
Platform operators may choose to optionally enrich or modify the tracing data
as they see fit (with data like `instance_id` or `build_id`). Platform
operators will likely want to export this data to an OpenTelemetry backend for
persistence and analysis, and again, this may be done with existing
OpenTelemetry libraries.
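
As a rough sketch of the enrichment step, a platform could parse each line of
a telemetry file and attach its own resource attributes before exporting.
The function names `enrich_line` and `enrich_lines` and the `build_id` value
are assumptions made for illustration; the RFC does not prescribe how
enrichment is done, and only string attribute values are handled here.

```python
import json


def enrich_line(line: str, extra: dict) -> dict:
    """Parse one telemetry line and add platform attributes to each resource."""
    record = json.loads(line)
    for resource_span in record.get("resourceSpans", []):
        resource = resource_span.setdefault("resource", {})
        attributes = resource.setdefault("attributes", [])
        for key, value in extra.items():
            # Attributes use the same key/value shape as the examples above.
            attributes.append({"key": key, "value": {"stringValue": str(value)}})
    return record


def enrich_lines(lines, extra: dict) -> list:
    """Enrich every non-empty line of a telemetry file."""
    return [enrich_line(line, extra) for line in lines if line.strip()]
```

The enriched records could then be handed to whichever exporter the platform
already uses.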

### Viewing and Analyzing

Once the lifecycle and buildpack traces are exported to an OpenTelemetry
backend, platform operators should be able to (depending on the features of the
backend):

- View the complete trace for a build
- View or query attributes attached to spans (e.g. `buildpack_id`,
`nodejs_version`)
- View or query span durations
- View or query error types and/or messages
- and more
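
For instance, span durations can be derived directly from the
`startTimeUnixNano` and `endTimeUnixNano` fields shown in the examples above.
The sketch below is illustrative only: `span_durations_ns` is a hypothetical
helper, and an OpenTelemetry backend would normally compute this itself.

```python
def span_durations_ns(record: dict) -> dict:
    """Map each span name in one telemetry record to its duration in nanoseconds."""
    durations = {}
    for resource_span in record.get("resourceSpans", []):
        for scope_span in resource_span.get("scopeSpans", []):
            for span in scope_span.get("spans", []):
                # Timestamps are encoded as decimal strings of nanoseconds.
                start = int(span["startTimeUnixNano"])
                end = int(span["endTimeUnixNano"])
                durations[span["name"]] = end - start
    return durations
```

Applied to the `buildpack-detect` span from the example file, this yields
1000000468 nanoseconds, or roughly one second.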

# Migration
[migration]: #migration
Expand All @@ -142,6 +266,13 @@ design:
usernames, IP addresses, etc.), so the telemetry data emitted by `lifecycle`
will also be free of user-identifiable data.

### File Export Format Status

While the [File Exporter
Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/) is
an official format, and matches the OTLP format nearly exactly (and thus seems
unlikely to change), it is listed as experimental status.

# Alternatives
[alternatives]: #alternatives

Expand All @@ -154,6 +285,28 @@ provide statistical information in aggregate. Since `lifecycle` and `pack`
only run one build at a time, there is no way to aggregate information about
multiple builds in `pack` or `lifecycle`.

### OTLP

The [OpenTelemetry Protocol](https://opentelemetry.io/docs/specs/otlp/) is a
network delivery protocol for OpenTelemetry data. Instead of emitting files as
this RFC describes, lifecycle and buildpacks could instead connect to an
OpenTelemetry collector provided by the platform operator. This pattern is
well supported and well known.

However, there are drawbacks:

- In local `pack build` scenarios, it's unlikely that users would have an
OpenTelemetry collector running. This RFC solution does not require a
collector.
- lifecycle and buildpacks would need to know where the OpenTelemetry collector
is and how to authenticate with it. Lifecycle and buildpacks that wish to
emit telemetry may not want to deal with the mountain of configuration to
support various collectors.
- Platform operators may have complex network topology that may make supporting
this feature challenging (e.g. a firewall between lifecycle and the collector
may still be perceived as a lifecycle malfunction).

There is an [RFC for this alternative](https://github.com/buildpacks/rfcs/pull/300).

# Prior Art
[prior-art]: #prior-art
Expand All @@ -177,10 +330,6 @@ Discuss prior art, both the good and bad.
shouldn't be a part of the build result image.


- What parts of the design do you expect to be resolved before this gets merged?
- What parts of the design do you expect to be resolved through implementation of the feature?
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?

# Spec. Changes (OPTIONAL)
[spec-changes]: #spec-changes
