Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing KEP: Move to Implementable #1458

Merged
merged 11 commits into from
Oct 8, 2020

Conversation

dashpole
Copy link
Contributor

cc @kubernetes/sig-api-machinery-misc @lavalamp @logicalhan @deads2k @liggitt @mikedanese

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jan 16, 2020
@dashpole dashpole changed the title Tracing KEP: Address feedback from sig-apimachinery [WIP] Tracing KEP: Move to Implementable Mar 16, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 16, 2020
@dashpole
Copy link
Contributor Author

I decided to repurpose this PR as the PR to move the KEP to implementable, and will include other changes to address feedback in it.

cc @kubernetes/sig-instrumentation-api-reviews

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Mar 16, 2020
@dashpole
Copy link
Contributor Author

/assign @brancz

@brancz
Copy link
Member

brancz commented Mar 24, 2020

Any particular reason this is [WIP]?

@dashpole dashpole changed the title [WIP] Tracing KEP: Move to Implementable Tracing KEP: Move to Implementable Mar 24, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2020
@dashpole
Copy link
Contributor Author

Removed the WIP tag. This is really a
/hold
for discussion on OT Collector.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2020
@bgrant0607 bgrant0607 removed their request for review March 24, 2020 19:14
@serathius
Copy link
Contributor

I think we should follow similar versioning conversion like other annotations e.g. "alpha.trace.kubernetes.io/context"

@dashpole
Copy link
Contributor Author

I think we should follow similar versioning conversion like other annotations e.g. "alpha.trace.kubernetes.io/context"

Makes sense. I'm going to wait until the discussion about updating annotations with subresource update is completed before updating.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 4, 2020
@dashpole
Copy link
Contributor Author

dashpole commented Oct 1, 2020

@BenTheElder can you take a pass on the testing paragraph? It should be straight-forward, but it would be good to get someone's eyes from sig-testing on it.

Copy link
Member

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for #647 then the dir should be: 647-apiserver-tracing

KEP-647, etc...

As we track via the issue number and issue 34 that you are referencing seems to be something... completely different. :)

@dashpole
Copy link
Contributor Author

dashpole commented Oct 8, 2020

@kikisdeliveryservice I updated the KEP number. The KEP was created when KEP numbers were based on a next kep number file...

keps/sig-instrumentation/647-apiserver-tracing/README.md Outdated Show resolved Hide resolved

### Test Plan

We will e2e test this feature by deploying an OpenTelemetry Collector on the master, and configure it to export traces using the [stdout exporter](https://github.com/open-telemetry/opentelemetry-go/tree/master/exporters/stdout), which logs the spans in json format. We can then verify that the logs contain our expected traces when making calls to the API Server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan this to ever be conformance tested? This doesn't sound easy to conformance test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be good to have this conformance tested. I added it as a GA requirement. With the current configuration options, we can run the OTel Collector as a service in the cluster instead of on the master node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, users likely won't be able to configure that config file. But yeah, I agree with testing via running this in the cluster.

## Production Readiness Survey

* Feature enablement and rollback
- How can this feature be enabled / disabled in a live cluster? **Feature-gate: APIServerTracing. The API Server must be restarted to enable/disable exporting spans.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the feature is GA and we remove the feature gate, how will one turn it off if it's not desired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specified that if the configuration file path isn't specified, tracing is disabled. Alternatively, we can add a specific configuration option in the config to specify if it is disabled.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah going off of the config file presence is fine.

keps/sig-instrumentation/647-apiserver-tracing/README.md Outdated Show resolved Hide resolved

We will wrap the API Server's http server and http clients with [otelhttp](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/net/http/otelhttp) to get spans for incoming and outgoing http requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below [WithRequestInfo](https://github.com/kubernetes/kubernetes/blob/9eb097c4b07ea59c674a69e19c1519f0d10f2fa8/staging/src/k8s.io/apiserver/pkg/server/config.go#L676) in the filter stack, as it must be after authentication and authorization, before the panic filter, and is closest in function to the WithRequestInfo filter.

Note that some clients of the API Server, such as webhooks, may make reentrant calls to the API Server. To gain the full benefit of tracing, such clients should propagate context with requests back to the API Server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand this, how do clients do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an extremely strong interest in apiserver knowing when an incoming API call is re-entrant, and ideally, what its parent call is. Additionally, I'd like this to work in a multiple apiserver setup. So:

  • Can the tracing header be reverse-engineered to figure this out?
  • Can I pack metadata into the tracing header to make this possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave an example of how a webhook would do that. The tracing header will tell you if it is a direct child of the API Server trace, but you can't tell if it is an indirect child (e.g. if the admission controller webhook adds a span in-between). You can pack metadata into the tracing header (well, it is a different header, but still), using "Baggage": https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md#baggage-api.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, love the name "Baggage"


This KEP proposes the use of the [OpenTelemetry tracing framework](https://opentelemetry.io/) to create and export spans to configured backends.

The API Server will use the [OpenTelemetry exporter format](https://github.com/open-telemetry/opentelemetry-proto), and the [OTlp exporter](https://github.com/open-telemetry/opentelemetry-go/tree/master/exporters/otlp#opentelemetry-collector-go-exporter) which can export traces. This format is easy to use with the [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector), which allows importing and configuring exporters for trace storage backends to be done out-of-tree in addition to other useful features.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm OK for approving this for alpha; for beta I think @deads2k should also approve and at least one of us actually needs to look at the OT code, I haven't...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I added a "revisit format" item for beta.

@evankanderson
Copy link

I just noticed this and the Structured Logging KEP.

I see this is listed as a non-goal:

Change metrics or logging (e.g. to support trace-metric correlation)

Having recently worked on observability for Knative, I discovered that having a common set of labels across all the observability components (metrics, logs, tracing) helped with both reducing boilerplate and increasing accuracy/coverage of labels. It's also really painful to align these after you've done a bunch of instrumentation work for each separately and downstream teams have taken dependencies on the specific labels.

(The interface I'm looking at extracts a logger from the context, and that logger can be augmented with trace labels before it is returned.)

It looks like KEP 1602 doesn't define many well-known keys (ts, v, err, msg), but the implementation (e.g. this PR https://github.com/kubernetes/kubernetes/pull/91576/files) seems to have a bunch more labels that might apply to tracing, including object, kind, objectUID, as well as some kind-specific strings such as service, node, and deployment.

https://github.com/kubernetes/kubernetes/pull/91773/files#diff-06ce45a2e81bbfd5d4ce7e3c67d49a84R47 also seems to introduce/use logTraceArgs that seems related.

@evankanderson
Copy link

(Don't consider my comment blocking, it's a drive-by because I was commenting on my experience and someone upstream pointed out this KEP.)

@dashpole
Copy link
Contributor Author

dashpole commented Oct 8, 2020

@evankanderson thanks for the feedback; you make good points. I definitely agree with the value of having the same set of labels. One problem right now is that we have two different standards: the long-standing kubernetes instrumentation guidelines and the OpenTelemetry semantic conventions. The kubernetes standards suggest using "pod", and the OpenTelemetry standards suggest "k8s.pod", for example. I don't have a good answer, but I'll bring it up for discussion at next week's instrumentation meeting.

@lavalamp
Copy link
Member

lavalamp commented Oct 8, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dashpole
Copy link
Contributor Author

dashpole commented Oct 8, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2020
@lavalamp
Copy link
Member

lavalamp commented Oct 8, 2020

I think we'll want to encourage clients as strongly as possible to pass those headers through.

Is it easy to describe in non-go terms what a client has to do? Hopefully just copy-paste some headers matching a particular name format?

@k8s-ci-robot k8s-ci-robot merged commit 7340d5b into kubernetes:master Oct 8, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 8, 2020
@dashpole dashpole deleted the tracing_updates branch October 8, 2020 22:29
@dashpole
Copy link
Contributor Author

dashpole commented Oct 8, 2020

Is it easy to describe in non-go terms what a client has to do? Hopefully just copy-paste some headers matching a particular name format?

Yes. This describes the trace context header: https://www.w3.org/TR/trace-context/. And this describes the baggage header: https://w3c.github.io/baggage/. So you would have to copy the traceparent, baggage, and tracestate headers.

But in general, OpenTelemetry has broad language support, which should be easier than manually copying headers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.