🌱 Add distributed tracing framework #1211

bboreham · 2020-10-09T12:03:21Z

Created in support of kubernetes-sigs/cluster-api#3760
This doc goes into more detail and is probably the best place to discuss.

What is in this PR:

A wrapper for client.Client which will log all Kubernetes API calls to the current trace.
A wrapper for logr.Logger which will add all log messages to a trace.
Support for passing trace span context from one program to another via an Annotation (as described in this KEP)

This PR does not create any tracing spans. You would use it in a controller like this (error handling elided):

func main() {
	...
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		...
		NewClient: tracing.WrapRuntimeClient(util.ManagerDelegatingClientFunc),
	...
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Result, rerr error) {
	myObject := &MyObject{}
	r.Client.Get(ctx, req.NamespacedName, myObject)

	ctx, sp, log := tracing.FromObject(ctx, "Reconcile", myObject)
	if sp != nil {
		defer sp.Finish()
	}
	...
}

The code is written using the OpenTracing API and Jaeger; I will look at porting it to OpenTelemetry.

Functions to wrap a Client and add trace span annotations to objects. Based on OpenTracing, with a sample implementation for Jaeger. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

k8s-ci-robot · 2020-10-09T12:03:29Z

Welcome @bboreham!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2020-10-09T12:03:29Z

Hi @bboreham. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yue9944882 · 2020-10-09T12:54:46Z

@bboreham thank you for proposing/bringing open-telemetry integration to this library! i was looking for this for a period of time and also had a few PoC works before (in some of our internal projects) which is very close/similar to the current state of this pull. i think overall the proposed change is fairly intuitive but based on my humble experience from our practice, i think there're yet a few things not addressed in this pull:

the trace never closes: the controller can be syncing resources by processing all items periodically even if there's no change on any resources --- that will create unbounded # of child spans which even makes it hard to visualize if there's multiple controllers working under the same trace. so the trace need to be enclosed properly if the resource successfully reach its desired state. and another tricky part is that if we're closing the trace by removing or editing the annotation, that will trigger a new reconciliation upon the resource, will that count into the trace?
sorting/dedup'ing context propagation: consider cases where we're tracing cascading controller instances (as the following image roughly shows a A->B->C controller topology). i think the spans need to properly de-dup'd otherwise the visualization (on jaeger) can be confusing due to extra verbosity.

and as far as i know this KEP was completely refactored to merely do the tracing for incoming requests at server side instead of controller frameworks, is it? the proposed change from this pull seems to be conflict w/ the latest KEP b/c having tracing at both server and client side can be be redundant i suppose.. cc @lavalamp

vincepri · 2020-10-09T14:32:09Z

cc @alvaroaleman @DirectXMan12

bboreham · 2020-10-09T17:13:29Z

Hi @yue9944882, lot of great points there. I'll respond in pieces:

and as far as i know this KEP was completely refactored to merely do the tracing for incoming requests at server side

Yes, I noticed that today (change was made 8 days ago).

instead of controller frameworks, is it?

Nit: it specified kube-apiserver, kube-scheduler, kube-controller-manager and kubelet, (most of?) which don't use the controller-runtime framework. But your point is a good one: I would now like to resurrect much of the text that was removed.

the proposed change from this pull seems to be conflict w/ the latest KEP

I think they are complementary.

bboreham · 2020-10-09T17:18:33Z

the trace never closes

Yes, this is called out in the linked Google doc.

the trace need to be enclosed properly if the resource successfully reach its desired state

Great thought. I think for many specific cases this can be identified by a single controller.

if we're closing the trace by removing or editing the annotation, that will trigger a new reconciliation upon the resource

Also true. I wonder if it can be fixed by adding something to Status? There isn't any handy bag-of-values field like Annotations, but maybe can be achieved as a Condition?

Further, suppose the observed state changes; we'd expect a reconcillation which (given the above) would start a new span.

bboreham · 2020-10-09T17:23:29Z

sorting/dedup'ing context propagation

Sorry I don't get this one at all, can you please explain more?

The text on your image is too small to see; here's a grab of the graph I get for a Cluster-API trace:

dashpole · 2020-10-09T21:28:44Z

cc @DirectXMan12
We discussed this a looooong time ago

vincepri · 2020-10-12T15:49:01Z

/ok-to-test

k8s-ci-robot · 2020-10-12T15:51:18Z

@bboreham: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-controller-runtime-test-master	`e3408a7`	link	`/test pull-controller-runtime-test-master`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

DirectXMan12 · 2020-10-13T20:43:34Z

(minor point of order: this would be a feature PR, not infra)

DirectXMan12

Neat so far, need to read through it a bit more detail later/catch up on the discussion.

One of the things I've also played around with in the past is hooking the event emitter into this pipeline as well, so we get k8s events showing up too.

DirectXMan12 · 2020-10-13T20:45:05Z

pkg/tracing/logger.go

+
+func (t tracingLogger) Error(err error, msg string, keysAndValues ...interface{}) {
+	t.Logger.Error(err, msg, keysAndValues...)
+	t.Span.LogKV(append([]interface{}{"error", err, "message", msg}, keysAndValues...)...)


is there a way to avoid needing an extra allocation here?

also, is there a semantic difference between WithTags and the key-value pairs specified here?

That LogKV() call is through an interface so the memory will escape; it's unlikely to be possible to avoid an allocation. This particular line is only hit under error conditions so probably not performance-sensitive, but there are similar ones.

Tags apply to the whole span so it wouldn't be right to put details from one error as a tag, however the next line sets an error tag on the span so you can pick it out as one that had at least one error.

bboreham · 2020-10-14T16:44:20Z

hooking the event emitter into this pipeline

I want to do that, but it looks awkward, since the existing code takes it from the controller, and there is no Context parameter to the event recorder interface. Unless you know a way?

bboreham · 2020-10-29T15:27:40Z

Code in this PR is now using OpenTelemetry instead of OpenTracing.

One point of detail: span contexts inside objects now look like this:

  annotations:
    trace.kubernetes.io/traceparent: 00-99a904aa21ed09e78201d8b1e5ed7d44-a3901ad9eae5035c-01

which is a divergence from the (now-deleted) KEP I followed previously.

fejta-bot · 2021-06-03T12:05:54Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-07-03T12:47:49Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-triage-robot · 2021-08-02T13:36:56Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-08-02T13:37:01Z

@k8s-triage-robot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dims · 2022-02-24T22:04:35Z

/reopen

k8s-ci-robot · 2022-02-24T22:04:47Z

@dims: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-02-24T22:05:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bboreham
To complete the pull request process, please assign droot after the PR has been reviewed.
You can assign the PR to them by writing /assign @droot in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bboreham · 2022-02-25T07:55:28Z

Hi, I haven’t done much on this for a long time; do you need me to update it?

dims · 2022-02-25T15:03:44Z

@bboreham yes please!

k8s-triage-robot · 2022-03-27T15:07:12Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-03-27T15:07:25Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fstrudel · 2022-04-26T17:12:22Z

/reopen
@bboreham any chance you can make this MR merged? We'd love to be able to use it/help you.

k8s-ci-robot · 2022-04-26T17:12:31Z

@fstrudel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
@bboreham any chance you can make this MR merged? We'd love to be able to use it/help you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bboreham · 2022-04-27T09:10:32Z

/reopen

Sorry I dropped this again.
I have to say it's unlikely I have time until after KubeCon.

k8s-ci-robot · 2022-04-27T09:10:42Z

@bboreham: Reopened this PR.

In response to this:

/reopen

Sorry I dropped this again.
I have to say it's unlikely I have time until after KubeCon.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fstrudel · 2022-04-27T23:48:41Z

pkg/tracing/jaeger.go

+func SetupJaeger(serviceName string) (io.Closer, error) {
+	// Create and install Jaeger export pipeline
+	flush, err := jaeger.InstallNewPipeline(
+		jaeger.WithCollectorEndpoint("http://jaeger-agent.default:14268/api/traces"), // FIXME name?


Should we allow to specify the endpoint? (parameter/env variable/other?)

k8s-triage-robot · 2022-05-27T23:54:14Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-05-27T23:54:24Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

akselleirv · 2024-02-06T20:34:02Z

Is there a possibility that this PR will be merged? It appears to be the closest effort to adding tracing support in controller-runtime.

Mershab99 · 2024-09-17T22:52:22Z

It would be awesome to see distributed tracing be available to operators!

@bboreham is there anything left to do to push this over the goal line?

bboreham added 3 commits October 8, 2020 15:01

Add tracing framework

6148b41

Functions to wrap a Client and add trace span annotations to objects. Based on OpenTracing, with a sample implementation for Jaeger. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

Connect logging to tracing

1371ba8

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

Try to get GroupVersionKind to log Create()

e3408a7

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 9, 2020

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2020

k8s-ci-robot requested review from alvaroaleman and joelanford October 9, 2020 12:03

bboreham mentioned this pull request Oct 9, 2020

Consider using OpenTelemetry for metrics instead of Prometheus #305

Open

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2020

DirectXMan12 reviewed Oct 13, 2020

View reviewed changes

bboreham added 2 commits October 29, 2020 15:19

Convert from OpenTracing to OpenTelemetry

d4bc91a

Update go.mod after conversion to OpenTelemetry

ec80672

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 29, 2020

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2021

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2021

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 3, 2021

k8s-ci-robot closed this Aug 2, 2021

k8s-ci-robot reopened this Feb 24, 2022

k8s-ci-robot closed this Mar 27, 2022

k8s-ci-robot reopened this Apr 27, 2022

fstrudel reviewed Apr 27, 2022

View reviewed changes

STRRL mentioned this pull request May 3, 2022

[Question] Integration with tracing #1876

Closed

k8s-ci-robot closed this May 27, 2022

🌱 Add distributed tracing framework #1211

🌱 Add distributed tracing framework #1211

Conversation

bboreham commented Oct 9, 2020

k8s-ci-robot commented Oct 9, 2020

k8s-ci-robot commented Oct 9, 2020

yue9944882 commented Oct 9, 2020

vincepri commented Oct 9, 2020

bboreham commented Oct 9, 2020

bboreham commented Oct 9, 2020

bboreham commented Oct 9, 2020

dashpole commented Oct 9, 2020

vincepri commented Oct 12, 2020

k8s-ci-robot commented Oct 12, 2020

DirectXMan12 commented Oct 13, 2020

DirectXMan12 left a comment • edited Loading

Choose a reason for hiding this comment

DirectXMan12 Oct 13, 2020

Choose a reason for hiding this comment

DirectXMan12 Oct 13, 2020

Choose a reason for hiding this comment

bboreham Oct 14, 2020

Choose a reason for hiding this comment

bboreham commented Oct 14, 2020

bboreham commented Oct 29, 2020

fejta-bot commented Jun 3, 2021

fejta-bot commented Jul 3, 2021

k8s-triage-robot commented Aug 2, 2021

k8s-ci-robot commented Aug 2, 2021

dims commented Feb 24, 2022

k8s-ci-robot commented Feb 24, 2022

k8s-ci-robot commented Feb 24, 2022

bboreham commented Feb 25, 2022

dims commented Feb 25, 2022

k8s-triage-robot commented Mar 27, 2022

k8s-ci-robot commented Mar 27, 2022

fstrudel commented Apr 26, 2022

k8s-ci-robot commented Apr 26, 2022

bboreham commented Apr 27, 2022

k8s-ci-robot commented Apr 27, 2022

fstrudel Apr 27, 2022

Choose a reason for hiding this comment

k8s-triage-robot commented May 27, 2022

k8s-ci-robot commented May 27, 2022

akselleirv commented Feb 6, 2024

Mershab99 commented Sep 17, 2024

DirectXMan12 left a comment •

edited

Loading