Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Add distributed tracing framework #1211

Closed
wants to merge 5 commits into from

Conversation

bboreham
Copy link
Contributor

@bboreham bboreham commented Oct 9, 2020

Created in support of kubernetes-sigs/cluster-api#3760
This doc goes into more detail and is probably the best place to discuss.

What is in this PR:

  • A wrapper for client.Client which will log all Kubernetes API calls to the current trace.
  • A wrapper for logr.Logger which will add all log messages to a trace.
  • Support for passing trace span context from one program to another via an Annotation (as described in this KEP)

This PR does not create any tracing spans. You would use it in a controller like this (error handling elided):

func main() {
	...
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		...
		NewClient: tracing.WrapRuntimeClient(util.ManagerDelegatingClientFunc),
	...
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Result, rerr error) {
	myObject := &MyObject{}
	r.Client.Get(ctx, req.NamespacedName, myObject)

	ctx, sp, log := tracing.FromObject(ctx, "Reconcile", myObject)
	if sp != nil {
		defer sp.Finish()
	}
	...
}

The code is written using the OpenTracing API and Jaeger; I will look at porting it to OpenTelemetry.

Functions to wrap a Client and add trace span annotations to objects.

Based on OpenTracing, with a sample implementation for Jaeger.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 9, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @bboreham!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @bboreham. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2020
@yue9944882
Copy link
Member

@bboreham thank you for proposing/bringing open-telemetry integration to this library! i was looking for this for a period of time and also had a few PoC works before (in some of our internal projects) which is very close/similar to the current state of this pull. i think overall the proposed change is fairly intuitive but based on my humble experience from our practice, i think there're yet a few things not addressed in this pull:

  • the trace never closes: the controller can be syncing resources by processing all items periodically even if there's no change on any resources --- that will create unbounded # of child spans which even makes it hard to visualize if there's multiple controllers working under the same trace. so the trace need to be enclosed properly if the resource successfully reach its desired state. and another tricky part is that if we're closing the trace by removing or editing the annotation, that will trigger a new reconciliation upon the resource, will that count into the trace?

  • sorting/dedup'ing context propagation: consider cases where we're tracing cascading controller instances (as the following image roughly shows a A->B->C controller topology). i think the spans need to properly de-dup'd otherwise the visualization (on jaeger) can be confusing due to extra verbosity.

Screen Shot 2020-10-09 at 8 47 01 PM

and as far as i know this KEP was completely refactored to merely do the tracing for incoming requests at server side instead of controller frameworks, is it? the proposed change from this pull seems to be conflict w/ the latest KEP b/c having tracing at both server and client side can be be redundant i suppose.. cc @lavalamp

@vincepri
Copy link
Member

vincepri commented Oct 9, 2020

cc @alvaroaleman @DirectXMan12

@bboreham
Copy link
Contributor Author

bboreham commented Oct 9, 2020

Hi @yue9944882, lot of great points there. I'll respond in pieces:

and as far as i know this KEP was completely refactored to merely do the tracing for incoming requests at server side

Yes, I noticed that today (change was made 8 days ago).

instead of controller frameworks, is it?

Nit: it specified kube-apiserver, kube-scheduler, kube-controller-manager and kubelet, (most of?) which don't use the controller-runtime framework. But your point is a good one: I would now like to resurrect much of the text that was removed.

the proposed change from this pull seems to be conflict w/ the latest KEP

I think they are complementary.

@bboreham
Copy link
Contributor Author

bboreham commented Oct 9, 2020

the trace never closes

Yes, this is called out in the linked Google doc.

the trace need to be enclosed properly if the resource successfully reach its desired state

Great thought. I think for many specific cases this can be identified by a single controller.

if we're closing the trace by removing or editing the annotation, that will trigger a new reconciliation upon the resource

Also true. I wonder if it can be fixed by adding something to Status? There isn't any handy bag-of-values field like Annotations, but maybe can be achieved as a Condition?

Further, suppose the observed state changes; we'd expect a reconcillation which (given the above) would start a new span.

@bboreham
Copy link
Contributor Author

bboreham commented Oct 9, 2020

sorting/dedup'ing context propagation

Sorry I don't get this one at all, can you please explain more?

The text on your image is too small to see; here's a grab of the graph I get for a Cluster-API trace:
image

@dashpole
Copy link

dashpole commented Oct 9, 2020

cc @DirectXMan12
We discussed this a looooong time ago

@vincepri
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2020
@k8s-ci-robot
Copy link
Contributor

@bboreham: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-controller-runtime-test-master e3408a7 link /test pull-controller-runtime-test-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@DirectXMan12
Copy link
Contributor

(minor point of order: this would be a feature PR, not infra)

Copy link
Contributor

@DirectXMan12 DirectXMan12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat so far, need to read through it a bit more detail later/catch up on the discussion.

One of the things I've also played around with in the past is hooking the event emitter into this pipeline as well, so we get k8s events showing up too.


func (t tracingLogger) Error(err error, msg string, keysAndValues ...interface{}) {
t.Logger.Error(err, msg, keysAndValues...)
t.Span.LogKV(append([]interface{}{"error", err, "message", msg}, keysAndValues...)...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to avoid needing an extra allocation here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, is there a semantic difference between WithTags and the key-value pairs specified here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That LogKV() call is through an interface so the memory will escape; it's unlikely to be possible to avoid an allocation. This particular line is only hit under error conditions so probably not performance-sensitive, but there are similar ones.

Tags apply to the whole span so it wouldn't be right to put details from one error as a tag, however the next line sets an error tag on the span so you can pick it out as one that had at least one error.

@bboreham
Copy link
Contributor Author

hooking the event emitter into this pipeline

I want to do that, but it looks awkward, since the existing code takes it from the controller, and there is no Context parameter to the event recorder interface. Unless you know a way?

@bboreham
Copy link
Contributor Author

Code in this PR is now using OpenTelemetry instead of OpenTracing.

One point of detail: span contexts inside objects now look like this:

  annotations:
    trace.kubernetes.io/traceparent: 00-99a904aa21ed09e78201d8b1e5ed7d44-a3901ad9eae5035c-01

which is a divergence from the (now-deleted) KEP I followed previously.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 29, 2020
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2021
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2021
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 3, 2021
@k8s-triage-robot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Copy link
Member

dims commented Feb 24, 2022

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Feb 24, 2022
@k8s-ci-robot
Copy link
Contributor

@dims: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bboreham
To complete the pull request process, please assign droot after the PR has been reviewed.
You can assign the PR to them by writing /assign @droot in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bboreham
Copy link
Contributor Author

Hi, I haven’t done much on this for a long time; do you need me to update it?

@dims
Copy link
Member

dims commented Feb 25, 2022

@bboreham yes please!

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fstrudel
Copy link

/reopen
@bboreham any chance you can make this MR merged? We'd love to be able to use it/help you.

@k8s-ci-robot
Copy link
Contributor

@fstrudel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
@bboreham any chance you can make this MR merged? We'd love to be able to use it/help you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bboreham
Copy link
Contributor Author

/reopen

Sorry I dropped this again.
I have to say it's unlikely I have time until after KubeCon.

@k8s-ci-robot
Copy link
Contributor

@bboreham: Reopened this PR.

In response to this:

/reopen

Sorry I dropped this again.
I have to say it's unlikely I have time until after KubeCon.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Apr 27, 2022
func SetupJaeger(serviceName string) (io.Closer, error) {
// Create and install Jaeger export pipeline
flush, err := jaeger.InstallNewPipeline(
jaeger.WithCollectorEndpoint("http://jaeger-agent.default:14268/api/traces"), // FIXME name?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow to specify the endpoint? (parameter/env variable/other?)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akselleirv
Copy link

Is there a possibility that this PR will be merged? It appears to be the closest effort to adding tracing support in controller-runtime.

@Mershab99
Copy link

It would be awesome to see distributed tracing be available to operators!

@bboreham is there anything left to do to push this over the goal line?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.