Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for otel tracing of grpc calls #140

Conversation

Fricounet
Copy link
Contributor

@Fricounet Fricounet commented Jul 31, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Refactor Connect to rely more on Options to configure how the gRPC client connect to the server. Configuring how connect behaves relies on new functions like withMetrics or withTimeout that need to be passed as options at the caller level. This will make it easier to add new configuration options without breaking the api in the future.
  • Create a new WithOtelTracing option to be used by gRPC client when connecting to the CSI server with Connect. This function adds a gRPC interceptor that will provide basic tracing of the gRPC calls.
    Once it is merged and released, I will go through the different CSI sidecars to allow the use of this option behind a feature flag --enable-otel-tracing.

Below is an example of what it looks like when using this change on a CSI sidecar. I have otel tracing activated on the CSI driver (server) and the sidecar (client). We can see a trace composed of 2 spans with their duration.
image

Which issue(s) this PR fixes:

Fixes #139

Special notes for your reviewer:

I split the PR in 2 commits (dependency change first and code change next) to make it easier to review. If you prefer to only merge one commit, I can squash commits once the review is done.

Does this PR introduce a user-facing change?:

Added WithOtelTracing option to connect to the CSI driver gRPC server with otel gRPC tracing interceptor

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 31, 2023
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 31, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Copy link
Contributor

Hi @Fricounet. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 31, 2023
@jarias-lfx
Copy link

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 31, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2023
@Fricounet Fricounet force-pushed the fricounet/upstream/otel-tracing-grpc branch from f7e2b97 to 5bd2d56 Compare August 10, 2023 12:16
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2023
@xing-yang
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 10, 2023
@mauriciopoppe
Copy link
Member

/cc

@Fricounet
Copy link
Contributor Author

Hey @gnufied, I see you did a PR recently #143 to allow disabling metrics on gRPC connection. You did something fairly similar to what I wanted to do here to add otel tracing. This made me realise that this approach lacks flexibility because we won't be able to have a gRPC connection with tracing without metrics for instance. Unless we write a separate function for each case which does not seem like a good idea.

So I think this part need to be refactored. I was thinking about using the Option type already available and create functions like withMetrics and WithOtelGrpcInterceptor that could be added as parameters when calling connect.
Without breaking the existing API, it could look something like this:

func Connect(address string, metricsManager metrics.CSIMetricsManager, options ...Option) (*grpc.ClientConn, error) {
        if metricsManager != nil {
                options = append(options, withMetrics(metricsManager))
        }
	return connect(address, []grpc.DialOption{grpc.WithTimeout(time.Second * 30)}, options)
}

For your use case, instead of doing: ConnectWithoutMetrics(address), you would do Connect(address, nil). What do you think about it?

@gnufied
Copy link
Contributor

gnufied commented Aug 14, 2023

Sure, Feel free to refactor your PR to design you proposed. And lets make sure this handles tracing changes you are introducing here too.

@Fricounet Fricounet changed the title Fricounet/upstream/otel tracing grpc Add support otel tracing of grpc calls Aug 16, 2023
@Fricounet Fricounet changed the title Add support otel tracing of grpc calls Add support for otel tracing of grpc calls Aug 16, 2023
Extend the existing `Option` functionality for connect to be more
flexible. Configuring how `connect` behaves relies on new functions like
`withMetrics` or `withTimeout` that need to be passed as options at the
caller level. This will make it easier to add new configuration options
without breaking the api.
Create a new `WithOtelTracing` option to be used by gRPC client when
connecting to the CSI server with `Connect`. This function adds a
gRPC interceptor that will provide basic tracing of the gRPC calls.
@Fricounet Fricounet force-pushed the fricounet/upstream/otel-tracing-grpc branch from 5bd2d56 to 46cded8 Compare August 16, 2023 12:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2023
@Fricounet
Copy link
Contributor Author

@gnufied I just pushed the refactor with the otel tracing on top if you want to take a look :)


// ConnectWithoutMetrics behaves exactly like Connect except no metrics are recorded.
func ConnectWithoutMetrics(address string, options ...Option) (*grpc.ClientConn, error) {
return connect(address, nil, []grpc.DialOption{grpc.WithTimeout(time.Second * 30)}, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't remove ConnectWithoutMetrics yet, because it means at least one or more user of library will be broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sure, I had not realised a new version was released with this change available so I thought nobody could use it. I will put it back with a deprecation warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the function back :)

if metricsManager != nil {
options = append(options, withMetrics(metricsManager))
}
return connect(address, options)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, a builder pattern might be more suitable for building options to connect. So something like:

type ConnectionBuilder struct {
	address           string
	metricsManager    metrics.CSIMetricsManager
	timeout           time.Duration
	enableOtelTracing bool
}

func NewConnectionBuilder(address string) *ConnectionBuilder {
	return &ConnectionBuilder{
		address: address,
	}
}

func (b *ConnectionBuilder) WithMetrics(mm metrics.CSIMetricsManager) *ConnectionBuilder {
	b.metricsManager = mm
	return b
}

func (b *ConnectionBuilder) WithTimeout(td time.Duration) *ConnectionBuilder {
	b.timeout = td
	return b
}

func (b *ConnectionBuilder) WithOtelTracing() *ConnectionBuilder {
	b.enableOtelTracing = true
	return b
}

func (b *ConnectionBuilder) Connect() (*grpc.ClientConn, error) {
	return connect(b.address, b.metricsManager,....)
}

So a caller would do something like:

connectionBuilder := NewConnectionBuilder("unix:////blah").WithMetrics(xxx).WithTimeout(xxx)
connection, err := connectionBuilder.Connect()

This might be more easier to extend in future. I do realize, this is new interface functions but it should be fine to use it. We can eventually deprecate both Connect and ConnectWithoutMetrics in future.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a strong opinion on this but here are some thoughts:

  • I am not a big fan of breaking the interface because it means people will have to update their code even if they do not want to change the behaviour of Connect. Considering csi-lib-utils is used in all sidecars and some csi-drivers that makes a fair amount of 'clients' that will have to update their code.
  • initially, I sticked to the Options pattern because I saw a similar patterns in other CSI repos (metrics in csi-lib-utils, aws-ebs-csi-driver, azuredisk-csi-driver, ...) so I felt like sticking to the same pattern was a good idea for code consistency.
  • on the other hand, I did not see any other builder pattern in CSI repos (if you have examples, feel free to share them) so it does not feel very consistent with the existing code.

But if you feel like a builder pattern is truly the way to go here, I'm fine with implementing it.
Also, maybe we can ask other folks' opinion on this? Not too familiar with who would be the right person though :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. I can probably live with function(option) pattern. I have a minor question below. Another thing is - we are about to start cutting CSI sidecars. Were you hoping to merge this in sidecars and start adding CLI option? Do you want to do this after this release or in current release? (I pinged you on slack with same question).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw supplying a variadic number of arguments for setting options vs builder pattern are basically two sides of the same coin and not all that much different. So basically:

foo = Connect("foo", withBar(), withBaz(), withFoo())

vs

foo = NewConnect("foo").WithBar().WithBaz().WithFoo()

I am not a fan of functions that takes too many parameters. Also it may just be me, but those functions that returns functions, only for setting a field in a option struct seems slightly complicated for its own good (it takes a minute for me to parse those), where as builder seems more straightforward.

I agree that we shouldn't break existing interfaces and we can leave Connect as it is, but I do tend to prefer the API of builder vs functions that takes options.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 23, 2023
return func(o *options) {
o.enableOtelTracing = true
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw why are withTimeout and withMetrics unexported but WIthOtelxxx is exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly because they were already set when calling Connect with the previous implementation and I did not want to let the user override it (because there was no need until now) and if the need were to arise, the functions could simply be exported.
But I can also directly export them and let them override the default I set in Connect if you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I exported the function and let them override the default set in Connect

To not break the interface for existing clients using the function.
If they are set, they will override the default values that are set when
calling `Connect`.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 24, 2023
@Fricounet
Copy link
Contributor Author

I think I addressed all your comments @gnufied

@gnufied
Copy link
Contributor

gnufied commented Aug 24, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2023
@xing-yang
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Fricounet, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7b02e8d into kubernetes-csi:master Aug 25, 2023
3 checks passed
@sunnylovestiramisu
Copy link
Contributor

Can we use a version of go.opentelemetry.io metrics that is compatible among component-base, csi-lib-util and external-provisioner?

@Fricounet
Copy link
Contributor Author

Hey @sunnylovestiramisu, from what I see, the versions used are as follow:

How do you want to go about it? Personally, I'd prefer all versions to be upgraded to v0.38.0, however I can try to raise a PR to downgrade the version to v0.31.0 in this repo if you prefer?

@Fricounet
Copy link
Contributor Author

Looking at if I can downgrade the version to v0.31.0 here, I fear that won't be possible. The dependency on go.opentelemetry.io/otel/metric was added in the otelgrpc package in version v0.37.0 but the version used for metric was already v0.34.0 so I feel like upgrading the dependencies in external-provisioner and component-base is the better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add otel trace instrumentation on gRPC calls
7 participants