Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry interceptors to capture traces from gRPC communications #3502

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

atoulme
Copy link
Contributor

@atoulme atoulme commented Jun 22, 2022

See #2954 - was closed for inactivity while #2997 was pending.

Type of change

  • New feature

Description

Adds OpenTelemetry tracing by adding interceptors on all gRPC communications.

Related issues

This is tied to https://github.com/hyperledger/fabric-rfcs/blob/main/text/0000-opentelemetry-tracing.md

@atoulme atoulme requested a review from a team as a code owner June 22, 2022 21:31
@atoulme atoulme force-pushed the otel branch 2 times, most recently from 917b57a to 2b9efb7 Compare June 22, 2022 23:26
@SamYuan1990
Copy link
Contributor

SamYuan1990 commented Jun 23, 2022

as a contributor, I just want to ask are we going to put packages in vendor folder? or just go.mod?

@atoulme
Copy link
Contributor Author

atoulme commented Jun 23, 2022

See https://hyperledger-fabric.readthedocs.io/en/latest/style-guides/go-style.html#adding-or-updating-go-packages

…ions

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
@atoulme
Copy link
Contributor Author

atoulme commented Jul 7, 2022

Is anyone available to review?

Copy link
Contributor

@SamYuan1990 SamYuan1990 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, LGTM.
but I want to know

  1. is unit test enough or we need test together with any sdk(node/java) which already enabled opentelemetry?
  2. ref to grpclogging and grpcmetric do we have any further plan to add more feature or logic for grpctracing in future?

@atoulme
Copy link
Contributor Author

atoulme commented Jul 8, 2022

  1. No that is unrelated - tracing is distributed, each software piece will report on its own.
  2. not at this time, see the RFC https://github.com/hyperledger/fabric-rfcs/blob/main/text/0000-opentelemetry-tracing.md

@atoulme
Copy link
Contributor Author

atoulme commented Jul 11, 2022

@SamYuan1990 please review and approve.

@SamYuan1990
Copy link
Contributor

@SamYuan1990 please review and approve.

image

sorry @atoulme , I don't have the permission. As I said, so far, this PR LGTM.

@SamYuan1990
Copy link
Contributor

Or @atoulme , you got my approve for this pr. even if I can't make it as limited by permission, but I suppose I can do something by leaving this message to support add opentelemetry into fabric.

@atoulme
Copy link
Contributor Author

atoulme commented Jul 12, 2022

@SamYuan1990 I appreciate you getting back to me and helping vet the PR. It's deeply appreciated. I believe I have sent an email to the fabric mailing list asking for a committer to take a look, and didn't hear back yet.

@SamYuan1990
Copy link
Contributor

@denyeart , @jkneubuh , @yacovm please help with @atoulme

},
UnaryInterceptors: []grpc.UnaryServerInterceptor{
grpcmetrics.UnaryServerInterceptor(grpcmetrics.NewUnaryMetrics(metricsProvider)),
grpclogging.UnaryServerInterceptor(
flogging.MustGetLogger("comm.grpc.server").Zap(),
grpclogging.WithLeveler(grpclogging.LevelerFunc(grpcLeveler)),
),
grpctracing.UnaryServerInterceptor(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point in having this? We only have streams in orderer, we don't have RPC calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows reporting external communications by the software, tagging each outgoing request with a trace ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. what outgoing request? this is a server side interceptor is it not?

@@ -661,13 +663,15 @@ func initializeServerConfig(conf *localconfig.TopLevel, metricsProvider metrics.
StreamInterceptors: []grpc.StreamServerInterceptor{
grpcmetrics.StreamServerInterceptor(grpcmetrics.NewStreamMetrics(metricsProvider)),
grpclogging.StreamServerInterceptor(flogging.MustGetLogger("comm.grpc.server").Zap()),
grpctracing.StreamServerInterceptor(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what information is reported as telemetry as a result of you adding this interceptor? I am under the impression that only stream opening and termination.

If so, then how is this helpful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to record interactions with other servers, and tag outgoing and incoming requests with a trace ID so we can recreate the exchange of data between peers by collecting data from the consortium members.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you perhaps give an example? I don't understand what is the type of information that is collected. Did you by chance experiment and can share what data is collected and presented?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example does the trace ID correspond to the stream or to a single message that passes through the stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example here - not specific to those grpc messages, but tracing calls to contracts https://www.youtube.com/watch?v=NvDE1cnK_3I
Please refer to the RFC as well https://github.com/hyperledger/fabric-rfcs/blob/main/text/0000-opentelemetry-tracing.md

Copy link
Contributor

@yacovm yacovm Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I watched your demo, but I don't understand what information is reported by these stream interceptors. The RFC also doesn't mention this.

@jkneubuh
Copy link
Contributor

Hi @atoulme

Adding a tracking / tracing / instrumentation / observability foundation to Fabric is an incredible contribution. Thank you for advancing this PR and RFC, it is both timely and incredibly relevant. There have been several discussions and efforts underway to characterize, measure, and improve the overall network throughput and transaction processing rates for Fabric, all of which will require a systematic, high-altitude view of system observability and custom metric aggregation. In addition to gRPC level trace monitoring, there have been some recent discussions around the need / opportunity to inject trace-level or function-level call tracking of core Fabric routines to profile, isolate, and resolve system bottlenecks.

To help advance this effort, would you consider presenting the material, pros/cons, benefits, and impacts in a context in a forum that is more suitable for an interactive discussion? This PR represents a "landscape shifting" moment for observability in Fabric networks, which in my opinion, warrants more than a single PR for the contribution. There are several active projects starting to look at Fabric (and overlay / Level 2) networks under a performance lens, to which this addition is directly relevant. In addition to the mechanics of landing this (or related PRs), the teams looking at throughput, i/o, and performance optimization need to be aware of the benefits of system-wide observability made available with an Open Telemetry integration.

Two good opportunities for socialization of the PR include:

  1. Fabric community contributor meetings. @denyeart : would you consider allocating time for Antoine and his team to present the OpenTelemetry integration hooks at our next scheduled (or next available) community call? The general team would really benefit from "seeing" some of the outcomes from this PR, rather than a direct inspection of the code and additional dependencies.

  2. We are in the process of incubating a Hyperledger Technical Working Group / Task Force to converge on architecture patterns to realize a Cloud Native Fabric runtime. Current topics for the Task Force include container orchestration (e.g. Kubernetes Operators), Mesh overlay networks (e.g. Istio/Linkerd), and x509 / TLS certificate management. Would you be interested in contributing to the WG / Task Force as a representative for observability and system-level monitoring?

@atoulme
Copy link
Contributor Author

atoulme commented Jul 13, 2022

@jkneubuh thank you for the kind words. I did present at a contributor meeting over a year ago. The time of the meeting is far from ideal for me as it takes place at 6am. I am not keen on presenting there again. I do however have a webinar with the Hyperledger Foundation to discuss this effort with Fabric and Besu.
I have also opened a RFC and there was extensive discussion of the merits of this improvement there. Overall, this process has been extremely slow. I don't understand some of the feedback on this PR - this PR itself has been opened for months and is quite straightforward, in line with what the RFC recommends.
In your point 2., who is we? I feel this is not a good place to ask me for more time or contribution. Let's stay focused on the PR please. Please email me or contact me on Discord.

)

func UnaryServerInterceptor() grpc.UnaryServerInterceptor {
return otelgrpc.UnaryServerInterceptor()
Copy link
Contributor

@yacovm yacovm Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought some more and I think we should add an option to opt-in in the configuration (core.yaml and orderer.yaml).

I am not sure it's a good idea to load the otelgrpc by default, all the time.
It's true that you configure it with environment variables but let's minimize the potential surface area risk.

@SamYuan1990
Copy link
Contributor

https://github.com/cncf/tag-observability/blob/main/whitepaper.md#executive-summary
attachment cncf document for reference, in this document shows how Observability works and help with performance enhancement?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants