
Instrument Jaeger Operator #737

Closed · jpkrohling opened this issue Oct 29, 2019 · 4 comments · Fixed by #738

@jpkrohling
Contributor

Instrument the Jaeger Operator so that it can create a trace per reconciliation.
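For illustration, here's a minimal sketch of what a trace per reconciliation could look like, written against today's OpenTelemetry Go API (the client available in 2019 had a different surface); the `jaegerReconciler` type and span name are hypothetical, not the operator's actual code:

```go
package controller

import (
	"context"

	"go.opentelemetry.io/otel"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var tracer = otel.Tracer("jaeger-operator")

// jaegerReconciler is a hypothetical reconciler, standing in for the
// operator's real one.
type jaegerReconciler struct{}

func (r *jaegerReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Start a root span for this reconciliation loop; child spans created
	// from ctx (provisioning, upgrades, sidecar injection) attach to it.
	ctx, span := tracer.Start(ctx, "reconcile")
	defer span.End()

	// ... actual reconciliation work, passing ctx down ...
	_ = ctx
	return reconcile.Result{}, nil
}
```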

@jpkrohling
Contributor Author

The referenced PR has a first draft for this, but there are a couple of things to figure out, especially regarding the usage of a Jaeger Agent as a sidecar to the Operator's deployment.

I have talked with @objectiser about this, and here are some thoughts:

  • The agent is an extra piece of infrastructure to maintain. Ideally, we would use the gRPC exporter instead of Thrift, but I'm not sure it's available in that client yet. Using Thrift isn't desirable here
  • gRPC for the transport is critical, as we want to retry sending the spans at a later time. This is important because the bootstrap spans are reported before the Jaeger instance is provisioned, as seen in the first screenshot of the linked PR
  • One concern about using the agent is upgrades. I couldn't think of a concrete problem, only theoretical risks. The biggest one I could think of is that we get hit by some backwards-incompatible change, making our (newer) agent's version incompatible with the (older) collector. The versions would diverge only during upgrades. Such a backwards incompatibility should never happen, as it would also affect all the other workloads in the cluster: when the operator is upgraded, it will upgrade all instances it manages, but not the clients (or even the agents, I believe)
  • @objectiser mentioned that we could more or less bypass this problem by having the all-in-one as a sidecar. I think it's something to consider, but I'm not sure it's better than having an agent as the sidecar
  • My idea is to have tracing enabled at all times. To keep the overhead low, we could change the provisioned instance to use in-memory storage with a max-traces of, say, 1000. This way, data about the last 1000 reconciliation loops would be stored, while older, no longer relevant data is overwritten. I'd say that a limit of 256MiB of memory and 0.5 cores is more than enough for that instance, but I have yet to try that
  • About the upgrade itself: just like we keep the Operator's version as part of the image name in operator.yaml, we can keep the agent's version as part of the image name. When releasing a new operator, we bump this version as well
  • Right now, three flags are added as part of the PR: jaeger-agent-hostport, provision-own-instance, and tracing-enabled. The first is for when a daemonset agent is used (making the sidecar unnecessary). The second defaults to true; setting it to false means an externally provisioned Jaeger instance is used (perhaps created manually by applying a CR), in which case the Operator's deployment spec has to be changed so that the agent points to this other Jaeger instance. When the third flag is set to false, the operator will not add any exporters to the OpenTelemetry client, so the sidecar can be safely removed. A sketch of how these flags might be wired is shown after this list
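As referenced in the last bullet, a hedged sketch of how those three flags might be declared, using the standard library's flag package; the actual PR may use a different flag library, and the defaults shown (other than provision-own-instance, which the comment says defaults to true) are assumptions:

```go
package main

import (
	"flag"
	"log"
)

var (
	// Address of an existing (e.g. daemonset) agent; when set, no sidecar is needed.
	agentHostPort = flag.String("jaeger-agent-hostport", "", "host:port of an existing Jaeger Agent")

	// Whether the operator provisions a Jaeger instance for its own traces (default true per the comment).
	provisionOwnInstance = flag.Bool("provision-own-instance", true, "provision a Jaeger instance for the operator's own traces")

	// Master switch for the operator's self-tracing; default here is an assumption.
	tracingEnabled = flag.Bool("tracing-enabled", true, "export a trace per reconciliation")
)

func main() {
	flag.Parse()
	log.Printf("agent=%q provisionOwn=%v tracing=%v", *agentHostPort, *provisionOwnInstance, *tracingEnabled)
}
```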

@jpkrohling
Contributor Author

We talked about this one today and agreed that we'd have the agent there by default for the moment, especially because there's no gRPC support yet in the open-telemetry Go client. Right before 1.16, we'll decide whether to keep it on by default or turn it off. We'll probably leave it off for 1.16 to see the effects of the instrumentation in production, and turn it on by default for 1.17.
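For context, a minimal sketch of pointing a tracer provider at an agent sidecar over UDP/Thrift, using the OTel Go Jaeger exporter API as it later stabilized in go.opentelemetry.io/otel/exporters/jaeger (since deprecated) rather than whatever the 2019 client offered:

```go
package main

import (
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing() {
	// The agent sidecar listens on localhost inside the operator's pod;
	// 6831/udp is the agent's compact-Thrift port.
	exp, err := jaeger.New(jaeger.WithAgentEndpoint(
		jaeger.WithAgentHost("localhost"),
		jaeger.WithAgentPort("6831"),
	))
	if err != nil {
		log.Fatal(err)
	}
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)))
}

func main() {
	initTracing()
}
```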

Another aspect we discussed was the ability to customize the operator's deployment: it looks like it's not easy to even set a different logging level for an operator without publishing an alternative CSV. In our case, we need the ability to customize the operator's deployment, as I believe the self-provisioned Jaeger instance will not be used that often in production, so people running the operator will need a way to customize the sidecar with the actual collector's URL. Apparently, we are not the only ones in need of such a feature:

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/contributors/design-proposals/operator-config.md#scenario-enable-debug-logs-for-an-operator

@objectiser
Contributor

@jpkrohling The other issue we discussed was whether there are any problems with defining the agent sidecar in the operator's deployment in the CSV.

@jpkrohling
Contributor Author

The CSV's .Install.Deployments node is always supposed to contain deployments:

https://github.com/operator-framework/operator-lifecycle-manager/blob/4cb99dae84723f2d6aff267a7b8fcd54577c1040/pkg/controller/install/deployment.go#L29-L32

For the moment, I'll assume that this is part of the contract and that it should work, but I'll make sure to do a manual e2e test to ensure that a sidecar in the deployment is not something that causes problems with OLM.
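For reference, a rough paraphrase of the OLM types at the linked lines; the field names here are reconstructed from memory of that code and may not match the pinned commit exactly. The point is that each CSV deployment carries a full appsv1.DeploymentSpec, so an agent sidecar container is structurally allowed in its pod template:

```go
package install

import (
	appsv1 "k8s.io/api/apps/v1"
)

// StrategyDeploymentSpec names one deployment owned by the CSV's install strategy.
type StrategyDeploymentSpec struct {
	Name string                `json:"name"`
	Spec appsv1.DeploymentSpec `json:"spec"`
}

// StrategyDetailsDeployment is the install strategy body: a list of
// deployments, each of which can carry arbitrary containers -- including
// an agent sidecar -- in its pod template.
type StrategyDetailsDeployment struct {
	DeploymentSpecs []StrategyDeploymentSpec `json:"deployments"`
}
```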
