
Instrument Jaeger Operator #737

Closed · jpkrohling opened this issue Oct 29, 2019 · 4 comments · Fixed by #738

@jpkrohling
Contributor

Instrument the Jaeger Operator so that it can create a trace per reconciliation.
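For illustration, here's a minimal sketch of what a trace per reconciliation could look like, written against today's OpenTelemetry Go API (the client available in 2019 had a different surface); the `jaegerReconciler` type and span name are hypothetical, not the operator's actual code:

```go
package controller

import (
	"context"

	"go.opentelemetry.io/otel"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var tracer = otel.Tracer("jaeger-operator")

// jaegerReconciler is a hypothetical reconciler, standing in for the
// operator's real one.
type jaegerReconciler struct{}

func (r *jaegerReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Start a root span for this reconciliation loop; child spans created
	// from ctx (provisioning, upgrades, sidecar injection) attach to it.
	ctx, span := tracer.Start(ctx, "reconcile")
	defer span.End()

	// ... actual reconciliation work, passing ctx down ...
	_ = ctx
	return reconcile.Result{}, nil
}
```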

@jpkrohling
Contributor Author

The referenced PR has a first draft for this, but there are a couple of things to figure out, especially regarding the usage of a Jaeger Agent as a sidecar to the Operator's deployment.

I have talked with @objectiser about this, and here are some thoughts:

  • The agent is an extra piece of infrastructure to maintain. Ideally, we would use the gRPC exporter instead of Thrift, but I'm not sure it's available in that client yet. Using Thrift isn't desirable here
  • gRPC for the transport is critical, as we want to retry sending the spans at a later time. This is important because the bootstrap spans are reported before the Jaeger instance is provisioned, as seen in the first screenshot of the linked PR
  • One concern about using the agent is upgrades. I couldn't think of a concrete problem, only theoretical risks. The biggest one I could think of is that we get hit by some backwards-incompatible change, making our (newer) agent's version incompatible with the (older) collector. The versions would diverge only during upgrades. Such a backwards incompatibility should never happen, as it would also affect all the other workloads in the cluster: when the operator is upgraded, it will upgrade all instances it manages, but not the clients (or even the agents, I believe)
  • @objectiser mentioned that we could more or less bypass this problem by having the all-in-one as a sidecar. I think it's something to consider, but I'm not sure it's better than having an agent as the sidecar
  • My idea is to have tracing enabled at all times. To keep the overhead low, we could change the provisioned instance to use in-memory storage with a max-traces of, say, 1000. This way, data about the last 1000 reconciliation loops would be stored, while older, no longer relevant data is overwritten. I'd say that a limit of 256MiB of memory and 0.5 cores is more than enough for that instance, but I have yet to try that
  • About the upgrade itself: just like we keep the Operator's version as part of the image name in operator.yaml, we can keep the agent's version as part of the image name. When releasing a new operator, we bump this version as well
  • Right now, three flags are added as part of the PR: jaeger-agent-hostport, provision-own-instance, and tracing-enabled. The first is for when a daemonset agent is used (making the sidecar unnecessary). The second defaults to true; setting it to false means an externally provisioned Jaeger instance is used (perhaps created manually by applying a CR), in which case the Operator's deployment spec has to be changed so that the agent points to this other Jaeger instance. When the third flag is set to false, the operator will not add any exporters to the OpenTelemetry client, so the sidecar can be safely removed. A sketch of how these flags might be wired is shown after this list
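As referenced in the last bullet, a hedged sketch of how those three flags might be declared, using the standard library's flag package; the actual PR may use a different flag library, and the defaults shown (other than provision-own-instance, which the comment says defaults to true) are assumptions:

```go
package main

import (
	"flag"
	"log"
)

var (
	// Address of an existing (e.g. daemonset) agent; when set, no sidecar is needed.
	agentHostPort = flag.String("jaeger-agent-hostport", "", "host:port of an existing Jaeger Agent")

	// Whether the operator provisions a Jaeger instance for its own traces (default true per the comment).
	provisionOwnInstance = flag.Bool("provision-own-instance", true, "provision a Jaeger instance for the operator's own traces")

	// Master switch for the operator's self-tracing; default here is an assumption.
	tracingEnabled = flag.Bool("tracing-enabled", true, "export a trace per reconciliation")
)

func main() {
	flag.Parse()
	log.Printf("agent=%q provisionOwn=%v tracing=%v", *agentHostPort, *provisionOwnInstance, *tracingEnabled)
}
```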

@jpkrohling
Contributor Author

We talked about this one today and agreed that we'd have the agent there by default for the moment, especially because there's no gRPC support yet in the open-telemetry Go client. Right before 1.16, we'll decide whether to keep it on by default or turn it off. We'll probably leave it off for 1.16 to see the effects of the instrumentation in production, and turn it on by default for 1.17.
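For context, a minimal sketch of pointing a tracer provider at an agent sidecar over UDP/Thrift, using the OTel Go Jaeger exporter API as it later stabilized in go.opentelemetry.io/otel/exporters/jaeger (since deprecated) rather than whatever the 2019 client offered:

```go
package main

import (
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing() {
	// The agent sidecar listens on localhost inside the operator's pod;
	// 6831/udp is the agent's compact-Thrift port.
	exp, err := jaeger.New(jaeger.WithAgentEndpoint(
		jaeger.WithAgentHost("localhost"),
		jaeger.WithAgentPort("6831"),
	))
	if err != nil {
		log.Fatal(err)
	}
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)))
}

func main() {
	initTracing()
}
```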

Another aspect we discussed was the ability to customize the operator's deployment: it looks like it's not easy to even set a different logging level for an operator without publishing an alternative CSV. In our case, we need the ability to customize the operator's deployment, as I believe the self-provisioned Jaeger instance will not be used that often in production, so people running the operator will need a way to customize the sidecar with the actual collector's URL. Apparently, we are not the only ones in need of such a feature:

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/contributors/design-proposals/operator-config.md#scenario-enable-debug-logs-for-an-operator

@objectiser
Contributor

@jpkrohling The other issue we discussed was whether there are any problems with defining the agent sidecar in the operator's deployment in the CSV.

@jpkrohling
Contributor Author

The CSV's .Install.Deployments node is always supposed to contain deployments:

https://github.com/operator-framework/operator-lifecycle-manager/blob/4cb99dae84723f2d6aff267a7b8fcd54577c1040/pkg/controller/install/deployment.go#L29-L32

For the moment, I'll assume that this is part of the contract and that it should work, but I'll make sure to do a manual e2e test to ensure that a sidecar in the deployment is not something that causes problems with OLM.
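For reference, a rough paraphrase of the OLM types at the linked lines; the field names here are reconstructed from memory of that code and may not match the pinned commit exactly. The point is that each CSV deployment carries a full appsv1.DeploymentSpec, so an agent sidecar container is structurally allowed in its pod template:

```go
package install

import (
	appsv1 "k8s.io/api/apps/v1"
)

// StrategyDeploymentSpec names one deployment owned by the CSV's install strategy.
type StrategyDeploymentSpec struct {
	Name string                `json:"name"`
	Spec appsv1.DeploymentSpec `json:"spec"`
}

// StrategyDetailsDeployment is the install strategy body: a list of
// deployments, each of which can carry arbitrary containers -- including
// an agent sidecar -- in its pod template.
type StrategyDetailsDeployment struct {
	DeploymentSpecs []StrategyDeploymentSpec `json:"deployments"`
}
```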
