Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collectors self-telemetry pipelines. #1431

Merged
merged 21 commits into from
Aug 12, 2024
Merged

Conversation

RonFed
Copy link
Collaborator

@RonFed RonFed commented Aug 6, 2024

Send collector metrics from node and cluster collectors.
The goal is to have metrics about the data sent and throughput from sources and to destinations.

The main parts of this PR are:

  • Collector configuration for collecting self-telemetry. This includes definning the self-telemetry pipeline in the node and cluster collectors.
  • odigostrafficmetrics which is a processor that additional collects metrics to those that are already provided by the collector. This processor can be configured to attach different attributes to the metrics. In addition, it is possible to configure the processor to perform its measurements on a fraction of the spans/metrics/logs.
  • collectormetrics package in the UI server which acts as an OTLP receiver, and exposes endpoints for the UI.

@RonFed RonFed changed the title Collector metrics Collectors self-telemetry pipelines. Aug 8, 2024
@RonFed RonFed marked this pull request as ready for review August 8, 2024 14:26
.github/workflows/e2e/bats/traces-orig.json Outdated Show resolved Hide resolved
autoscaler/controllers/datacollection/configmap.go Outdated Show resolved Hide resolved
autoscaler/controllers/datacollection/configmap.go Outdated Show resolved Hide resolved
autoscaler/controllers/gateway/configmap.go Outdated Show resolved Hide resolved
@tamirdavid1
Copy link
Collaborator

Maybe worth to add an option to disable the metrics via flag/odigos-config field?
So we will have the flexibility to disable in case any issue occurs

Comment on lines 9 to 12
// Label used to identify the Odigos pod which is acting as a node collector.
OdigosNodeCollectorLabel = "odigos.io/data-collection"
// Label used to identify the Odigos pod which is acting as a cluster collector.
OdigosClusterCollectorLabel = "odigos.io/gateway"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure before this change was:

  • chunk of things related to cluster collector
  • chunk of things related to node collector

Do we want to change it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to separate to node and cluster collector

k8sutils/pkg/consts/consts.go Outdated Show resolved Hide resolved
opampserver/go.mod Outdated Show resolved Hide resolved
scheduler/go.mod Outdated Show resolved Hide resolved
cli/cmd/resources/ui.go Outdated Show resolved Hide resolved

func (p *dataSizesMetricsProcessor) processTraces(ctx context.Context, td ptrace.Traces) (ptrace.Traces, error) {
if p.samplingFraction != 0 && rand.Float64() < p.samplingFraction {
p.traceSize.Add(ctx, int64(p.tracesSizer.TracesSize(td)) * p.inverseSamplingFraction, metric.WithAttributes(p.traceAttributes(td)...))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.tracesSizer.TracesSize(td) needs to encode the traces to proto.
Can it have performance hit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the reason for the SamplingRation config of the processor

// ResourceAttributesKeys is a list of resource attributes keys that will be used to add labels for the metrics.
ResourceAttributesKeys []string `mapstructure:"res_attributes_keys"`

// SamplingRatio is the ratio of payloads that are measured. Values between 0.0 and 1.0 are valid.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the ratio of payloads that are measured"? perhaps add the motivation for why to use it and how to choose the value for it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment here PTAL

@@ -138,6 +140,13 @@ func startHTTPServer(flags *Flags) (*gin.Engine, error) {
apis.PUT("/actions/types/RenameAttribute/:id", func(c *gin.Context) { actions.UpdateRenameAttribute(c, flags.Namespace, c.Param("id")) })
apis.DELETE("/actions/types/RenameAttribute/:id", func(c *gin.Context) { actions.DeleteRenameAttribute(c, flags.Namespace, c.Param("id")) })

// Metrics
apis.GET("/metrics/namespace/:namespace/kind/:kind/name/:name", func(c *gin.Context) { endpoints.GetSingleSourceMetrics(c, odigosMetrics) })
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be under /metrics/source? to make it explicit

Comment on lines 35 to 37
tracesDataSent int64
logsDataSent int64
metricsDataSent int64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the units? bytes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added comments

return err
}

if strings.HasPrefix(senderPod, k8sconsts.OdigosNodeCollectorDaemonSetName) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think relying on the pod name to classify it as node / cluster is risky.
A more robust way would be to populate another resource attribute that record the role of the collector which produced the data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however, we also use the pod name as an identifier for the collector - so it has 2 goals.

Copy link
Contributor

@edeNFed edeNFed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Collaborator

@blumamir blumamir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Few last nits :)

Comment on lines 5 to 10
const (
// Cluster collector is responsible for exporting observability data from the cluster.
ClusterCollector CollectorType = "cluster"
// Node collector is receiving data from different instrumentation SDKs in the same node.
NodeCollector CollectorType = "node"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, because we already have this here:

// +kubebuilder:validation:Enum=CLUSTER_GATEWAY;NODE_COLLECTOR
type CollectorsGroupRole string

const (
	CollectorsGroupRoleClusterGateway CollectorsGroupRole = "CLUSTER_GATEWAY"
	CollectorsGroupRoleNodeCollector  CollectorsGroupRole = "NODE_COLLECTOR"
)

and this here:

if (cg.Spec.Role == odigosv1.CollectorsGroupRoleNodeCollector || cg.Spec.Role == "DATA_COLLECTION") && cg.Status.Ready {

Should we use the same names? and should we share the enum everywhere? now or later

cli/go.mod Outdated Show resolved Hide resolved
@RonFed RonFed merged commit 7c8d508 into odigos-io:main Aug 12, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants