
[connector/spanmetricsconnector] Generated counter drops then disappears #33421

Closed
duc12597 opened this issue Jun 7, 2024 · 13 comments
Labels
bug Something isn't working connector/spanmetrics needs triage New item requiring triage

Comments

@duc12597

duc12597 commented Jun 7, 2024

Component(s)

connector/spanmetrics

What happened?

Description

Our collector receives OTLP traces from Kafka, converts them into metrics, and exports them to a TSDB. After a certain period of collector uptime (24-48 hours), the generated calls_total counter suffers a significant drop in value; eventually no metrics are exported at all.
(screenshot: calls_total drop)

Steps to Reproduce

Run the collector with the configuration below.

Expected Result

The calls_total counter increases monotonically.

Actual Result

The calls_total counter drops then disappears.

Collector version

v0.101.0

Environment information

Environment

AWS EKS 1.24

OpenTelemetry Collector configuration

extensions:
  sigv4auth:
    region: ap-southeast-1
    service: "aps"
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 03-sink-metric-prometheus
          scrape_interval: 10s
          static_configs:
            - targets: ['127.0.0.1:8888']
  kafka/traces:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_spans
    group_id: 03-sink-metric-prometheus
  kafka/metrics:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_metrics
    group_id: 03-sink-metric-prometheus
processors:
  filter:
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
  transform:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # reduce the cardinality of metrics with params
          - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.target
      - name: http.status_code
      - name: host.name
      - name: myCustomLabel
    exclude_dimensions:
      - span.kind
      - span.name
      - status.code
    exemplars:
      enabled: true
    metrics_flush_interval: 15s
exporters:
  debug:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.ap-southeast-1.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write
    auth:
      authenticator: sigv4auth
    external_labels:
      cluster_name: my-cluster
      collector: 03-sink-metric-prometheus
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s
    send_metadata: true
    max_batch_size_bytes: 3000000
service:
  telemetry:
    metrics:
      address: 127.0.0.1:8888
      level: detailed
  extensions:
    - sigv4auth
  pipelines:
    traces:
      receivers:
        - kafka/traces
      processors: []
      exporters:
        - spanmetrics
    metrics:
      receivers:
        - kafka/metrics
        - prometheus
        - spanmetrics
      processors:
        - filter
        - transform
      exporters:
        - debug
        - prometheusremotewrite

Log output

2024-06-07T01:38:51.776Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", 
"error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 58510}   
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/consumers.go:43

Additional context

  • Our application uses the HyperTrace Java agent to send telemetry data to Kafka in OTLP format.
  • The problem persists across different TSDBs (AWS Prometheus, self-hosted Prometheus, Mimir) and different numbers of collector replicas (1 and 3).
@duc12597 duc12597 added bug Something isn't working needs triage New item requiring triage labels Jun 7, 2024
Contributor

github-actions bot commented Jun 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@ankitpatel96
Contributor

I have a few questions that might help us track down this issue: Is there any chance your collector is restarting at these points? Are you running just one collector or many in a gateway mode?

@duc12597
Author

I'm running the collector as a deployment and have tried both 1 and 3 replicas. The collectors did not restart; I had to terminate the pods myself to get metrics exporting again.

@ankitpatel96
Contributor

I see... honestly at this point I don't quite know what would cause it to eventually stop emitting metrics at all - that's the symptom that is really throwing me for a loop.

Are you still having these problems? Can you try increasing resource_metrics_cache_size? The thought is that this might prevent evictions which might prevent the resets.
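For reference, a minimal sketch of where that setting lives (the value is only an example; pick it based on how many distinct resources you expect):

connectors:
  spanmetrics:
    # a larger cache means fewer evicted resource-metrics entries,
    # and evictions are what surface as unexpected counter resets
    resource_metrics_cache_size: 10000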

Other things that might help us track down this problem: what is the count of unique series within calls_total over time? Are the resets happening for series that the TSDB has already received, or are entirely new series appearing?

@duc12597
Author

This is count(calls_total) at approximately the time the counter decreases:
(screenshot)

@duc12597
Author

Further observation shows that, of the 3 metrics receivers in my collector configuration, kafka/metrics and prometheus worked fine:
(screenshot)

Only metrics from spanmetrics failed:
(screenshot)

@ankitpatel96
Contributor

Thanks for your update. Did you try changing the cache size? I'm honestly a little stumped; any ideas @portertech @Frapschen?

@swar8080
Contributor

swar8080 commented Jul 3, 2024

With the current config the connector permanently caches every series it sees and sends them all during each flush, even the ones where nothing has changed.

So eventually the payload flushed to prometheusremotewrite gets so large that the remote-write request times out (context deadline exceeded is a timeout), and the request likely also gets rejected by the remote-write target because of its size:

Permanent error: context deadline exceeded"}], "dropped_items": 58510}   

Possible things that could help (a hedged config sketch follows this list):

  • Setting metrics_expiration on the connector so that infrequently updated span metrics are removed. You then have to deal with Prometheus counter resets.
  • Breaking up the remote-write requests into smaller batches, possibly using the batch processor and/or prometheusremotewrite's built-in batching config.
  • Switching to scraping the span metrics with Prometheus, since it is optimized for a large number of series.
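A hedged sketch of how those suggestions could fit together (all values are illustrative, not recommendations):

connectors:
  spanmetrics:
    metrics_flush_interval: 15s
    metrics_expiration: 30m             # drop series not updated for 30m; expect counter resets
processors:
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192           # cap the batch size handed to the exporter
exporters:
  prometheusremotewrite:
    endpoint: https://<remote-write-endpoint>
    max_batch_size_bytes: 1000000       # split remote-write requests larger than ~1 MB
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]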

@duc12597
Author

I set metrics_expiration: 30m, but the metrics still disappeared altogether. They returned after ~6 hours, even though the collectors did not restart.
(screenshot)

@Frapschen
Contributor

@duc12597 Have you tried switching from the push model to pull? Replace your prometheusremotewrite exporter with the prometheus exporter.
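A minimal sketch of what the pull-based setup could look like (port and expiration are example values):

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889      # expose a /metrics scrape endpoint
    metric_expiration: 10m      # stop exposing series not updated for 10m
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]

Your Prometheus would then scrape port 8889 instead of receiving remote-write pushes.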

@duc12597
Author

@duc12597 Have you tried switching from the push model to pull? Replace your prometheusremotewrite exporter with the prometheus exporter.

We will consider this option. As of now the collector has been running for 2 weeks without any errors, although there are still counter fluctuations. I'm not sure whether that is thanks to any changes on our side. I will close this issue for now and re-open it in the future if the problem resurfaces.

This is my complete collector manifest:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: 03-sink-metric-prometheus
spec:
  image: mirror.gcr.io/otel/opentelemetry-collector-contrib:0.102.0
  replicas: 5
  nodeSelector:
    mycompany.com/service: observability
    kubernetes.io/arch: amd64
  tolerations:
    - effect: NoSchedule
      key: mycompany.com/service
      value: observability
      operator: Equal
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 03-sink-metric-prometheus
              scrape_interval: 10s
              static_configs:
                - targets: ['127.0.0.1:8888']
      kafka/traces:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_spans
        group_id: 03-sink-metric-prometheus
      kafka/metrics:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_metrics
        group_id: 03-sink-metric-prometheus
    processors:
      filter:
        error_mode: ignore
        metrics:
          datapoint:
            - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
      transform:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              # reduce the cardinality of metrics with params
              - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
    connectors:
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.target
          - name: http.status_code
          - name: host.name
          - name: myCustomLabel
        exclude_dimensions:
          - span.kind
          - span.name
          - status.code
        exemplars:
          enabled: true
        metrics_flush_interval: 15s
        metrics_expiration: 1h
        resource_metrics_key_attributes:
          - service.name
          - telemetry.sdk.language
          - telemetry.sdk.name
        resource_metrics_cache_size: 10000
    exporters:
      debug:
      prometheusremotewrite:
        endpoint: http://mimir-nginx/api/v1/push
        send_metadata: true
    service:
      telemetry:
        metrics:
          address: 127.0.0.1:8888
          level: detailed
      extensions:
        - sigv4auth
      pipelines:
        traces:
          receivers:
            - kafka/traces
          processors: []
          exporters:
            - spanmetrics
        metrics:
          receivers:
            - kafka/metrics
            - prometheus
            - spanmetrics
          processors:
            - filter
            - transform
          exporters:
            - debug
            - prometheusremotewrite
  env:
    - name: GOMEMLIMIT
      value: 1640MiB # 80% of resources.limits.memory
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 2Gi

@Frapschen
Contributor

@duc12597 Sorry for pinging you; there is a related issue for the counter fluctuation, please see #34126 (comment) for the fix.

@duc12597
Author

duc12597 commented Aug 7, 2024

@duc12597 Sorry for pinging you; there is a related issue for the counter fluctuation, please see #34126 (comment) for the fix.

If I understand correctly, this will add a UUID as a label for every metric generated by each collector pod. Will this explode the cardinality? Why does a UUID solve the fluctuation? Can you give an example config?

Thanks a ton.
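Purely as an illustration of that idea (the label name collector_id and the POD_NAME variable are assumptions here, and the approach described in #34126 may differ): each collector replica could stamp a unique external label onto the series it pushes, so replicas no longer overwrite each other's counters.

exporters:
  prometheusremotewrite:
    endpoint: http://mimir-nginx/api/v1/push
    external_labels:
      collector_id: ${env:POD_NAME}   # unique per replica

And in the OpenTelemetryCollector spec, the pod name injected via the downward API:

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name

This multiplies the series count by at most the number of replicas, so cardinality grows, but only linearly with the number of collectors.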
