
[statsdreceiver] incorrect aggregation when several producers report the same metrics #23809

Closed
sigilioso opened this issue Jun 28, 2023 · 13 comments
Labels
bug (Something isn't working), closed as inactive, receiver/statsd (statsd related issues), Stale

Comments

@sigilioso (Contributor)

Component(s)

receiver/statsd

What happened?

Description

The metrics reported through the statsd receiver are not correctly aggregated when they come from different client addresses and are not decorated with any attribute identifying the client. Instead of aggregating metrics with the same type and the same set of attribute keys and values, as described in the docs, the receiver only aggregates metrics that additionally share the same origin address, yet it does not add any attribute identifying that address.

As I see it, this is a common scenario: metrics coming from different producers (or even from the same producer over different connections) do not include specific attributes identifying the producer.

If the k8sattributesprocessor is used to decorate the metrics, this is not an issue, since the metrics will carry attributes identifying their origin. However, using that processor is not an option in every environment, nor is it mandatory when using the receiver.

It is also possible to avoid this problem by decorating the metrics in the client, but as stated in #15290, that can be cumbersome or just not possible.

Steps to Reproduce

  1. Run the collector with the statsd receiver and the file exporter:

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 10s
    is_monotonic_counter: false

processors:
  batch:

exporters:
  file/tmp:
    path: /tmp/collector-output.json

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch]
      exporters: [file/tmp]
```
  2. Report two values for the same metric with the same set of attributes within the same aggregation interval:
❯ echo "test.metric:10|c|#myKey:myVal" | nc -w 1 -u localhost 8125
❯ echo "test.metric:20|c|#myKey:myVal" | nc -w 1 -u localhost 8125

Expected Result

The metric is expected to be aggregated: I'd expect to get a single value of 30 with the attribute myKey:myVal for test.metric.

Actual Result

I get two values of test.metric with the same set of attributes (myKey:myVal) and the same timestamp.

Sample output from /tmp/collector-output.json:

```json
{
    "resourceMetrics": [
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687877475484608000",
                                        "timeUnixNano": "1687877485483610000",
                                        "asInt": "20"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        },
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687877475484608000",
                                        "timeUnixNano": "1687877485483610000",
                                        "asInt": "10"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        }
    ]
}
```

Collector version

v0.80.0

Environment information

No response

OpenTelemetry Collector configuration

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 10s
    is_monotonic_counter: false

processors:
  batch:

exporters:
  file/tmp:
    path: /tmp/collector-output.json

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch]
      exporters: [file/tmp]
```

Log output

No response

Additional context

Possible cause

This can be reproduced using netcat because it closes the UDP connection right after writing the message, so a different ephemeral source port is used for each metric unless a port is specified explicitly. As a result, the metrics are aggregated separately and end up being exported as different batches.
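
To make the failure mode concrete, here is a minimal sketch in Go of aggregation keyed by the client's full address. It is a simplified model with made-up addresses and helper names, not the receiver's actual code:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// client address (IP:port) -> metric identity (name + attributes) -> aggregated value
	batches := map[string]map[string]int64{}

	record := func(client net.Addr, name, attrs string, value int64) {
		key := client.String() // includes the ephemeral source port
		if batches[key] == nil {
			batches[key] = map[string]int64{}
		}
		// Aggregation only happens within a single client key.
		batches[key][name+"|"+attrs] += value
	}

	// Two nc invocations: same metric, same attributes, different source ports.
	record(&net.UDPAddr{IP: net.ParseIP("127.0.0.1"), Port: 51234}, "test.metric", "myKey:myVal", 10)
	record(&net.UDPAddr{IP: net.ParseIP("127.0.0.1"), Port: 51235}, "test.metric", "myKey:myVal", 20)

	// Each client key is flushed as its own batch, so the exporter sees two
	// data points (10 and 20) instead of a single aggregated value of 30.
	for addr, points := range batches {
		for metric, value := range points {
			fmt.Printf("batch %s: %s = %d\n", addr, metric, value)
		}
	}
}
```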

If we use the same port to write the metrics:

❯ echo "test.metric:20|c|#myKey:myVal" | nc -w 1 -p 5000 -u localhost 8125
❯ echo "test.metric:10|c|#myKey:myVal" | nc -w 1 -p 5000 -u localhost 8125

We get the expected results:

```json
{
    "resourceMetrics": [
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687878125480554000",
                                        "timeUnixNano": "1687878135480173000",
                                        "asInt": "30"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        }
    ]
}
```

Considered solutions

  1. Always decorate metrics with the producer's address. This way, there would be an attribute identifying each metric's origin, so we would no longer report several values for the same metric and the same set of attributes at a given timestamp. This approach would be very simple to implement, but it would increase the size of every reported metric (even when the origin address is not an interesting attribute for the user's use case). As I see it, it would also introduce a breaking change, so we should avoid it if possible.

  2. Take a different approach to identify the producer's address, so that metrics can still be decorated by processors that rely on it, such as the k8sattributesprocessor. Metrics could be optionally decorated with the producer's address (controlled by a configuration flag), and we could then avoid grouping by address while performing the aggregation. This approach would require a different configuration in the processor to make statsdreceiver and k8sattributesprocessor work together. A rough sketch of the decoration both options describe is shown after this list.
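
For illustration only, here is a minimal sketch of what attaching the producer's address to an exported data point could look like. It uses the collector's pdata API, but the helper name, the attribute key client.address, and the values are assumptions, not part of the receiver:

```go
package main

import (
	"fmt"
	"net"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// buildDecoratedSum is a hypothetical helper: it builds a delta sum data point
// and attaches the producer's IP as an attribute, as both options above describe.
func buildDecoratedSum(name string, value int64, producer net.Addr) pmetric.Metrics {
	md := pmetric.NewMetrics()
	metric := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Metrics().AppendEmpty()
	metric.SetName(name)
	sum := metric.SetEmptySum()
	sum.SetAggregationTemporality(pmetric.AggregationTemporalityDelta)

	dp := sum.DataPoints().AppendEmpty()
	dp.SetIntValue(value)
	dp.Attributes().PutStr("myKey", "myVal")

	// The producer's IP (without the ephemeral port) identifies the origin, so
	// identical metrics from different connections no longer collide silently.
	if host, _, err := net.SplitHostPort(producer.String()); err == nil {
		dp.Attributes().PutStr("client.address", host) // attribute key is an assumption
	}
	return md
}

func main() {
	producer := &net.UDPAddr{IP: net.ParseIP("10.0.0.5"), Port: 51234}
	md := buildDecoratedSum("test.metric", 30, producer)
	fmt.Println(md.DataPointCount()) // 1
}
```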

If the bug is confirmed, we are willing to help solve it.

@sigilioso sigilioso added the bug (Something isn't working) and needs triage (New item requiring triage) labels Jun 28, 2023
@github-actions (bot)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme atoulme added the receiver/statsd (statsd related issues) label and removed the needs triage (New item requiring triage) label Jun 29, 2023
@github-actions (bot)

Pinging code owners for receiver/statsd: @jmacd @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sigilioso (Contributor, Author)

Hi @jmacd @dmitryax,

As I see it, the described scenario may be pretty common: several producers report the same metrics without any specific attribute identifying them. Could you take a look and confirm whether the aggregation is incorrect? Please let me know if you need additional details.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 18, 2023
@akayami commented Oct 6, 2023

I've run into the same problem.

statsdreceiver does not aggregate messages correctly and basically causes the metric to be exported as a single, unaggregated value. For instance, if I send 10 messages like this:
"test.metricA:5|c|#gate:otel"
within 10 seconds, I expect that 10-second block to have a value of 50. The otel statsd receiver reports 5.

The larger the aggregation time span, the larger the error gets. As is, this receiver is unusable, because even with an aggregation interval of 1s it still loses values.
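
To spell out the expected arithmetic, here is a minimal sketch of statsd counter semantics over one aggregation interval. The toy parser and the hard-coded interval are assumptions for illustration, not the receiver's code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter extracts the name and value from a statsd counter line such as
// "test.metricA:5|c|#gate:otel". This is a toy parser for illustration only.
func parseCounter(line string) (string, int64) {
	name, rest, _ := strings.Cut(line, ":")
	valuePart, _, _ := strings.Cut(rest, "|")
	value, _ := strconv.ParseInt(valuePart, 10, 64)
	return name, value
}

func main() {
	totals := map[string]int64{}
	// Ten identical counter increments received within one 10s aggregation interval.
	for i := 0; i < 10; i++ {
		name, value := parseCounter("test.metricA:5|c|#gate:otel")
		totals[name] += value
	}
	// A delta counter flushed at the end of the interval should report the sum
	// of all increments (50), not the value of a single message (5).
	fmt.Println(totals["test.metricA"]) // 50
}
```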

@github-actions github-actions bot removed the Stale label Oct 7, 2023
@github-actions bot commented Dec 7, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 7, 2023
@emersonclaireiaquinto commented Dec 27, 2023

My team has the same problem as well. The un-aggregated metrics flow directly into the awsemf exporter, which causes a conflict: the exporter doesn't allow duplicate metrics in its batch and drops all but one of them before export. This is a confirmed issue in both Docker (using docker compose and netcat) and Kubernetes (Kong statsd exporter -> ADOT).

@github-actions github-actions bot removed the Stale label Dec 28, 2023
charz pushed a commit to charz/opentelemetry-collector-contrib that referenced this issue Dec 28, 2023
The receiver uses each metric's source address as the key to create instruments,
but the port in that address is dynamically assigned. This results in the
StatsD receiver creating numerous instruments and being unable to aggregate
metrics together.

Fixes open-telemetry#29508, fixes open-telemetry#23809
@charz commented Dec 28, 2023

I think this has the same root cause as #29508, and I can reproduce the problem. Please check #30228 to see if that helps.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Feb 27, 2024
@github-actions (bot)

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 27, 2024
@Manuelraa

Is it correct that this should be closed instead of getting fixed?

@Manuelraa commented May 13, 2024

I'm not a statsd expert myself, but I have mixed feelings about the fix proposed here and in #30228.
If I read the code correctly, metrics would be aggregated correctly as long as a persistent UDP connection is used (?).
By implementing #30228, the aggregation by address becomes pointless and could just be removed, since all metrics would then be aggregated under the statsdreceiver's endpoint address.

Let's take Telegraf's implementation as an example.
For reference, here is where they store and buffer the packet input: link
What we see during processing is that the connection address is completely ignored unless the metric uses the DataDog statsd format; if it is used, the connection address is set as the source tag.

I would argue that applications should expose tags that identify themselves.
Still, aggregation by connection source can be useful.

Long story short: I propose making this address aggregation optional via a configuration flag,
and also adding a resource attribute for the source to make the resulting aggregation clearer (a rough sketch follows).
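
A minimal sketch of that proposal, assuming a hypothetical ipOnlyAggregation flag; this is not the merged statsdreceiver implementation:

```go
package main

import (
	"fmt"
	"net"
)

// aggregationKey is a hypothetical helper sketching the proposal: keep the
// current IP+port key by default, and optionally key by source IP only.
func aggregationKey(client net.Addr, ipOnlyAggregation bool) string {
	if !ipOnlyAggregation {
		return client.String() // current behaviour: key by IP and ephemeral port
	}
	if host, _, err := net.SplitHostPort(client.String()); err == nil {
		return host // proposed: key by source IP only (and expose it as a resource attribute)
	}
	return client.String()
}

func main() {
	a := &net.UDPAddr{IP: net.ParseIP("10.0.0.5"), Port: 51234}
	b := &net.UDPAddr{IP: net.ParseIP("10.0.0.5"), Port: 51235}
	fmt.Println(aggregationKey(a, false) == aggregationKey(b, false)) // false: separate batches today
	fmt.Println(aggregationKey(a, true) == aggregationKey(b, true))   // true: aggregated together
}
```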

@Manuelraa

@jmacd, @dmitryax I have created PR #33010 implementing the suggestion. Could you please reopen this issue and take a look at the PR?

dmitryax pushed a commit that referenced this issue Sep 25, 2024
**Description:**

The `statsdreceiver` only aggregates metrics per `protocol+IP+port`, which
leads to issues or inconsistencies when dealing with clients that
constantly switch TCP/UDP ports. To address the issue, this PR adds a
configuration option `enableIPOnlyAggregation` that allows the user to
specify whether they want to aggregate on the `IP` instead of `IP+Port`.

For example:

_otel_config.yaml:_

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"
    enable_metric_type: true
    is_monotonic_counter: false
    aggregation_interval: 10s
    enable_ip_only_aggregation: true # <-- enable ip only aggregation
    timer_histogram_mapping:
      - statsd_type: "timing"
        observer_type: "histogram"
        histogram: 
          max_size: 50  

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers:
        - statsd
      exporters:
        - debug
```

_run:_
```sh
STATSD_HOST="localhost"
STATSD_PORT=8125

for port in {10000..10010}; do
  echo -n "my.metric:1|c" | nc -w 1 -u -p $port ${STATSD_HOST} ${STATSD_PORT}
  echo "Sent from port $port"
done
```

_result:_
```
2024-08-26T23:36:00.224+0200    info    ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/statsdreceiver 0.103.0-dev
Metric #0
Descriptor:
     -> Name: -n my.metric
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> metric_type: Str(counter)
StartTimestamp: 2024-08-26 21:35:50.223101 +0000 UTC
Timestamp: 2024-08-26 21:36:00.224252 +0000 UTC
Value: 7
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-08-26T23:36:10.224+0200    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-08-26T23:36:10.224+0200    info    ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/statsdreceiver 0.103.0-dev
Metric #0
Descriptor:
     -> Name: -n my.metric
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> metric_type: Str(counter)
StartTimestamp: 2024-08-26 21:36:00.224252 +0000 UTC
Timestamp: 2024-08-26 21:36:10.224607 +0000 UTC
Value: 4
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}
```

Instead of generating 11 separate metrics, one for each source port used, only
two metric blocks are returned, whose values total 11 (the sends straddle a
flush-interval boundary, so they land in two consecutive 10s windows).

![2024-08-26 23 44 15](https://github.com/user-attachments/assets/6b8a89d1-186e-4257-9c82-90c5f9d14f98)


**Link to tracking Issue:** #23809

**Testing:** 

- [x] Added unit tests

**Documentation:**

- [x] Added information to the statsdreceiver `README.md` describing the
option.

---------

Co-authored-by: Povilas Versockas <povilas.versockas@coralogix.com>
jriguera pushed a commit to springernature/opentelemetry-collector-contrib that referenced this issue Oct 4, 2024
…34851)
