
Internal server error when location field not set #789

Closed
jsirianni opened this issue Dec 20, 2023 · 6 comments

Labels
bug Something isn't working

@jsirianni

This is a follow-up to #760. The issue with logs being silently dropped has been resolved, which has exposed the real issue I am encountering.

When sending a k8s_container metric to Cloud Monitoring, I get an "internal server error". It is resolved by setting cloud.availability_zone (or the region resource attribute).
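
One way to set the attribute is with the Collector's resource processor; a minimal sketch (the processor name and zone value below are placeholders, not taken from our actual configuration):

processors:
  resource/zone-example:
    attributes:
    - key: cloud.availability_zone
      value: "us-east1-b"   # placeholder zone; cloud.region can be upserted the same way
      action: upsert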

With logging, I can get by without setting the location field; it seems that this should be the case with metrics as well. My cluster is not part of a cloud and has no reasonable region / zone value. This is also true for the GCP customers that we (observiq) support: they run on-premises clusters.

This is the error text, formatted to be easily readable:

One or more TimeSeries could not be written: Internal error encountered. 
Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: 

k8s_container{
    container_name:coredns,
    location:,
    pod_name:coredns-787d4945fb-r9fk5,
    namespace_name:kube-system,
    cluster_name:stage
} 

timeSeries[0]: workload.googleapis.com/core.log.count{
    k8s_pod_name:coredns-787d4945fb-r9fk5,
    k8s_node_name:minikube,
    k8s_cluster_name:stage,
    host_name:bindplane-gateway-agent-2,
    instrumentation_source:logcount,
    k8s_container_name:coredns,
    k8s_namespace_name:kube-system
}

error details: name = Unknown  desc = total_point_count:1 errors:{status:{code:13} point_count:1}

The monitored resource in the error does make it clear that the location field is unset.

This is the log from the collector (with the error value removed):

{
  "level":"error",
  "ts":"2023-12-20T16:41:34.815Z",
  "caller":"exporterhelper/queue_sender.go:123",
  "msg":"Exporting failed. No more retries left. Dropping data.",
  "kind":"exporter",
  "data_type":"metrics",
  "name":"googlecloud/jsirianni-12202023-temp",
  "error":"<see error>",
  "dropped_items":1,
  "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/queue_sender.go:123\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/consumers.go:43"
}

Exporter config:

googlecloud/jsirianni-12202023-temp:
    log:
        compression: gzip
        resource_filters:
            - regex: .*
    metric:
        compression: gzip
    project: jsirianni-12202023-temp
    sending_queue:
        enabled: true
        num_consumers: 10
        queue_size: 5000
@jsirianni jsirianni changed the title K8s_container metrics result in internal server error when location field not set Internal server error when location field not set Dec 20, 2023
@jsirianni
Author

I just encountered this with some generic node metrics as well, which is interesting because they were fine previously.

I was getting this error over and over again until I added the location resource attribute.

{"level":"error","ts":"2023-12-20T12:29:02.935-0500","caller":"exporterhelper/queue_sender.go:123","msg":"Exporting failed. No more retries left. Dropping data.","kind":"exporter","data_type":"metrics","name":"googlecloud/jsirianni-12202023-temp","error":"rpc error: code = Internal desc = Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.","dropped_items":12,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/queue_sender.go:123\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/consumers.go:43"}

@dashpole dashpole added the bug Something isn't working label Jan 8, 2024
@dashpole
Contributor

dashpole commented Jan 8, 2024

Are you explicitly setting cloud.availability_zone to ""? We should be falling back to "global" if none is set: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L155

@dashpole
Contributor

dashpole commented Jan 8, 2024

This might be a regression.

@dashpole
Contributor

I don't think it is a regression, since I don't think the fallbacks in question have been changed. But I do think it is best to fall back to global where it makes sense. I opened #795.

@jsirianni
Author

> Are you explicitly setting cloud.availability_zone to ""? We should be falling back to "global" if none is set: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L155

Sorry for the late reply; no, we do not set cloud.availability_zone to "". When we do set it, we use a valid zone.

#795 sounds good to me; I think that would be helpful.

@dashpole
Contributor

The error message should be improved to actually indicate that the location is the problem. That change should roll out in the next week or two.

But I've talked with some folks from Cloud Monitoring, and they don't think falling back to global is a good idea. It isn't actually a global location, so the name is misleading and can make users think the data is replicated globally, which it isn't. It is strongly recommended that users pick a location that is at least reasonably close to them so it doesn't cause problems on the query side.

If you really want to use global, which I don't recommend, you can hard-code it using the resource processor:

processors:
  resource/defaultglobal:
    attributes:
    - key: cloud.availability_zone
      value: "global"
      action: upsert
    - key: cloud.region
      value: "global"
      action: upsert
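
If you go this route, the processor also needs to be listed in the metrics pipeline ahead of the exporter. A minimal sketch, where the otlp receiver and batch processor are placeholders rather than your actual pipeline:

service:
  pipelines:
    metrics:
      receivers: [otlp]                                # placeholder receiver
      processors: [resource/defaultglobal, batch]      # run the upsert before exporting
      exporters: [googlecloud/jsirianni-12202023-temp]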

Closing as won't fix. Feel free to reopen if you have other questions.
