
Internal server error when location field not set #789

Closed
jsirianni opened this issue Dec 20, 2023 · 6 comments

Labels
bug Something isn't working

@jsirianni

This is a follow-up to #760. The issue with logs being silently dropped has been resolved, which has exposed the real issue I am encountering.

When sending a k8s_container metric to Cloud Monitoring, I get an "internal server error". It is resolved by setting cloud.availability_zone (or the region resource attribute).
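
One way to set the attribute is with the Collector's resource processor; a minimal sketch (the processor name and zone value below are placeholders, not taken from our actual configuration):

processors:
  resource/zone-example:
    attributes:
    - key: cloud.availability_zone
      value: "us-east1-b"   # placeholder zone; cloud.region can be upserted the same way
      action: upsert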

With logging, I can get by without setting the location field; it seems that this should be the case with metrics as well. My cluster is not part of a cloud and has no reasonable region / zone value. This is also true for the GCP customers that we (observiq) support: they run on-premises clusters.

This is the error text, formatted to be easily readable:

One or more TimeSeries could not be written: Internal error encountered. 
Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: 

k8s_container{
    container_name:coredns,
    location:,
    pod_name:coredns-787d4945fb-r9fk5,
    namespace_name:kube-system,
    cluster_name:stage
} 

timeSeries[0]: workload.googleapis.com/core.log.count{
    k8s_pod_name:coredns-787d4945fb-r9fk5,
    k8s_node_name:minikube,
    k8s_cluster_name:stage,
    host_name:bindplane-gateway-agent-2,
    instrumentation_source:logcount,
    k8s_container_name:coredns,
    k8s_namespace_name:kube-system
}

error details: name = Unknown  desc = total_point_count:1 errors:{status:{code:13} point_count:1}

The monitored resource in the error does make it clear that the location field is unset.

This is the log from the collector (with the error value removed):

{
  "level":"error",
  "ts":"2023-12-20T16:41:34.815Z",
  "caller":"exporterhelper/queue_sender.go:123",
  "msg":"Exporting failed. No more retries left. Dropping data.",
  "kind":"exporter",
  "data_type":"metrics",
  "name":"googlecloud/jsirianni-12202023-temp",
  "error":"<see error>",
  "dropped_items":1,
  "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/queue_sender.go:123\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/consumers.go:43"
}

Exporter config:

googlecloud/jsirianni-12202023-temp:
    log:
        compression: gzip
        resource_filters:
            - regex: .*
    metric:
        compression: gzip
    project: jsirianni-12202023-temp
    sending_queue:
        enabled: true
        num_consumers: 10
        queue_size: 5000
@jsirianni jsirianni changed the title K8s_container metrics result in internal server error when location field not set Internal server error when location field not set Dec 20, 2023
@jsirianni
Author

I just encountered this with some generic node metrics as well, which is interesting because they were fine previously.

I was getting this error over and over again until I added the location resource attribute.

{"level":"error","ts":"2023-12-20T12:29:02.935-0500","caller":"exporterhelper/queue_sender.go:123","msg":"Exporting failed. No more retries left. Dropping data.","kind":"exporter","data_type":"metrics","name":"googlecloud/jsirianni-12202023-temp","error":"rpc error: code = Internal desc = Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.","dropped_items":12,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/queue_sender.go:123\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.91.0/exporterhelper/internal/consumers.go:43"}

@dashpole dashpole added the bug Something isn't working label Jan 8, 2024
@dashpole
Contributor

dashpole commented Jan 8, 2024

Are you explicitly setting cloud.availability_zone to ""? We should be falling back to "global" if none is set: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L155

@dashpole
Contributor

dashpole commented Jan 8, 2024

This might be a regression.

@dashpole
Contributor

I don't think it is a regression, since I don't think the fallbacks in question have been changed. But I do think it is best to fall back to global where it makes sense. I opened #795.

@jsirianni
Author

> Are you explicitly setting cloud.availability_zone to ""? We should be falling back to "global" if none is set: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L155

Sorry for the late reply; no, we do not set cloud.availability_zone to "". When we do set it, we use a valid zone.

#795 sounds good to me; I think that would be helpful.

@dashpole
Contributor

The error message should be improved to actually indicate that the location is the problem. That change should roll out in the next week or two.

But I've talked with some folks from Cloud Monitoring, and they don't think falling back to global is a good idea. It isn't actually a global location, so the name is misleading and can make users think the data is replicated globally, which it isn't. It is strongly recommended that users pick a location that is at least reasonably close to them so it doesn't cause problems on the query side.

If you really want to use global, which I don't recommend, you can hard-code it using the resource processor:

processors:
  resource/defaultglobal:
    attributes:
    - key: cloud.availability_zone
      value: "global"
      action: upsert
    - key: cloud.region
      value: "global"
      action: upsert
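
If you go this route, the processor also needs to be listed in the metrics pipeline ahead of the exporter. A minimal sketch, where the otlp receiver and batch processor are placeholders rather than your actual pipeline:

service:
  pipelines:
    metrics:
      receivers: [otlp]                                # placeholder receiver
      processors: [resource/defaultglobal, batch]      # run the upsert before exporting
      exporters: [googlecloud/jsirianni-12202023-temp]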

Closing as won't fix. Feel free to reopen if you have other questions.
