
K8s metrics silently fail, do not appear within Cloud Monitoring #760

Closed
jsirianni opened this issue Oct 26, 2023 · 17 comments
Labels: bug, priority: p2

@jsirianni

When using the kubeletstatsreceiver and the k8sclusterreceiver, the collector successfully scrapes metrics from my cluster, but the data points do not show up in Cloud Monitoring. The collector does not log an error.

Metric descriptors are created in my brand new GCP Project, but they appear as "inactive" and no data points are shown when viewing the metrics using a dashboard widget.

If I remove the k8s-based attributes and replace them with a single host.name attribute, the metrics come through as "generic_node". This points to an issue with mapping metrics to the k8s monitored resource types.

This screenshot illustrates a before and after representation of the data points and their resource attributes.

Screenshot 2023-10-26 at 11 32 34 AM

This screenshot illustrates that metrics began showing up after removing the k8s attributes.

Screenshot 2023-10-26 at 11 33 28 AM

If I use version 1.18.0 / 0.42.0 of this repo, the issue disappears and metrics work as expected. I believe the issue was introduced in v1.19.1 / v0.43.1 and is still present in v1.20.0 / v0.44.0. Interestingly, v1.19.0 included a change that fixes monitored resource type mapping for logs, which are working great. #683

My testing was performed with bindplane-agent managed by BindPlane OP. If you need me to perform the same tests with OpenTelemetry Contrib, I can, but I suspect the results will be the same.

We support Google Cloud customers who operate Kubernetes clusters outside of Google Cloud; this issue means those users cannot update to the most recent collector versions.

@dashpole
Contributor

A reproduction using the latest contrib image would be very helpful. If you can share a collector configuration that demonstrates the issue, that would also be helpful. Otherwise, I'll still try and reproduce it myself.

@dashpole dashpole self-assigned this Oct 26, 2023
@dashpole dashpole added bug Something isn't working priority: p1 labels Oct 26, 2023
@jsirianni
Author

Otherwise, I'll still try and reproduce it myself.

Sounds good, I'll get a cluster deployed and use the contrib collector. I'll also report back with my exact config. Thanks!

@jsirianni
Author

@dashpole thank you for the fast response. I was able to reproduce the issue with a contrib build (from main branch, commit 2816252149).

I deployed the following to a new GKE cluster.

Full manifest (ConfigMap, RBAC, and Deployment):
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otelcol
  namespace: default
data:
  config.yaml: |
    receivers:
        k8s_cluster:
            allocatable_types_to_report:
                - cpu
                - memory
                - ephemeral-storage
                - storage
            auth_type: serviceAccount
            collection_interval: 60s
            distribution: kubernetes
            node_conditions_to_report:
                - Ready
                - DiskPressure
                - MemoryPressure
                - PIDPressure
                - NetworkUnavailable

    processors:
        batch:

        resource/clustername:
            attributes:
                - action: insert
                  key: k8s.cluster.name
                  value: minikube

        transform/cleanup:
            error_mode: ignore
            metric_statements:
                - context: datapoint
                  statements:
                    - delete_key(resource.attributes, "k8s.cluster.name") where true
                    - delete_key(resource.attributes, "k8s.pod.name") where true
                    - delete_key(resource.attributes, "k8s.node.name") where true
                    - delete_key(resource.attributes, "k8s.container.name") where true
                    - delete_key(resource.attributes, "k8s.namespace.name") where true
                    - delete_key(resource.attributes, "k8s.node.uid") where true
                    - delete_key(resource.attributes, "opencensus.resourcetype") where true

        transform/hostname:
            error_mode: ignore
            metric_statements:
                - context: datapoint
                  statements:
                    - set(resource.attributes["host.name"], "otel-cluster-agent") where true

    exporters:
        googlecloud:

        logging:

    service:
        pipelines:
            metrics:
                receivers:
                    - k8s_cluster
                processors:
                    - resource/clustername
                    # - transform/cleanup
                    # - transform/hostname
                    - batch
                exporters:
                    - googlecloud
                    - logging
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: otelcol
  name: otelcolcontrib
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otelcolcontrib
  labels:
    app.kubernetes.io/name: otelcol
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - events
  - namespaces
  - namespaces/status
  - nodes
  - nodes/spec
  - nodes/stats
  - nodes/proxy
  - pods
  - pods/status
  - replicationcontrollers
  - replicationcontrollers/status
  - resourcequotas
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otelcolcontrib
  labels:
    app.kubernetes.io/name: otelcol
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otelcolcontrib
subjects:
- kind: ServiceAccount
  name: otelcolcontrib
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-cluster-agent
  labels:
    app.kubernetes.io/name: otelcol
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: otelcol
  template:
    metadata:
      labels:
        app.kubernetes.io/name: otelcol
    spec:
      serviceAccount: otelcolcontrib
      containers:
        - name: opentelemetry-container
          image: bmedora/otelcolcontrib:2816252149.0
          imagePullPolicy: IfNotPresent
          securityContext:
            readOnlyRootFilesystem: true
          resources:
            requests:
              memory: 200Mi
              cpu: 100m
            limits:
              cpu: 100m
              memory: 200Mi
          env:
            - name: AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /etc/otel
              name: config
      volumes:
        - name: config
          configMap:
            name: otelcol

Once deployed, you can uncomment the two transform processors in the pipeline to observe the workaround:

                processors:
                    - resource/clustername
                    # - transform/cleanup
                    # - transform/hostname
                    - batch

After applying again, restart the deployment: kubectl rollout restart deploy otel-cluster-agent.

Because I was running on GKE, I could have used resource detection. I kept the resource processor that sets k8s.cluster.name to minikube, since that is where I first observed the issue. On GKE, we get automatic authentication, as you know.
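For reference, a rough sketch of what that resource detection would look like with the contrib resourcedetection processor (the gcp detector fills in cloud.platform, cloud.availability_zone, k8s.cluster.name, and related attributes when running on GKE; the timeout value is only illustrative):

processors:
    resourcedetection:
        # The gcp detector queries the GCE/GKE metadata server and populates
        # cloud.platform, cloud.availability_zone, k8s.cluster.name, etc.
        detectors: [gcp]
        timeout: 10s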

@dashpole
Contributor

I think I figured out why it silently fails. I removed the retry_on_failure helper because we aren't using the retry mechanism. However, that is what ultimately logs the error message. Downgrading to v0.83.0 will give you the error message back.

@dashpole
Contributor

I get a bunch of errors:

Field timeSeries[57] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.\nerror details: name = Unknown  desc = total_point_count:101 success_point_count:16 errors:{status:{code:3} point_count:85}"}]}

@dashpole
Contributor

The following made some of the metrics work:

                - action: insert
                  key: k8s.cluster.name
                  value: minikube
                - action: insert
                  key: cloud.availability_zone
                  value: us-east1-b
                - action: insert
                  key: cloud.platform
                  value: gcp_kubernetes_engine

I think the remaining issue is that we need to map the deployment/daemonset/statefulset (etc.) name to an attribute.

@jsirianni
Author

@dashpole it sounds like I need cloud.platform and cloud.availability_zone in order to map to k8s monitored resource types?

I would have expected my metrics to be unique with or without the additional two resources. Things work fine if I use host.name and "trick" cloud monitoring / exporter into using generic_node.

Even without host.name or k8s.cluster.name, my project had a single collector sending metrics from a single cluster. Usually the duplicate time series errors show up if we have a uniqueness issue on our end (multiple collectors sending the same metrics).

@dashpole
Contributor

I would have expected my metrics to be unique with or without the additional two resources. Things work fine if I use host.name and "trick" cloud monitoring / exporter into using generic_node.

I'm actually surprised this worked. I would have expected metrics to still collide, as multiple metrics would have the same host.name... I suspect most metrics still failed to send, but some succeeded. The failures just weren't logged because open-telemetry/opentelemetry-collector-contrib#25900 removed all logging of errors.

Even without host.name or k8s.cluster.name, my project had a single collector sending metrics from a single cluster. Usually the duplicate time series errors show up if we have a uniqueness issue on our end (multiple collectors sending the same metrics).

One thing to keep in mind is that we don't preserve all resource attributes, since we need to map to Google Cloud Monitoring resources. Any resource attributes we don't use for the monitored resource are discarded, unless you set metric.resource_filters in the config:

ResourceFilters []ResourceFilter `mapstructure:"resource_filters"`

it sounds like I need cloud.platform and cloud.availability_zone in order to map to k8s monitored resource types?

You can see the full mapping logic here: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L65. For k8s_cluster, you need cloud.availability_zone and k8s.cluster.name. For k8s_pod, you additionally need k8s.namespace.name and k8s.pod.name. For k8s_container, you additionally need k8s.container.name.

One omission to note is that we don't have mappings for k8s_deployment, k8s_daemonset, etc. For example, for deployment metrics, the best mapping would be to k8s_cluster. You would need to use metric.resource_filters to add k8s.deployment.name as a metric attribute to make those metrics work.
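As a sketch, keeping those attributes as metric labels via resource_filters would look something like this in the exporter config (the "k8s." prefix here is just an example; adjust it to the attributes you need to preserve):

exporters:
    googlecloud:
        metric:
            resource_filters:
                # Copy any resource attribute starting with "k8s." onto the
                # metric as a label so deployment/daemonset names survive.
                - prefix: "k8s."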

Filed #761 for the collector error logging issue.

@dashpole
Contributor

dashpole commented Nov 1, 2023

I've also filed GoogleCloudPlatform/opentelemetry-operator-sample#56 to try and document this usage better.

@dashpole
Contributor

dashpole commented Nov 1, 2023

Let me know if using metric.resource_filters for k8s.deployment.name (and the other k8s.*.name) attributes fixes the remaining issues you are having.

@jsirianni
Author

Let me know if using metric.resource_filters for k8s.deployment.name (and the other k8s.*.name) attributes fixes the remaining issues you are having.

Our distribution (bindplane-agent) configures the exporter's resource_filters with prefix: "", which matches all resource attributes. We have found this to be necessary for many receivers whose resource attributes would otherwise be dropped.

I re-ran my test with the contrib collector, with the following config. No luck.

receivers:
    k8s_cluster:
        allocatable_types_to_report:
            - cpu
            - memory
            - ephemeral-storage
            - storage
        auth_type: serviceAccount
        collection_interval: 60s
        distribution: kubernetes
        node_conditions_to_report:
            - Ready
            - DiskPressure
            - MemoryPressure
            - PIDPressure
            - NetworkUnavailable

processors:
    batch:

    resource/clustername:
        attributes:
            - action: insert
              key: k8s.cluster.name
              value: minikube

exporters:
    googlecloud:
        metric:
            resource_filters:
                - prefix: "k8s."

    logging:

service:
    pipelines:
        metrics:
            receivers:
                - k8s_cluster
            processors:
                - resource/clustername
                - batch
            exporters:
                - googlecloud
                - logging

If I set host.name instead of k8s.cluster.name, the metrics show up just fine. Any time series uniqueness issues would be resolved by the resource_filters settings we normally use, which copy the deployment name (and other resource attributes) to datapoint attributes / Google labels.

This screenshot shows host.name being turned into node_id as usual, and the data points show up. If I switch back to k8s.cluster.name, the data points stop appearing.

Screenshot from 2023-11-02 13-49-39

@dashpole
Contributor

dashpole commented Nov 2, 2023

You need to add this as well in the resource processor:

                - action: insert
                  key: cloud.availability_zone
                  value: us-east1-b
                - action: insert
                  key: cloud.platform
                  value: gcp_kubernetes_engine

(the requirement for cloud.platform was removed in recent versions, but could possibly still be needed)

@jsirianni
Author

With platform and location missing, shouldn't I expect the metrics to show up under generic_node? Or is host.name / node_id a hard requirement?

With resource_filters configured, all resource attributes are copied over to metric labels. Each datapoint for each metric is unique, but missing from Cloud Monitoring.

@dashpole
Contributor

With platform and location missing, shouldn't I expect the metrics to show up under generic_node? Or is host.name / node_id a hard requirement?

That is what I would expect. host.name / node_id is not a hard requirement (I think). It very well may be a bug. As mentioned above, you will need to downgrade to v0.83.0 to see the error message so we can figure out what is actually happening and fix it.

@dashpole
Contributor

If you update to v0.89.0, the error logging will be fixed.

@dashpole
Contributor

I added a sample that works for either the googlecloud or googlemanagedprometheus exporters: GoogleCloudPlatform/opentelemetry-operator-sample#57. Just make sure you also set cloud.availability_zone, cloud.platform, and k8s.cluster.name if you aren't on GKE.
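For a non-GKE cluster, a minimal sketch of a resource processor that satisfies the mapping (the processor name and the values are placeholders; substitute your cluster's actual name and location):

processors:
    resource/gcp_mapping:
        attributes:
            # Required for the k8s_* monitored resource types.
            - action: insert
              key: k8s.cluster.name
              value: my-cluster
            - action: insert
              key: cloud.availability_zone
              value: us-east1-b
            - action: insert
              key: cloud.platform
              value: gcp_kubernetes_engine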

@dashpole
Contributor

Optimistically closing. Feel free to reopen if you have any more questions.
