K8s metrics silently fail, do not appear within Cloud Monitoring #760
Comments
A reproduction using the latest contrib image would be very helpful. If you can share a collector configuration that demonstrates the issue, that would also be helpful. Otherwise, I'll still try to reproduce it myself.
Sounds good, I'll get a cluster deployed and use the contrib collector. I'll also report back with my exact config. Thanks!
@dashpole thank you for the fast response. I was able to reproduce the issue with a contrib build from the main branch. I deployed the following to a new GKE cluster:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otelcol
  namespace: default
data:
  config.yaml: |
    receivers:
      k8s_cluster:
        allocatable_types_to_report:
          - cpu
          - memory
          - ephemeral-storage
          - storage
        auth_type: serviceAccount
        collection_interval: 60s
        distribution: kubernetes
        node_conditions_to_report:
          - Ready
          - DiskPressure
          - MemoryPressure
          - PIDPressure
          - NetworkUnavailable
    processors:
      batch:
      resource/clustername:
        attributes:
          - action: insert
            key: k8s.cluster.name
            value: minikube
      transform/cleanup:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              - delete_key(resource.attributes, "k8s.cluster.name") where true
              - delete_key(resource.attributes, "k8s.pod.name") where true
              - delete_key(resource.attributes, "k8s.node.name") where true
              - delete_key(resource.attributes, "k8s.container.name") where true
              - delete_key(resource.attributes, "k8s.namespace.name") where true
              - delete_key(resource.attributes, "k8s.node.uid") where true
              - delete_key(resource.attributes, "opencensus.resourcetype") where true
      transform/hostname:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              - set(resource.attributes["host.name"], "otel-cluster-agent") where true
    exporters:
      googlecloud:
      logging:
    service:
      pipelines:
        metrics:
          receivers:
            - k8s_cluster
          processors:
            - resource/clustername
            # - transform/cleanup
            # - transform/hostname
            - batch
          exporters:
            - googlecloud
            - logging
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: otelcol
  name: otelcolcontrib
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otelcolcontrib
  labels:
    app.kubernetes.io/name: otelcol
  namespace: default
rules:
  - apiGroups:
      - ""
    resources:
      - events
      - namespaces
      - namespaces/status
      - nodes
      - nodes/spec
      - nodes/stats
      - nodes/proxy
      - pods
      - pods/status
      - replicationcontrollers
      - replicationcontrollers/status
      - resourcequotas
      - services
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - daemonsets
      - deployments
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - jobs
      - cronjobs
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otelcolcontrib
  labels:
    app.kubernetes.io/name: otelcol
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otelcolcontrib
subjects:
  - kind: ServiceAccount
    name: otelcolcontrib
    namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-cluster-agent
  labels:
    app.kubernetes.io/name: otelcol
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: otelcol
  template:
    metadata:
      labels:
        app.kubernetes.io/name: otelcol
    spec:
      serviceAccount: otelcolcontrib
      containers:
        - name: opentelemetry-container
          image: bmedora/otelcolcontrib:2816252149.0
          imagePullPolicy: IfNotPresent
          securityContext:
            readOnlyRootFilesystem: true
          resources:
            requests:
              memory: 200Mi
              cpu: 100m
            limits:
              cpu: 100m
              memory: 200Mi
          env:
            - name: AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /etc/otel
              name: config
      volumes:
        - name: config
          configMap:
            name: otelcol

Once deployed, you can uncomment the processors in the pipeline to observe the workaround:

      processors:
        - resource/clustername
        # - transform/cleanup
        # - transform/hostname
        - batch

After applying the config again, restart the deployment. Because I was running on GKE, I could have used resource detection. I left the …
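For reference, resource detection on GKE could look roughly like the sketch below. This is not taken from the config above; the processor name and detector list are assumptions to verify against your collector build (the contrib resourcedetection processor with its gcp detector).

processors:
  resourcedetection/gcp:
    # On GKE the gcp detector can populate cloud.platform,
    # cloud.availability_zone/cloud.region, and k8s.cluster.name automatically.
    detectors: [gcp]
    timeout: 10s
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      processors: [resourcedetection/gcp, batch]
      exporters: [googlecloud]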
I think I figured out why it silently fails. I removed the retry_on_failure helper because we aren't using the retry mechanism. However, that is what ultimately logs the error message. Downgrading to v0.83.0 will give you the error message back.
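If it helps, pinning the stock contrib image in the Deployment above is one way to test that. This is a sketch assuming the official otel/opentelemetry-collector-contrib image is an acceptable stand-in for the custom build; the tag is the version named in this comment:

    spec:
      containers:
        - name: opentelemetry-container
          # Pin to v0.83.0, which still logs export errors per the comment above.
          image: otel/opentelemetry-collector-contrib:0.83.0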
I get a bunch of errors:
The following made some of the metrics work:
I think the remaining issue is that we need to map the deployment/daemonset/statefulset name, etc., to an attribute.
@dashpole it sounds like I need cloud.platform and cloud.availability_zone in order to map to k8s monitored resource types? I would have expected my metrics to be unique with or without the additional two resources. Things work fine if I use …

Even without host.name or k8s.cluster.name, my project had a single collector sending metrics from a single cluster. Usually the duplicate time series errors show up if we have a uniqueness issue on our end (multiple collectors sending the same metrics).
I'm actually surprised this worked. I would have expected metrics to still collide, as multiple metrics would have the same host.name... I suspect most metrics still failed to send, but some succeeded. The failures just weren't logged because open-telemetry/opentelemetry-collector-contrib#25900 removed all logging of errors.
One thing to keep in mind is that we don't preserve all resource attributes, since we need to map to Google Cloud Monitoring resources. Any resource attributes we don't use for the monitored resource are discarded unless you set metric.resource_filters in the config.
You can see the full mapping logic here: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/internal/resourcemapping/resourcemapping.go#L65. For k8s_cluster, you need the cluster location (cloud.availability_zone or cloud.region) and k8s.cluster.name resource attributes.

One omission to note is that we don't have mappings for k8s_deployment, k8s_daemonset, etc. For example, for deployment metrics, the best mapping would be to k8s_cluster. You would need to use metric.resource_filters to add k8s.deployment.name as a metric attribute to make those metrics work; see the sketch below. Filed #761 for the collector error logging issue.
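For illustration, a minimal sketch of what that exporter configuration could look like. The exact attribute list is an assumption based on the deployment/daemonset/statefulset example above; adjust it to the attributes your metrics actually carry:

exporters:
  googlecloud:
    metric:
      resource_filters:
        # Promote these resource attributes to metric labels so workload-level
        # metrics stay distinguishable after being mapped to k8s_cluster.
        - prefix: k8s.deployment.name
        - prefix: k8s.daemonset.name
        - prefix: k8s.statefulset.name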
I've also filed GoogleCloudPlatform/opentelemetry-operator-sample#56 to try and document this usage better.
Let me know if using metric.resource_filters works for you.
Our distribution (bindplane-agent) configures the exporter's resource_filters with … I re-ran my test with the contrib collector using the following config. No luck.

receivers:
  k8s_cluster:
    allocatable_types_to_report:
      - cpu
      - memory
      - ephemeral-storage
      - storage
    auth_type: serviceAccount
    collection_interval: 60s
    distribution: kubernetes
    node_conditions_to_report:
      - Ready
      - DiskPressure
      - MemoryPressure
      - PIDPressure
      - NetworkUnavailable
processors:
  batch:
  resource/clustername:
    attributes:
      - action: insert
        key: k8s.cluster.name
        value: minikube
exporters:
  googlecloud:
    metric:
      resource_filters:
        - prefix: "k8s."
  logging:
service:
  pipelines:
    metrics:
      receivers:
        - k8s_cluster
      processors:
        - resource/clustername
        - batch
      exporters:
        - googlecloud
        - logging

If I set … This screenshot shows …
You need to add this as well in the resource processor:

      - action: insert
        key: cloud.availability_zone
        value: us-east1-b
      - action: insert
        key: cloud.platform
        value: gcp_kubernetes_engine

(The requirement for cloud.platform was removed in recent versions, but it could possibly still be needed.)
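Putting that together with the resource/clustername processor from the config above, the complete processor would look roughly like this; the cluster name and zone are the placeholder values used in this thread, not required values:

processors:
  resource/clustername:
    attributes:
      - action: insert
        key: k8s.cluster.name
        value: minikube                # placeholder; use your cluster's name
      - action: insert
        key: cloud.availability_zone
        value: us-east1-b              # placeholder; use your cluster's location
      - action: insert
        key: cloud.platform
        value: gcp_kubernetes_engine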
With platform and location missing, shouldn't I expect to see the metrics show up under generic_node? Or is host.name / node_id a hard requirement? With …
That is what I would expect. host.name / node_id is not a hard requirement (I think). It very well may be a bug. As mentioned above, you will need to downgrade to v0.83.0 to see the error message so we can figure out what is actually happening and fix it.
If you update to v0.89.0, the error logging will be fixed.
I added a sample that works for either the googlecloud or googlemanagedprometheus exporters: GoogleCloudPlatform/opentelemetry-operator-sample#57. Just make sure you also set …
Optimistically closing. Feel free to reopen if you have any more questions.
When using kubeletstatsreceiver and the k8sclusterreceiver, the collector successfully scrapes metrics from my cluster but the data points do not show up within Cloud Monitoring. The collector does not log an error.
Metric descriptors are created in my brand new GCP Project, but they appear as "inactive" and no data points are shown when viewing the metrics using a dashboard widget.
If I remove the k8s-based attributes and replace them with a single host.name attribute, the metrics come through as "generic_node". This provides evidence of an issue with metric mapping to k8s monitored resource types.

This screenshot illustrates a before-and-after representation of the data points and their resource attributes.
This screenshot illustrates that metrics began showing up after removing the k8s attributes.
If I use version 1.18.0 / 0.42.0 of this repo, the issue disappears and metrics work as expected. I believe the issue was introduced in v1.19.1 / v0.43.1 and is still present in v1.20.0 / v0.44.0. Interestingly, v1.19.0 included a change that fixes monitored resource type mapping for logs, which are working great (#683).
My testing was performed with bindplane-agent managed by BindPlane OP. If you need me to perform the same tests with OpenTelemetry Contrib, I can, but I suspect the results will be the same.
We support Google Cloud customers who operate Kubernetes clusters outside of Google Cloud; this issue means those users cannot update to the most recent collector versions.