Component(s)
extension/observer/k8sobserver
What happened?
Description
When deploying the Collector with the k8s_observer extension enabled, it never starts successfully and falls into a constant restart loop because the readiness/liveness probes never succeed.
Even after increasing the timeoutSeconds of the probes, the situation does not improve.
Steps to Reproduce
Using the provided configuration, deploy the collector with Helm: helm install daemonset open-telemetry/opentelemetry-collector --values daemonset.yaml
Check that the Collector Pods fail to reach a healthy state and are constantly restarted.
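The exact daemonset.yaml used here is not reproduced below; a minimal values overlay along these lines is enough to hit the same symptom (an illustrative sketch, assuming the opentelemetry-collector Helm chart's config override and the documented k8s_observer settings):

```yaml
# Illustrative sketch only -- not the original daemonset.yaml from this report.
# Assumes the opentelemetry-collector Helm chart's `config` override and the
# documented k8s_observer extension settings.
mode: daemonset
config:
  extensions:
    k8s_observer:
      auth_type: serviceAccount
      observe_pods: true
  service:
    extensions:
      - k8s_observer   # note: health_check is not listed here
```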
Expected Result
Collector Pods should be running without restarts with this basic configuration.
Actual Result
Constant restarts.
Collector version
0.90.0
Environment information
GKE: v1.27.3-gke.100
Log output
2023-11-29T12:51:34.736Z info service@v0.90.0/telemetry.go:86 Setting up own telemetry...
2023-11-29T12:51:34.736Z info service@v0.90.0/telemetry.go:203 Serving Prometheus metrics {"address": "10.68.0.132:8888", "level": "Basic"}
2023-11-29T12:51:34.737Z info exporter@v0.90.0/exporter.go:275 Development component. May change in the future. {"kind": "exporter", "data_type": "traces", "name": "debug"}
2023-11-29T12:51:34.738Z info memorylimiterprocessor@v0.90.0/memorylimiter.go:138 Using percentage memory limiter {"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "total_memory_mib": 3928, "limit_percentage": 80, "spike_limit_percentage": 25}
2023-11-29T12:51:34.738Z info memorylimiterprocessor@v0.90.0/memorylimiter.go:102 Memory limiter configured {"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "limit_mib": 3142, "spike_limit_mib": 982, "check_interval": 5}
2023-11-29T12:51:34.738Z info exporter@v0.90.0/exporter.go:275 Development component. May change in the future. {"kind": "exporter", "data_type": "logs", "name": "debug"}
2023-11-29T12:51:34.738Z info exporter@v0.90.0/exporter.go:275 Development component. May change in the future. {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2023-11-29T12:51:34.739Z info kube/client.go:113 k8s filtering {"kind": "processor", "name": "k8sattributes", "pipeline": "traces", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.739Z warn jaegerreceiver@v0.90.0/factory.go:49 jaeger receiver will deprecate Thrift-gen and replace it with Proto-gen to be compatbible to jaeger 1.42.0 and higher. See https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/18485 for more details. {"kind": "receiver", "name": "jaeger", "data_type": "traces"}
2023-11-29T12:51:34.739Z info kube/client.go:113 k8s filtering {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.740Z info kube/client.go:113 k8s filtering {"kind": "processor", "name": "k8sattributes", "pipeline": "logs", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.741Z info service@v0.90.0/service.go:148 Starting otelcol-contrib... {"Version": "0.90.0", "NumCPU": 2}
2023-11-29T12:51:34.741Z info extensions/extensions.go:34 Starting extensions...
2023-11-29T12:51:34.741Z info extensions/extensions.go:37 Extension is starting... {"kind": "extension", "name": "k8s_observer"}
2023-11-29T12:51:34.741Z info extensions/extensions.go:45 Extension started. {"kind": "extension", "name": "k8s_observer"}
2023-11-29T12:51:34.742Z info otlpreceiver@v0.90.0/otlp.go:83 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "10.68.0.132:4317"}
2023-11-29T12:51:34.742Z info otlpreceiver@v0.90.0/otlp.go:101 Starting HTTP server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "10.68.0.132:4318"}
2023-11-29T12:51:34.742Z info service@v0.90.0/service.go:174 Everything is ready. Begin running and processing data.
2023-11-29T12:52:04.138Z info otelcol@v0.90.0/collector.go:258 Received signal from OS {"signal": "terminated"}
2023-11-29T12:52:04.139Z info service@v0.90.0/service.go:188 Starting shutdown...
2023-11-29T12:52:04.140Z info extensions/extensions.go:52 Stopping extensions...
2023-11-29T12:52:04.140Z info service@v0.90.0/service.go:202 Shutdown complete.
Additional context
It seems that the collector fails to report ready/healthy. Here is what the describe output shows:
Warning Unhealthy 4m24s (x11 over 5m23s) kubelet Readiness probe failed: Get "http://10.68.0.132:13133/": dial tcp 10.68.0.132:13133: connect: connection refused
Warning Unhealthy 4m24s (x6 over 5m14s) kubelet Liveness probe failed: Get "http://10.68.0.132:13133/": dial tcp 10.68.0.132:13133: connect: connection refused
Normal Killing 4m24s (x2 over 4m54s) kubelet Container opentelemetry-collector failed liveness probe, will be restarted
The issue is that the probe relies on the health_check extension, which is now omitted from the service extensions list. It should be included along with any other desired extensions:
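For example, a minimal sketch of the relevant part of the collector configuration (other sections omitted):

```yaml
# Sketch: health_check must be both declared under `extensions` and listed in
# service.extensions, otherwise the endpoint on port 13133 that the readiness
# and liveness probes target never starts.
extensions:
  health_check: {}
  k8s_observer:
    auth_type: serviceAccount
    observe_pods: true
service:
  extensions:
    - health_check
    - k8s_observer
```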
I guess this is something the helm chart should check at startup and explicitly inform the user about. I will create an issue in the helm-charts repo, since this is not specifically related to the collector's implementation.
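At the Helm values level, the corrected overlay would look roughly like this (a sketch, assuming the chart's config override replaces the default service.extensions list rather than appending to it, which is why health_check has to be repeated explicitly):

```yaml
# Sketch of a corrected values overlay -- not the original daemonset.yaml.
mode: daemonset
config:
  extensions:
    k8s_observer:
      auth_type: serviceAccount
      observe_pods: true
  service:
    extensions:
      - health_check   # keep the chart's default health endpoint for the probes
      - k8s_observer
```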