Collector constantly restarts when using k8s_observer #29558

Closed
ChrsMark opened this issue Nov 29, 2023 · 3 comments
Labels
bug, extension/observer/k8sobserver, needs triage

Comments

@ChrsMark (Member)

Component(s)

extension/observer/k8sobserver

What happened?

Description

When deploying the Collector with the k8s_observer extension enabled, the collector never starts successfully and falls into a constant restart loop because the readiness/liveness probes never succeed.
Even after increasing the timeoutSeconds of the probes, the situation does not improve.

Steps to Reproduce

  1. Deploy the collector with Helm using the configuration provided below: helm install daemonset open-telemetry/opentelemetry-collector --values daemonset.yaml
  2. Check that the Collector Pods fail to reach a healthy state and are constantly restarted

Expected Result

Collector Pods should be running without restarts with this basic configuration.

Actual Result

Constant restarts.

Collector version

0.90.0

Environment information

Environment

GKE: v1.27.3-gke.100

OpenTelemetry Collector configuration

mode: daemonset
presets:
  kubernetesAttributes:
    enabled: true

image:
  tag: "0.90.0"

config:
  extensions:
    k8s_observer:
      auth_type: serviceAccount
      node: ${env:K8S_NODE_NAME}
      observe_pods: true
  receivers:
    receiver_creator:
      watch_observers: [ k8s_observer ]
      receivers:
        redis:
          rule: type == "port" && pod.name matches "redis"
          config:
            collection_interval: 2s
  service:
    extensions: [k8s_observer]
    pipelines:
      metrics:
        receivers: [ receiver_creator ]

Log output

2023-11-29T12:51:34.736Z	info	service@v0.90.0/telemetry.go:86	Setting up own telemetry...
2023-11-29T12:51:34.736Z	info	service@v0.90.0/telemetry.go:203	Serving Prometheus metrics	{"address": "10.68.0.132:8888", "level": "Basic"}
2023-11-29T12:51:34.737Z	info	exporter@v0.90.0/exporter.go:275	Development component. May change in the future.	{"kind": "exporter", "data_type": "traces", "name": "debug"}
2023-11-29T12:51:34.738Z	info	memorylimiterprocessor@v0.90.0/memorylimiter.go:138	Using percentage memory limiter	{"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "total_memory_mib": 3928, "limit_percentage": 80, "spike_limit_percentage": 25}
2023-11-29T12:51:34.738Z	info	memorylimiterprocessor@v0.90.0/memorylimiter.go:102	Memory limiter configured	{"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "limit_mib": 3142, "spike_limit_mib": 982, "check_interval": 5}
2023-11-29T12:51:34.738Z	info	exporter@v0.90.0/exporter.go:275	Development component. May change in the future.	{"kind": "exporter", "data_type": "logs", "name": "debug"}
2023-11-29T12:51:34.738Z	info	exporter@v0.90.0/exporter.go:275	Development component. May change in the future.	{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2023-11-29T12:51:34.739Z	info	kube/client.go:113	k8s filtering	{"kind": "processor", "name": "k8sattributes", "pipeline": "traces", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.739Z	warn	jaegerreceiver@v0.90.0/factory.go:49	jaeger receiver will deprecate Thrift-gen and replace it with Proto-gen to be compatbible to jaeger 1.42.0 and higher. See https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/18485 for more details.	{"kind": "receiver", "name": "jaeger", "data_type": "traces"}
2023-11-29T12:51:34.739Z	info	kube/client.go:113	k8s filtering	{"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.740Z	info	kube/client.go:113	k8s filtering	{"kind": "processor", "name": "k8sattributes", "pipeline": "logs", "labelSelector": "", "fieldSelector": "spec.nodeName=gke-otel-demo-default-pool-0f2ce7eb-c7qr"}
2023-11-29T12:51:34.741Z	info	service@v0.90.0/service.go:148	Starting otelcol-contrib...	{"Version": "0.90.0", "NumCPU": 2}
2023-11-29T12:51:34.741Z	info	extensions/extensions.go:34	Starting extensions...
2023-11-29T12:51:34.741Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "k8s_observer"}
2023-11-29T12:51:34.741Z	info	extensions/extensions.go:45	Extension started.	{"kind": "extension", "name": "k8s_observer"}
2023-11-29T12:51:34.742Z	info	otlpreceiver@v0.90.0/otlp.go:83	Starting GRPC server	{"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "10.68.0.132:4317"}
2023-11-29T12:51:34.742Z	info	otlpreceiver@v0.90.0/otlp.go:101	Starting HTTP server	{"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "10.68.0.132:4318"}
2023-11-29T12:51:34.742Z	info	service@v0.90.0/service.go:174	Everything is ready. Begin running and processing data.
2023-11-29T12:52:04.138Z	info	otelcol@v0.90.0/collector.go:258	Received signal from OS	{"signal": "terminated"}
2023-11-29T12:52:04.139Z	info	service@v0.90.0/service.go:188	Starting shutdown...
2023-11-29T12:52:04.140Z	info	extensions/extensions.go:52	Stopping extensions...
2023-11-29T12:52:04.140Z	info	service@v0.90.0/service.go:202	Shutdown complete.

Additional context

It seems that the collector never reports ready/healthy. Here is what kubectl describe shows for the Pod:

  Warning  Unhealthy  4m24s (x11 over 5m23s)  kubelet            Readiness probe failed: Get "http://10.68.0.132:13133/": dial tcp 10.68.0.132:13133: connect: connection refused
  Warning  Unhealthy  4m24s (x6 over 5m14s)   kubelet            Liveness probe failed: Get "http://10.68.0.132:13133/": dial tcp 10.68.0.132:13133: connect: connection refused
  Normal   Killing    4m24s (x2 over 4m54s)   kubelet            Container opentelemetry-collector failed liveness probe, will be restarted
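
For reference, the Helm chart points the Pod's liveness and readiness probes at the health check extension's default endpoint. A rough sketch of the rendered probe spec (illustrative values, not the chart's exact output):

  livenessProbe:
    httpGet:
      path: /
      port: 13133
  readinessProbe:
    httpGet:
      path: /
      port: 13133

Since nothing is serving on port 13133 in this setup, both probes fail with connection refused and the kubelet keeps restarting the container.
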
ChrsMark added the bug and needs triage labels on Nov 29, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@rmfitzpatrick (Contributor) commented Nov 29, 2023

  service:
    extensions: [k8s_observer]

The issue is that the probes rely on the health_check extension, which has been omitted from the service extensions list here. It should be included along with any other desired extensions:

  service:
    extensions: [health_check, k8s_observer]
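
For completeness, here is a minimal values sketch with both extensions wired up (a sketch, assuming the health_check extension's default endpoint on 0.0.0.0:13133, which is the port the chart's probes target):

config:
  extensions:
    health_check: {}
    k8s_observer:
      auth_type: serviceAccount
      node: ${env:K8S_NODE_NAME}
      observe_pods: true
  service:
    extensions: [health_check, k8s_observer]

Declaring health_check explicitly is harmless even if the chart's default configuration already provides it; the key point is that any override of service.extensions must keep health_check in the list, otherwise nothing listens on 13133 and the probes keep failing.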

@ChrsMark (Member, Author)

Thanks @rmfitzpatrick, this solves the issue.

I guess this is something that the helm chart should check on startup and explicitly inform the user about. I will create an issue in the helm-charts repo since this is not related to the collector's implementation specifically.
