
[exporterhelper] Persistent queue can block collector start-up #9451

Closed · swar8080 opened this issue Feb 1, 2024 · 3 comments
Labels: area:exporter, bug (Something isn't working)

Comments


swar8080 commented Feb 1, 2024

Describe the bug
When using a persistent queue backed by file storage, the persisted state can cause the collector to get stuck partway through start-up. By "stuck" I mean that the collector process doesn't exit, but some components are never initialized. Disabling the persistent queue fixes the issue.

Dumping the stack traces of all goroutines with the pprof extension suggests that the goroutine below is blocking start-up from continuing. This stack trace was present every time the bug happened and was never present on collector instances that started cleanly:

1 @ 0x43f1ce 0x408c45 0x408897 0x11ea95e 0x11eaddb 0x11eaf4b 0x11e706c 0x11e3829 0x11de3a7 0xd0b245 0xe5c0a5 0xeca506 0xeca912 0xecce0c 0xe8867c 0xe88f3f 0xe88a93 0x630ea7d 0x630e9f8 0x630e96a 0x43ed5b 0x472ac1
#	0x11ea95d	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initPersistentContiguousStorage+0x3bd	go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/persistent_queue.go:154
#	0x11eadda	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initClient+0x7a				go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/persistent_queue.go:123
#	0x11eaf4a	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).Start+0xea				go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/persistent_queue.go:115
#	0x11e706b	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start+0x4b				go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/internal/consumers.go:31
#	0x11e3828	go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start+0x48							go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/queue_sender.go:136
#	0x11de3a6	go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start+0xa6						go.opentelemetry.io/collector/exporter@v0.92.0/exporterhelper/common.go:202
#	0xd0b244	go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll+0x1a4							go.opentelemetry.io/collector/service@v0.92.0/internal/graph/graph.go:396
#	0xe5c0a4	go.opentelemetry.io/collector/service.(*Service).Start+0x2e4									go.opentelemetry.io/collector/service@v0.92.0/service.go:169
#	0xeca505	go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents+0x705						go.opentelemetry.io/collector/otelcol@v0.92.0/collector.go:190
#	0xeca911	go.opentelemetry.io/collector/otelcol.(*Collector).Run+0x51									go.opentelemetry.io/collector/otelcol@v0.92.0/collector.go:229
#	0xecce0b	go.opentelemetry.io/collector/otelcol.NewCommand.func1+0x6b									go.opentelemetry.io/collector/otelcol@v0.92.0/command.go:27
#	0xe8867b	github.com/spf13/cobra.(*Command).execute+0xabb											github.com/spf13/cobra@v1.8.0/command.go:983
#	0xe88f3e	github.com/spf13/cobra.(*Command).ExecuteC+0x3fe										github.com/spf13/cobra@v1.8.0/command.go:1115
#	0xe88a92	github.com/spf13/cobra.(*Command).Execute+0x12											github.com/spf13/cobra@v1.8.0/command.go:1039
#	0x630ea7c	main.runInteractive+0x3c													github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:27
#	0x630e9f7	main.run+0x117															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:11
#	0x630e969	main.main+0x89															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:20
#	0x43ed5a	runtime.main+0x2ba		

Line 154 is trying to write to a buffered channel. If I'm reading it correctly, the goroutine would block whenever persisted write index - persisted read index > queue capacity?

[screenshot of the code at persistent_queue.go:154]
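
Here's a minimal, standalone Go sketch of what I think is happening. The names (preloadQueue, readIndex, writeIndex) are made up for illustration and this is not the actual collector code; it only shows how sends to a full buffered channel block forever when no consumer is draining it yet, which matches what Start appears to be doing:

    package main

    import (
        "fmt"
        "time"
    )

    // preloadQueue mimics, in spirit, what initPersistentContiguousStorage is
    // suspected to do on Start: push every persisted item index into a channel
    // whose buffer equals the configured queue capacity. Names are hypothetical.
    func preloadQueue(readIndex, writeIndex uint64, capacity int) {
        ch := make(chan uint64, capacity)
        for i := readIndex; i < writeIndex; i++ {
            // No consumer is draining ch yet, so this send blocks as soon as the
            // buffer is full, i.e. once writeIndex-readIndex exceeds capacity.
            ch <- i
        }
        fmt.Println("preload finished") // never reached when the backlog exceeds capacity
    }

    func main() {
        // Persisted backlog (15000 items) larger than the configured queue_size
        // (10000): the preloading goroutine blocks forever, which in the collector
        // would keep Start() from ever returning.
        go preloadQueue(0, 15000, 10000)

        time.Sleep(2 * time.Second)
        fmt.Println("main is still alive, but the preload goroutine is stuck")
    }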

Steps to reproduce

I've since disabled file storage and haven't had time to find consistent steps to reproduce the issue, but I can revisit this later if it would help.

Here are some observations though:

  • Our exporter configuration was left broken for a while, causing many export failures
  • I remember seeing error messages about the exporter queue being full
  • The issue never resolves itself, even after restarts or after letting the instance sit for hours. At one point we did fix it by deleting the persistent volume claims, before realizing they were probably at fault, but it eventually happened again
  • Some of the collectors in our Kubernetes StatefulSet never ran into the issue even when restarted, but I didn't test this thoroughly

What did you expect to see?
The collector starts up cleanly.

What did you see instead?
The collector process stays up but does nothing, i.e. no new log statements are emitted and the receivers do not accept data.

What version did you use?
v0.92.0 of the collector-contrib image

What config did you use?

    extensions:
      ...
      file_storage/otc:
        directory: /data/wal/storage/otc
        timeout: 10s

    exporters:
      otlp:
        endpoint: ...
        sending_queue:
          queue_size: 10000
          storage: file_storage/otc


    service:
      extensions: [file_storage/otc]
      ...
      pipelines:
        ...
        traces:
          receivers: [otlp]
          processors: [k8sattributes, attributes/env, batch]
          exporters: [otlp, spanmetrics, servicegraph, debug]

Environment

  • Using the collector-contrib image on Linux EC2 instances
  • Using the OpenTelemetry Operator and Helm charts, with pods deployed as a StatefulSet using Kubernetes persistent volume claims
swar8080 added the bug (Something isn't working) label on Feb 1, 2024
swar8080 changed the title from "[exporterhelper] Persistent queue intermittently blocking collector start-up" to "[exporterhelper] Persistent queue can block collector start-up" on Feb 1, 2024
TylerHelmuth (Member) commented:

pinging @dmitryax @djaglowski @astencel-sumo

dmitryax (Member) commented Feb 5, 2024

@swar8080 Thank you for reporting.

I think it happens when the queue persisted on disk is bigger than the capacity configured for the collector. #9232 should fix the issue.

In the meantime, can you please try increasing queue_size for the collectors that cannot start and see if that resolves the issue?
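
To illustrate the general idea only (this is a sketch, not the actual change in #9232, and the names are hypothetical): a restore that never sends more item indexes than the channel buffer can hold would avoid blocking start-up, leaving the remaining items on disk for the consumers to pick up later:

    package main

    import "fmt"

    // preloadUpToCapacity is illustrative only: it preloads at most as many item
    // indexes as the channel buffer has room for, returning the next index that
    // still has to be restored later instead of blocking start-up.
    func preloadUpToCapacity(readIndex, writeIndex uint64, ch chan uint64) uint64 {
        for i := readIndex; i < writeIndex; i++ {
            select {
            case ch <- i: // buffer has room
            default:
                return i // buffer full: stop instead of blocking
            }
        }
        return writeIndex
    }

    func main() {
        ch := make(chan uint64, 10000)
        next := preloadUpToCapacity(0, 15000, ch)
        fmt.Printf("preloaded %d items, next index to restore later: %d\n", len(ch), next)
    }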

dmitryax (Member) commented May 4, 2024

@swar8080 #9232 is merged, so you shouldn't see this problem starting from the next version, 0.100.0. Please confirm once you try it.
