[resourcedetection] windows: Error 'failed getting host cpuinfo: context deadline exceeded' #33768

Closed
cwegener opened this issue Jun 26, 2024 · 4 comments · Fixed by #33774
Labels
bug (Something isn't working), os:windows, priority:p2 (Medium), processor/resourcedetection (Resource detection processor)

Comments

@cwegener
Contributor

Component(s)

processor/resourcedetection, processor/resourcedetection/internal/system

What happened?

Description

Since the introduction of the host cpuinfo attributes in #26533, system resource detection can fail catastrophically on Windows hosts. When it does, ALL configured system resource attributes (Host name, Host ID, OS type, OS description ...) become unavailable in every pipeline that uses that instance of the resourcedetection processor.

The cause is a combination of:

  1. cpuinfo attribute collection is ALWAYS running on the processor's Start() phase, regardless of whether the cpuinfo attributes are configured to be added into the resource attributes.
  2. The external dependency added by the cpuinfo work in "[processor/resourcedetection] Add support for host cpuinfo attributes" (#26533) retrieves the CPU info through a mechanism (WMI, see footnote 1) that can often fail with a timeout (hence the 'context deadline exceeded' error).

The issue is more likely to occur when the Otel collector starts during host boot (e.g. as a service launched by a service manager) than when it is launched on demand after the Windows host is already running.
This is due to the operating system parallelizing its startup tasks (services).

Steps to Reproduce

  1. Stop and disable the winmgmt Windows service to simulate the failure condition of not being able to collect the CPU info: sc config winmgmt start= disabled and net stop winmgmt
  2. Run Otel Collector with a config whose pipeline includes the resourcedetection processor with the system detector and at least one system attribute enabled (e.g. Host ID).
  3. Observe Otel Collector's logs

Expected Result

Otel Collector's logs contain the requested attribute (e.g. Host ID) in the resourcedetection processor's output

e.g.

2024-06-26T13:31:37.391+1000    info    internal/resourcedetection.go:125       began detecting resource information   {"kind": "processor", "name": "resourcedetection", "pipeline": "traces"}
2024-06-26T13:31:38.698+1000    info    internal/resourcedetection.go:139       detected resource information   {"kind": "processor", "name": "resourcedetection", "pipeline": "traces", "resource": {"host.id":"5ac27508-7835-40fd-a8e3-541bb69b8f70","host.name":"DESKTOP-RHABMHR","os.type":"windows","service.name":"windows-dev"}}

Actual Result

Otel Collector's logs contain the error message 'failed getting host cpuinfo:'

(NOTE: the simulated failure condition of completely disabling winmgmt produces a slightly different exception instead of the 'context deadline exceeded' error from a production system)

The requested attributes (e.g. Host ID) are missing from the resourcedetection processor's logs

e.g.

2024-06-26T13:38:26.012+1000    warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection", "pipeline": "traces", "error": "failed getting host cpuinfo: Exception occurred. (The service cannot be started, either because it is disabled or because it has no enabled devices associated with it. )"}
2024-06-26T13:38:26.012+1000    info    internal/resourcedetection.go:139       detected resource information   {"kind": "processor", "name": "resourcedetection", "pipeline": "traces", "resource": {"service.name":"windows-dev"}}

Collector version

v0.103.1

Environment information

Environment

OS: Windows (10, 11, 2019, 2022)

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  resourcedetection:
    detectors:
      - system
      - env
    system:
      resource_attributes:
        host.id:
          enabled: true
exporters:
  logging:
service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - resourcedetection
      exporters:
        - logging

Log output

2024-06-26T13:38:26.012+1000    warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection", "pipeline": "traces", "error": "failed getting host cpuinfo:

Additional context

There should have been a Breaking Change note in the ChangeLog that makes all users of the resourcedetection processor aware of the newly introduced hard dependency on the winmgmt service.

Footnotes

  1. https://github.com/shirou/gopsutil/blob/e74324b6a726997ce756b8f79dbbd7a3a0999ba0/cpu/cpu_windows.go#L98-L127

@cwegener cwegener added bug Something isn't working needs triage New item requiring triage labels Jun 26, 2024
@github-actions github-actions bot added the processor/resourcedetection Resource detection processor label Jun 26, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@ChrsMark
Member

> cpuinfo attribute collection is ALWAYS running on the processor's Start() phase, regardless of whether the cpuinfo attributes are configured to be added into the resource attributes.

Since these CPU info attributes are not enabled by default, we shouldn't call cpu.Info() when none of them is enabled. I can send a PR to fix this.

/cc @mx-psi

@cwegener
Contributor Author

> Since these cpu info attributes are not enabled by default we shouldn't call the cpu.Info() if none of those are enabled. I can send a PR to fix this

Yeah. That's one thing that needs fixing indeed. It would be great if cpu.Info() only ever gets called in cases where the cpuinfo detector is actually enabled by the user.

@cwegener
Contributor Author

I also just opened one more bug about this new detector: #33771

That other bug is MUCH harder to reproduce, though, because I cannot rely on the alternative WMI/COM error case of simply disabling the winmgmt service. I would actually need to find a way to artificially slow down COM calls in a dev/test system.

@mx-psi mx-psi added priority:p2 Medium os:windows and removed needs triage New item requiring triage labels Jun 26, 2024
mx-psi pushed a commit that referenced this issue Jun 28, 2024
**Description:**
This PR changes the system resource detector so as to only try to fetch
the CPU info when required. The CPU info attributes are disabled by
default so we should only fetch this information when at least one of
those is enabled.

**Link to tracking Issue:**
#33768

**Testing:** Added unit tests

**Documentation:** ~

/cc @mx-psi

---------

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>