
[gh-actions] make linux package installation tests resilient to container image failures #623

Closed
cwegener opened this issue Jan 1, 2024 · 9 comments · Fixed by #622
Labels
artifact:deb, artifact:docker, artifact:rpm, enhancement, stale

Comments


cwegener commented Jan 1, 2024

Component(s)

No response

Describe the issue you're reporting

Problem

In issue open-telemetry/opentelemetry-collector-contrib#16450, it was discovered that the Linux Package installation tests can fail for non-obvious reasons due to the complexity of the testbed setup.

The critical part of the Linux Package testing that was failing in the referenced issue was:

  1. Start a container from a standard Linux distro image (1 x Debian and 1 x Rocky Linux) that has systemd as the container's ENTRYPOINT
  2. Copy the .deb / .rpm package under test into the running container
  3. Perform the installation and start-up checks of the package inside the container (sketched below)
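
For context, the flow looks roughly like this (an illustrative sketch, not the actual test script; the image placeholder and the `pkg-test`, `otelcol.deb`, and `otelcol` names are hypothetical):

```console
# 1. Start a systemd-enabled distro container in the background
$ podman run --detach --name pkg-test <systemd-enabled-image>
# 2. Copy the package under test into the running container
$ podman cp otelcol.deb pkg-test:/tmp/otelcol.deb
# 3. Install the package and check that the service starts
$ podman exec pkg-test dpkg -i /tmp/otelcol.deb
$ podman exec pkg-test systemctl start otelcol
$ podman exec pkg-test systemctl is-active otelcol
```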

In the referenced issue, the execution was failing at step 1 without raising an error.

The silent failure occurs because the container image used for the installation testing has no health check configured when the image is run. Therefore, the only failure conditions that can be captured are the ones reported by `podman run` / `docker run`, which are limited to:

  • 125: the error is with Podman itself

```console
$ podman run --foo busybox; echo $?
Error: unknown flag: --foo
125
```

  • 126: the contained command cannot be invoked

```console
$ podman run busybox /etc; echo $?
Error: container_linux.go:346: starting container process caused "exec: \"/etc\": permission denied": OCI runtime error
126
```

  • 127: the contained command cannot be found

```console
$ podman run busybox foo; echo $?
Error: container_linux.go:346: starting container process caused "exec: \"foo\": executable file not found in $PATH": OCI runtime error
127
```

  • Otherwise: the exit code of the contained command itself

```console
$ podman run busybox /bin/sh -c 'exit 3'; echo $?
3
```

Source: https://docs.podman.io/en/latest/markdown/podman-run.1.html#exit-status

Since `docker run` / `podman run` is invoked with the `--detach` switch in the Linux Packaging tests, a non-zero exit code from the contained command itself will never be returned by `podman run` / `docker run`.
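
This is easy to reproduce: with `--detach`, `podman run` prints the container ID and returns 0 as soon as the container has started, and the contained command's exit code only becomes visible through a follow-up call such as `podman wait` (output abbreviated; the container ID shown is just an example):

```console
$ podman run --detach busybox /bin/sh -c 'exit 3'; echo $?
4c1f6e9b...
0
$ podman wait 4c1f6e9b
3
```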

Solution

To make the transition from step 1 to step 2 and beyond more resilient, it is sufficient to make the Linux Packaging script wait until the systemd manager inside the container is confirmed to be up and running.

This can be achieved with the following:

  1. Execute `systemctl is-system-running --wait` inside the container to wait until the system inside the container is fully operational
  2. Integrate the check from step 1 into the Linux Packaging execution by either:
  • using `systemctl --machine=<nameofcontainer> is-system-running --wait` directly in the Linux Packaging test script, or
  • adjusting the `podman run` command line to run `systemctl is-system-running --wait` as a startup health check, so that the test execution waits until the container is fully operational (see the sketch below).
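
A rough sketch of both options (the `pkg-test` name and the image placeholder are hypothetical; `--health-startup-cmd` is only available in Podman releases with startup health check support, 4.3 and later):

```console
# Option 1: block in the test script until systemd inside the container reports ready
$ podman run --detach --name pkg-test <systemd-enabled-image>
$ systemctl --machine=pkg-test is-system-running --wait

# Option 2: attach the readiness probe as a startup health check;
# the test script then gates on the container reaching a healthy state
$ podman run --detach --name pkg-test \
    --health-startup-cmd "systemctl is-system-running --wait" \
    <systemd-enabled-image>
```

Either way, the later copy and install steps only run once systemd is confirmed operational, so a boot failure inside the container surfaces as an explicit error instead of a silent failure.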
crobert-1 (Member) commented

Makes sense to me, removing the needs triage label.


github-actions bot commented Jan 2, 2024

Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.


github-actions bot commented Mar 4, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • testbed: @open-telemetry/collector-approvers

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the stale label Mar 4, 2024
@crobert-1 crobert-1 removed the stale label Mar 4, 2024
@github-actions github-actions bot added the stale label May 6, 2024
@crobert-1 crobert-1 removed the stale label May 6, 2024
@github-actions github-actions bot added the stale label Jul 8, 2024
mowies (Member) commented Aug 1, 2024

I think this issue should be moved into https://github.com/open-telemetry/opentelemetry-collector-releases since the package tests were moved there in #604

@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector-contrib Aug 1, 2024
@dosubot dosubot bot added the artifact:deb, artifact:docker, artifact:rpm, and enhancement labels Aug 1, 2024

dosubot bot commented Aug 1, 2024


To address the issue of silent failures in the Linux Package installation tests, you can implement a health check to ensure the container is fully operational before proceeding with further commands. Here are the steps to make the Linux Packaging script execution more resilient:

  1. Execute `systemctl is-system-running --wait`: this command waits until the systemd manager inside the container is confirmed to be up and running.

  2. Integrate the health check:

    • Option 1: use `systemctl --machine=<nameofcontainer> is-system-running --wait` directly in the Linux Packaging test script.
    • Option 2: adjust the `podman run` or `docker run` command to include `systemctl is-system-running --wait` as a startup health check.

By integrating this health check, you can ensure that the container is fully operational before copying the .deb or .rpm packages and performing the installation and start-up checks. This will help mitigate the silent failures observed in the current setup.


mowies (Member) commented Aug 1, 2024

I am working on a fix for this

mowies (Member) commented Aug 7, 2024

@cwegener the fix is ready for review :)
