
[gh-actions] make linux package installation tests resilient to container image failures #623

Closed
cwegener opened this issue Jan 1, 2024 · 9 comments · Fixed by #622
Labels
artifact:deb, artifact:docker, artifact:rpm, enhancement, stale

Comments


cwegener commented Jan 1, 2024

Component(s)

No response

Describe the issue you're reporting

Problem

In issue open-telemetry/opentelemetry-collector-contrib#16450, it was discovered that the Linux Package installation tests can fail for non-obvious reasons due to the complexity of the testbed setup.

The critical part of the Linux Package testing that was failing in the referenced issue was:

  1. Start a container from a standard Linux distro image (1 x Debian and 1 x Rocky Linux) that has systemd as the container's ENTRYPOINT
  2. Copy the .deb / .rpm package under test into the running container
  3. Perform the installation and start-up checks of the package inside the container (sketched below)
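
For context, the flow looks roughly like this (an illustrative sketch, not the actual test script; the image placeholder and the `pkg-test`, `otelcol.deb`, and `otelcol` names are hypothetical):

```console
# 1. Start a systemd-enabled distro container in the background
$ podman run --detach --name pkg-test <systemd-enabled-image>
# 2. Copy the package under test into the running container
$ podman cp otelcol.deb pkg-test:/tmp/otelcol.deb
# 3. Install the package and check that the service starts
$ podman exec pkg-test dpkg -i /tmp/otelcol.deb
$ podman exec pkg-test systemctl start otelcol
$ podman exec pkg-test systemctl is-active otelcol
```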

In the referenced issue, the execution was failing at step 1 without raising an error.

The silent failure occurs because the container image used for the installation testing has no health check configured when the image is run. Therefore, the only failure conditions that can be captured are the ones reported by `podman run` / `docker run`, which are limited to:

  • 125: the error is with Podman itself

```console
$ podman run --foo busybox; echo $?
Error: unknown flag: --foo
125
```

  • 126: the contained command cannot be invoked

```console
$ podman run busybox /etc; echo $?
Error: container_linux.go:346: starting container process caused "exec: \"/etc\": permission denied": OCI runtime error
126
```

  • 127: the contained command cannot be found

```console
$ podman run busybox foo; echo $?
Error: container_linux.go:346: starting container process caused "exec: \"foo\": executable file not found in $PATH": OCI runtime error
127
```

  • Otherwise: the exit code of the contained command itself

```console
$ podman run busybox /bin/sh -c 'exit 3'; echo $?
3
```

Source: https://docs.podman.io/en/latest/markdown/podman-run.1.html#exit-status

Since `docker run` / `podman run` is invoked with the `--detach` switch in the Linux Packaging tests, a non-zero exit code from the contained command itself will never be returned by `podman run` / `docker run`.
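
This is easy to reproduce: with `--detach`, `podman run` prints the container ID and returns 0 as soon as the container has started, and the contained command's exit code only becomes visible through a follow-up call such as `podman wait` (output abbreviated; the container ID shown is just an example):

```console
$ podman run --detach busybox /bin/sh -c 'exit 3'; echo $?
4c1f6e9b...
0
$ podman wait 4c1f6e9b
3
```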

Solution

To make the transition from step 1 to step 2 and beyond more resilient, it is sufficient to make the Linux Packaging script wait until the systemd manager inside the container is confirmed to be up and running.

This can be achieved with the following:

  1. Execute `systemctl is-system-running --wait` inside the container to wait until the system inside the container is fully operational
  2. Integrate the check from step 1 into the Linux Packaging execution by either:
  • using `systemctl --machine=<nameofcontainer> is-system-running --wait` directly in the Linux Packaging test script, or
  • adjusting the `podman run` command line to run `systemctl is-system-running --wait` as a startup health check, so that the test execution waits until the container is fully operational (see the sketch below).
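
A rough sketch of both options (the `pkg-test` name and the image placeholder are hypothetical; `--health-startup-cmd` is only available in Podman releases with startup health check support, 4.3 and later):

```console
# Option 1: block in the test script until systemd inside the container reports ready
$ podman run --detach --name pkg-test <systemd-enabled-image>
$ systemctl --machine=pkg-test is-system-running --wait

# Option 2: attach the readiness probe as a startup health check;
# the test script then gates on the container reaching a healthy state
$ podman run --detach --name pkg-test \
    --health-startup-cmd "systemctl is-system-running --wait" \
    <systemd-enabled-image>
```

Either way, the later copy and install steps only run once systemd is confirmed operational, so a boot failure inside the container surfaces as an explicit error instead of a silent failure.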
crobert-1 (Member) commented

Makes sense to me, removing the needs triage label.


github-actions bot commented Jan 2, 2024

Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.


github-actions bot commented Mar 4, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • testbed: @open-telemetry/collector-approvers

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the stale label Mar 4, 2024
@crobert-1 crobert-1 removed the stale label Mar 4, 2024
@github-actions github-actions bot added the stale label May 6, 2024
@crobert-1 crobert-1 removed the stale label May 6, 2024
@github-actions github-actions bot added the stale label Jul 8, 2024
mowies (Member) commented Aug 1, 2024

I think this issue should be moved into https://github.com/open-telemetry/opentelemetry-collector-releases since the package tests were moved there in #604

@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector-contrib Aug 1, 2024
@dosubot dosubot bot added the artifact:deb, artifact:docker, artifact:rpm, and enhancement labels Aug 1, 2024

dosubot bot commented Aug 1, 2024


To address the issue of silent failures in the Linux Package installation tests, you can implement a health check to ensure the container is fully operational before proceeding with further commands. Here are the steps to make the Linux Packaging script execution more resilient:

  1. Execute `systemctl is-system-running --wait`: this command waits until the systemd manager inside the container is confirmed to be up and running.

  2. Integrate the health check:

    • Option 1: use `systemctl --machine=<nameofcontainer> is-system-running --wait` directly in the Linux Packaging test script.
    • Option 2: adjust the `podman run` or `docker run` command to include `systemctl is-system-running --wait` as a startup health check.

By integrating this health check, you can ensure that the container is fully operational before copying the .deb or .rpm packages and performing the installation and start-up checks. This will help mitigate the silent failures observed in the current setup.


mowies (Member) commented Aug 1, 2024

I am working on a fix for this

mowies (Member) commented Aug 7, 2024

@cwegener the fix is ready for review :)
