TestHermeticTaskRun is flakey #4567

jerop · 2022-02-11T14:46:54Z

Expected Behavior

TestHermeticTaskRun should only fail due to actual bugs

Actual Behavior

TestHermeticTaskRun flaked in:

Error waiting for TaskRun not-hermetic-run-as-root to finish: "not-hermetic-run-as-root" failed
Error executing command: fork/exec /tekton/scripts/script-0-wrvhk: permission denied


bobcatfish · 2022-02-16T20:52:11Z

Some more context: it looks like for #4541 it failed 3 times in a row:

https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/pull/tektoncd_pipeline/4541/pull-tekton-pipeline-alpha-integration-tests/

I think `Error executing command: fork/exec /tekton/scripts/script-0-wrvhk: permission denied` might be a red herring. I think that might actually be from a previous run with hermetic mode on. (That makes me wonder if the expected failure in hermetic mode is happening for the right reason. Maybe script mode doesn't work with hermetic mode? But that's another story!)

It seems like the failing taskrun is timing out:

          podName: not-hermetic-run-as-root-pod
          startTime: "2022-02-02T10:36:41Z"
          steps:
          - container: step-access-network
            imageID: docker-pullable://ubuntu@sha256:669e010b58baf5beb2836b253c1fd5768333f0d1dbcb834f7c07a4dc93f474be
            name: access-network
            terminated:
              exitCode: 1
              finishedAt: "2022-02-02T10:37:41Z"
              reason: TaskRunTimeout
              startedAt: "2022-02-02T10:36:46Z"

And then I think we're not getting any logs because, iirc, when a TaskRun times out we have to stop the pod from executing, and I think that might involve deleting the underlying pod. I'm getting rusty though, so I'm not sure XD. If so, that would explain why we aren't seeing any logs for the TaskRun that is timing out:

    build_logs.go:35: Could not get logs for pod not-hermetic-run-as-root-pod: pods "not-hermetic-run-as-root-pod" not found

Looking at the test that is failing, I'm wondering if it might be that the apt-get commands sometimes take more than a minute 🤔

pipeline/test/hermetic_taskrun_test.go

Lines 101 to 102 in 38b9f26

    
    apt-get update
    apt-get install -y curl


Also use Errorf instead of Fatalf between the two tests (the hermetic test and the non-hermetic test) so that if one fails the other will still run. In tektoncd#4567 we see that the hermetic end-to-end test sometimes fails. Specifically, it seems to be the `not-hermetic-run-as-root` version of the test, and the failure seems to be hitting the 1 minute timeout. The test runs `apt-get update`, which seems like an operation in grave danger of sometimes taking a while (especially depending on which version of the latest ubuntu image is running). Although I'm not sure that's what is causing the problem, I want to try something that is less likely to take so long but still requires network access, as well as something that requires privileged access, which I assume is why the update was included: to capture the combo of network access and doing something privileged. I'm still a bit confused about why both of those elements are present. I assume neither is allowed in hermetic mode, but it would probably make more sense to test them separately to be sure they each fail; otherwise only one is covered (i.e. either the network call fails and halts things, or the privileged operation does).


Also use Errorf instead of Fatalf between the two tests (the hermetic test and the non-hermetic test) so that if one fails the other will still run. In #4567 we see that the hermetic end-to-end test sometimes fails. Specifically, it seems to be the `not-hermetic-run-as-root` version of the test, and the failure seems to be hitting the 1 minute timeout. The test runs `apt-get update`, which seems like an operation in grave danger of sometimes taking a while (especially depending on which version of the latest ubuntu image is running). Although I'm not sure that's what is causing the problem, I want to try something that is less likely to take so long but still requires network access. I thought it might also be trying to do something that required privileged execution (i.e. running as root), but that doesn't seem to be something hermetic mode drops anyway (looking at the TEP, it seems to be scoped just to networking), so there doesn't seem to be any need for that.

bobcatfish · 2022-03-28T21:33:21Z

Hopefully this is fixed by #4590 but plz re-open if it pops up again!

jerop · 2023-03-21T16:19:01Z

Saw this flake again - https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6407/pull-tekton-pipeline-alpha-integration-tests/1638203930456363008

tekton-robot · 2023-06-19T16:27:30Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot · 2023-07-19T17:10:54Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot · 2023-08-18T17:18:51Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot · 2023-08-18T17:18:53Z

@tekton-robot: Closing this issue.


jerop added the kind/bug and priority/important-soon labels Feb 11, 2022

jerop mentioned this issue Feb 11, 2022

cleanup - ApplyContext parameters #4564

Merged


pritidesai added the kind/flake label Feb 11, 2022

bobcatfish self-assigned this Feb 16, 2022

bobcatfish mentioned this issue Feb 16, 2022

Remove lengthly operations from hermetic tests 🧪 #4590

Merged


lbernick added this to Pipelines V1 Feb 22, 2022

lbernick moved this to In Progress in Pipelines V1 Feb 22, 2022

bobcatfish closed this as completed Mar 28, 2022

Repository owner moved this from In Progress to Done in Pipelines V1 Mar 28, 2022

jerop mentioned this issue Mar 21, 2023

Refactor Matrix Implementation #6407

Merged


jerop reopened this Mar 21, 2023

lbernick mentioned this issue Apr 28, 2023

Add matrix support for using references to entire PipelineRun array parameters #6516

Merged


tekton-robot added the lifecycle/stale label Jun 19, 2023

tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 19, 2023

tekton-robot closed this as completed Aug 18, 2023
