coreos.ignition.failure sometimes fails on RHCOS #3670

Closed
jlebon opened this issue Nov 28, 2023 · 15 comments

Comments

@jlebon
Member

jlebon commented Nov 28, 2023

coreos.ignition.failure on RHCOS sometimes passes (like it did originally in #3647) and sometimes fails with:

--- FAIL: coreos.ignition.failure (37.49s)
        qemufailure.go:63: pattern 'error creating /sysroot/notwritable.txt' in file '/var/tmp/mantle-qemu3926164821/console.log3861977006' not found

like in #3668.
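
For reference, the failing check amounts to looking for a message in the console log. Below is a minimal shell sketch of that idea, assuming fixed-string matching; the real check lives in qemufailure.go and may match differently. `console_has_pattern` is a hypothetical helper name, not part of kola.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not kola's implementation): report whether a console
# log contains the expected Ignition failure message.
console_has_pattern() {
    local pattern="$1" logfile="$2"
    if [ ! -r "$logfile" ]; then
        echo "missing: $logfile"
        return 2
    fi
    # -F treats the pattern as a fixed string; -q reports via exit status only.
    if grep -F -q -- "$pattern" "$logfile"; then
        echo "found"
    else
        echo "not found"
        return 1
    fi
}

# Usage (path is a placeholder for a saved console log):
# console_has_pattern 'error creating /sysroot/notwritable.txt' path/to/console.log
```

An empty or truncated console log would produce the same "not found" result as a log where Ignition never reached the failing step, so the log itself is needed to tell the two apart.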

@c4rt0 c4rt0 self-assigned this Nov 28, 2023
marmijo added a commit to marmijo/os that referenced this issue Nov 29, 2023
`coreos.unique.boot.failure` has been failing on aarch64 builds
and `coreos.ignition.failure` has been failing on RHCOS in general.
Let's denylist them and add a snooze while we investigate
the issues.

See: coreos/coreos-assembler#3670
and coreos/coreos-assembler#3669
@c4rt0 c4rt0 added the jira for syncing to jira label Dec 4, 2023
c4rt0 added a commit to c4rt0/coreos-assembler that referenced this issue Dec 5, 2023
@c4rt0
Member

c4rt0 commented Dec 12, 2023

I ran a simple script locally that ran coreos.ignition.failure 100 times. I was not able to reproduce the failure even once.

for i in {0..99}
do
	cosa kola run --qemu-image ./builds/415.92.202312121624-0/x86_64/rhcos-415.92.202312121624-0-qemu.x86_64.qcow2 coreos.ignition.failure
done
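
A variant of the loop above that stops at the first failure can make a rare flake easier to catch, since the output of the failing run is then the last thing on screen. This is only a sketch: `run_until_failure` is a hypothetical helper, and it assumes the command exits non-zero when the test fails.

```shell
#!/usr/bin/env bash
# run_until_failure MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first non-zero exit status.
run_until_failure() {
    local max="$1"
    shift
    local i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on run $i of $max"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max runs passed"
}

# Usage with the command from the loop above (not executed here):
# run_until_failure 100 cosa kola run --qemu-image ./builds/415.92.202312121624-0/x86_64/rhcos-415.92.202312121624-0-qemu.x86_64.qcow2 coreos.ignition.failure
```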

@c4rt0
Member

c4rt0 commented Dec 14, 2023

Since I couldn't reproduce this locally, I will un-denylist the test and let it fail in the pipeline so I can investigate the logs and better understand the underlying issue on RHCOS 4.16. Once the logs are available, I will add the test back to the denylist.

c4rt0 added a commit to c4rt0/os that referenced this issue Dec 14, 2023
coreos.ignition.failure has been failing on RHCOS in general.

After failing to reproduce it locally I am removing it
from denylist in order to catch errors in the pipeline.
The goal of this PR is to allow it in RHCOS 4.16, catch the
errors from the log and re-introduce `coreos.ignition.failure`
to kola-denylist while working on solution.

See: coreos/coreos-assembler#3670 and coreos/coreos-assembler#3669
c4rt0 added a commit to c4rt0/os that referenced this issue Dec 14, 2023
After failing to reproduce `coreos.ignition.failure` locally I am removing it from denylist in order to catch errors in the pipeline. The goal of this PR is to allow it in RHCOS 4.16, catch the errors from the log and re-introduce `coreos.ignition.failure` to kola-denylist while working on solution.

See: coreos/coreos-assembler#3670
@dustymabe
Member

Does this happen on x86_64 or on one of the secondary arches?

If x86_64, you can check the KOLA_RUN_SLEEP option on the main build job in the devel pipeline and try to run the test manually there.

If multi-arch, did you try in a debug pod?

@c4rt0
Member

c4rt0 commented Dec 14, 2023

I only came across a reference to this issue on aarch64.
#jenkins-rhcos-art

@c4rt0
Member

c4rt0 commented Dec 14, 2023

@marmijo comments on the matter in #3669 (comment) and openshift/os#1401 (comment):

> ... coreos.ignition.failure has been failing on RHCOS in general.

There are no console logs for this test in the kola output.

marmijo added a commit to marmijo/os that referenced this issue Jan 15, 2024
This test is still failing so let's extend the snooze
for a few weeks so we can continue to investigate
coreos/coreos-assembler#3670
@c4rt0
Member

c4rt0 commented Jan 18, 2024

I am currently running a Jenkins debug pod with RHCOS [4.16-9.2] on aarch64, attempting to reproduce the coreos.ignition.failure FAIL in a loop:

[coreos-assembler]$ for i in {0..99}; do kola run coreos.ignition.failure; done
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.08s)
PASS, output in tmp/kola/qemu-2024-01-18-1456-20985
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.22s)
PASS, output in tmp/kola/qemu-2024-01-18-1456-21055
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.13s)
PASS, output in tmp/kola/qemu-2024-01-18-1457-21124
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.12s)
PASS, output in tmp/kola/qemu-2024-01-18-1457-21206
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.20s)
PASS, output in tmp/kola/qemu-2024-01-18-1458-21279
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.30s)
PASS, output in tmp/kola/qemu-2024-01-18-1458-21351
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.30s)
PASS, output in tmp/kola/qemu-2024-01-18-1459-21421
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.28s)
PASS, output in tmp/kola/qemu-2024-01-18-1459-21492
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.35s)
PASS, output in tmp/kola/qemu-2024-01-18-1500-21564
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (27.26s)
PASS, output in tmp/kola/qemu-2024-01-18-1500-21644
=== RUN   coreos.ignition.failure

No failures have been detected in the runs completed so far. I will try the same on ppc64le and s390x.

@c4rt0
Member

c4rt0 commented Jan 18, 2024

The same goes for ppc64le:

successfully generated: rhcos-416.92.202401181531-0-qemu.ppc64le.qcow2
[coreos-assembler]$ for i in {0..99}; do kola run coreos.ignition.failure; done 
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (24.42s)
PASS, output in tmp/kola/qemu-2024-01-18-1607-21826
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (23.97s)
PASS, output in tmp/kola/qemu-2024-01-18-1608-21864
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (24.03s)
PASS, output in tmp/kola/qemu-2024-01-18-1608-21909
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (23.33s)
PASS, output in tmp/kola/qemu-2024-01-18-1609-21956
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (25.34s)
PASS, output in tmp/kola/qemu-2024-01-18-1609-21999
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (24.36s)
PASS, output in tmp/kola/qemu-2024-01-18-1610-22033
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
=== RUN   coreos.ignition.failure
--- PASS: coreos.ignition.failure (25.49s)
PASS, output in tmp/kola/qemu-2024-01-18-1610-22077
__   Skipping kola test pattern "pxe-*.ppcfw":
  __ https://github.com/coreos/coreos-assembler/issues/3370
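
Long runs like these are easier to read as a summary. The sketch below counts the `--- PASS:` and `--- FAIL:` lines and averages the reported durations; it assumes the output keeps the line format shown in this thread, and `summarize_kola_runs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Summarize kola result lines such as:
#   --- PASS: coreos.ignition.failure (24.42s)
# Reads the named files, or stdin when no file is given.
summarize_kola_runs() {
    awk '
        /^--- (PASS|FAIL): / {
            status = substr($2, 1, length($2) - 1)   # "PASS:" -> "PASS"
            count[status]++
            secs = $4
            gsub(/[()s]/, "", secs)                  # "(24.42s)" -> "24.42"
            total += secs
            n++
        }
        END {
            printf "pass=%d fail=%d", count["PASS"], count["FAIL"]
            if (n > 0) printf " mean=%.2fs", total / n
            printf "\n"
        }
    ' "$@"
}

# Usage (not executed here), saving the loop output first:
# for i in {0..99}; do kola run coreos.ignition.failure; done 2>&1 | tee runs.log
# summarize_kola_runs runs.log
```

Lines that do not start with `--- PASS:` or `--- FAIL:`, such as the "Skipping kola test pattern" notices, are ignored.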

@jlebon
Member Author

jlebon commented Jan 18, 2024

The other thing you could do here is keep the denylist entry but warn: true. That'll run the test but make it non-fatal so it doesn't break the pipeline.

> There are no console logs for this test in the kola output.

Yeah, that makes sense. The reason is that we're currently keeping the console.log file as a QEMU builder tempfile. So it gets nuked when the builder gets cleaned. I think instead we can use H.TempFile(), which allocates a tempfile in the output dir of the test.

Let's fix the console issue first before we turn it back on via warn: true.
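
The lifetime difference described above can be illustrated with a small shell demo. It is only an analogy: both directories are stand-ins created for the demo, and `demo_console_log_lifetime` is a hypothetical name, not kola code.

```shell
#!/usr/bin/env bash
# Illustrative only: a log written under a directory that gets cleaned up
# disappears, while one written under the output directory survives.
demo_console_log_lifetime() {
    local builder_tmp output_dir in_builder in_output
    builder_tmp=$(mktemp -d)   # stands in for the QEMU builder's temp dir
    output_dir=$(mktemp -d)    # stands in for the test's output dir

    in_builder=$(mktemp "$builder_tmp/console.log.XXXXXX")
    in_output=$(mktemp "$output_dir/console.log.XXXXXX")
    echo "boot messages" | tee "$in_builder" > "$in_output"

    rm -rf "$builder_tmp"      # simulate the builder being cleaned up

    if [ -e "$in_builder" ]; then echo "builder copy: kept"; else echo "builder copy: gone"; fi
    if [ -e "$in_output" ]; then echo "output copy: kept"; else echo "output copy: gone"; fi

    rm -rf "$output_dir"       # tidy up the demo itself
}
```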

@c4rt0
Member

c4rt0 commented Jan 25, 2024

> The other thing you could do here is keep the denylist entry but warn: true. That'll run the test but make it non-fatal so it doesn't break the pipeline.

Just for clarity, this is already implemented here.
As I understand it, what's left to do is to watch the pipeline for failures.

@c4rt0
Member

c4rt0 commented Feb 15, 2024

Today a failure was observed on 4.12 aarch64 412.86.202402141419-0.

Full jenkins console.log

[2024-02-14T16:18:54.659Z]         qemufailure.go:43: failed to establish qmp connection: dial unix /var/tmp/mantle-qemu89627189/qmp-1707927500223657820.sock: connect: connection refused

@c4rt0
Member

c4rt0 commented Feb 15, 2024

Additionally console.txt from the jenkins job

@dustymabe
Member

So the test failed once but then passed in the rerun.

> Additionally console.txt from the jenkins job

Is that log from the failed run or from the successful rerun?

I honestly think this was just some sort of infra flake.

@jlebon
Member Author

jlebon commented Feb 15, 2024

Yeah, I wouldn't be worried about that (well, maybe we can strengthen our handling of the QMP socket). Note also it's for 4.12, which doesn't have the changes that precipitated this issue.
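
As a rough idea of what strengthening could look like, a bounded retry with a delay is one common way to ride out a socket that is not yet accepting connections. The shell sketch below shows only the retry pattern; `retry` is a hypothetical helper, and nothing here reflects how kola actually talks to QMP.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success and 1 after the last failure.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (connect_to_socket is a placeholder for whatever performs the connection):
# retry 5 1 connect_to_socket /path/to/qmp.sock
```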

@c4rt0
Member

c4rt0 commented Feb 16, 2024

> Is that log from the failed run or from the successful rerun?

I first thought this log was from the failed test, but it's from the successful rerun (lesson learned).

> Yeah, I wouldn't be worried about that (well, maybe we can strengthen our handling of the QMP socket). Note also it's for 4.12, which doesn't have the changes that precipitated this issue.

Noted, thanks for the comments - sorry for the noise.

c4rt0 added a commit to c4rt0/os that referenced this issue Feb 26, 2024
Failure related to this test wasn't observed in a few weeks.

See: coreos/coreos-assembler#3670
@c4rt0
Member

c4rt0 commented Feb 26, 2024

The problem behind this issue hasn't been observed in a while, and I was unable to reproduce it either locally or in the pipeline.

@c4rt0 c4rt0 closed this as completed Feb 26, 2024