COS-2748: denylist: enable iscsi iso-offline-install-iscsi.bios #1461

Merged

Conversation

jbtrystram
Contributor

The multi-arch tests are skipped in kola code and
the x86 tests have been working for some time now.
See coreos/coreos-assembler#3705

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@jlebon
Member

jlebon commented Mar 15, 2024

The two metal jobs failed on the iSCSI tests.

@jbtrystram
Contributor Author

There is nothing obvious in the logs, but we are missing the inner VM logs. I'll update the test to capture the inner VM logs so we can debug this further.

@jbtrystram
Contributor Author

coreos/coreos-assembler#3763

This one should help with that

@jbtrystram
Contributor Author

I am unable to reproduce the failure locally

@jbtrystram
Contributor Author

/retest

@jbtrystram
Contributor Author

/test rhcos-9-build-test-metal

@jbtrystram
Contributor Author

/retest

@jbtrystram
Contributor Author

/retest

@jlebon
Member

jlebon commented Mar 26, 2024

Hmm, I still see console=ttyS0,9600n8 in the kernel command line in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_os/1461/pull-ci-openshift-os-master-scos-9-build-test-metal/1772267143987990528/artifacts/test/artifacts/kola-testiso/iso-offline-install-iscsi.bios/nested_vm_console.txt, so possibly this isn't running with the latest cosa.

And indeed, if we look at https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_os/1461/pull-ci-openshift-os-master-scos-9-build-test-metal/1772267143987990528/build-log.txt, we see

  "git": {
    "commit": "6cbe50215e2040f92342a27ab57e5f36e1c424aa",
    "origin": "https://github.com/coreos/coreos-assembler.git",
    "branch": "HEAD",
    "dirty": "false"
  },

and 6cbe50215e2040f92342a27ab57e5f36e1c424aa is the commit just before coreos/coreos-assembler@d833492.
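
The check described above (read the cosa commit out of the build log and compare it with the commit that carries the fix) can be sketched in Python. This is only an illustration: the helper names `cosa_commit` and `built_at` are hypothetical, the log fragment is the one quoted above, and the comparison is a plain prefix match against the abbreviated hash.

```python
import re

# Fragment copied from the build log quoted above.
BUILD_LOG_FRAGMENT = '''
  "git": {
    "commit": "6cbe50215e2040f92342a27ab57e5f36e1c424aa",
    "origin": "https://github.com/coreos/coreos-assembler.git",
    "branch": "HEAD",
    "dirty": "false"
  },
'''


def cosa_commit(log_text):
    """Return the first 40-character hash found in a "commit" field, or None."""
    match = re.search(r'"commit":\s*"([0-9a-f]{40})"', log_text)
    return match.group(1) if match else None


def built_at(log_text, short_hash):
    """True only if the logged commit is exactly the one abbreviated by short_hash."""
    commit = cosa_commit(log_text)
    return commit is not None and commit.startswith(short_hash)


print(cosa_commit(BUILD_LOG_FRAGMENT))
print(built_at(BUILD_LOG_FRAGMENT, "d833492"))
```

A prefix match only tells whether the job ran exactly at that commit. Whether a build already contains the fix is an ancestry question, which needs a checkout of the repository (for example `git merge-base --is-ancestor`).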

Let's just try to retest it.

/retest

@jbtrystram
Contributor Author

jbtrystram commented Mar 27, 2024

So this last run contains the console fix from cosa.

However the console output is still truncated : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_os/1461/pull-ci-openshift-os-master-rhcos-9-build-test-metal/1772662083565916160/artifacts/test/artifacts/kola-testiso/iso-offline-install-iscsi.bios/nested_vm_console.txt

It looks like the nested VM dies. Could it be that the test process is killed because it uses too much memory?

Locally the test runs fine: the nested console log is almost 2K lines long (175 in Prow) and I can see the whole boot sequence.
Here it just seems to stop.
Note that the FCOS CI runs just fine.

@travier
Member

travier commented Mar 27, 2024

We limit jobs to 3GB memory: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/os/openshift-os-master.yaml#L71

If you can confirm that this is the issue then we can bump the requirement. Or skip it in Prow only.

@jlebon
Member

jlebon commented Mar 27, 2024

We need at least 4G so that would add up indeed. Likely also there's a bug in cosa where we're reporting a timeout instead of more clearly indicating that qemu exited. I squashed one of those recently-ish.
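
The arithmetic behind this suspicion is simple enough to write down. The sketch below is hypothetical: it treats the 3GB job limit and the 4G VM size mentioned above as the same unit, and `fits_in_pod` and its `overhead_gib` parameter are names made up for illustration.

```python
def fits_in_pod(pod_limit_gib, vm_memory_gib, overhead_gib=0.0):
    """True if the nested VM's memory plus any other usage stays within the pod limit."""
    return vm_memory_gib + overhead_gib <= pod_limit_gib


POD_LIMIT_GIB = 3  # job memory limit mentioned above

print(fits_in_pod(POD_LIMIT_GIB, 4))  # the VM size mentioned above
print(fits_in_pod(POD_LIMIT_GIB, 2))  # a hypothetical smaller VM
```

This only compares requested sizes. It cannot show that the kernel actually killed the process, which would need to be confirmed from the pod or node logs.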

jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Mar 28, 2024
OpenShift Prow has a default limit of 3GB for the kola job,
so reduce the memory for this test to 2GB; local testing showed
no issue with this amount.

See openshift/os#1461 (comment)
jlebon pushed a commit to coreos/coreos-assembler that referenced this pull request Mar 28, 2024
OpenShift Prow has a default limit of 3GB for the kola job,
so reduce the memory for this test to 2GB; local testing showed
no issue with this amount.

See openshift/os#1461 (comment)
@jbtrystram
Contributor Author

/retest

@c4rt0
Contributor

c4rt0 commented Mar 29, 2024

It timed out, so I'll...

/retest

@jbtrystram
Contributor Author

Reducing the memory allocation for the test VM didn't help, so it's not an OOM issue.

@jbtrystram
Contributor Author

/retest

Contributor

@c4rt0 c4rt0 left a comment

Since it times out, this makes sense.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2024
@jbtrystram
Contributor Author

I am unable to reproduce the issue. I tested the setup from a Prow pod and was able to boot the nested VM with this setup: gist.github.com/jbtrystram/b8edb5d33dbcf203e42c6e32e5cb7c4a

However, it did take a while, so maybe it's just the 10 min limit that's being hit there.

I played with it some more today and was not able to find anything interesting.
Outside of kola, the test works, but takes around 11 min to complete.
I also tried to simply evaluate nested virt. In the Prow pod I did:

  • cosa run --add-ignition autologin
    Once in the VM:
  • cd /mnt/workdir-tmp
  • cosa run --add-ignition autologin
    This did work, but a bit slowly.

@c4rt0
Contributor

c4rt0 commented Apr 17, 2024

Not sure if I'm on the right track here, but out of curiosity I compared today all of the build-test-metal data between this job's Prow YAML file and the build-test-metal YAML from the fedora-coreos-config bump.

Unsurprisingly, the only difference is the log section where this test fails:

  - containerID: cri-o://72db2099b9a33a65bf6b9a64b3a57dc9f1240ca9d37c346aaad7b45c5315be5f
    image: image-registry.openshift-image-registry.svc:5000/ci/ci-operator@sha256:edb9193a2416342fccf1abb62b7dd4ed2b9a4ab603fff2eceb5bfce2bbec00a8
    imageID: image-registry.openshift-image-registry.svc:5000/ci/ci-operator@sha256:6ea7b86d546c1a9ad8ee8384b20d063f59f61376cb0ff2fb8d530169db6297de
    lastState: {}
    name: test
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://72db2099b9a33a65bf6b9a64b3a57dc9f1240ca9d37c346aaad7b45c5315be5f
        exitCode: 127
        finishedAt: "2024-04-17T10:07:23Z"
        message: "0m[2024-04-17T07:08:59Z] Building build-image                         \n\e[36mINFO\e[0m[2024-04-17T07:08:59Z]
          Found existing build \"build-image-amd64\"     \n\e[36mINFO\e[0m[2024-04-17T07:11:25Z]
          Build build-image-amd64 succeeded after 2m27s \n\e[36mINFO\e[0m[2024-04-17T07:11:26Z]
          Image ci-op-pxfhb3m5/pipeline:build-image created  \e[36mfor-build\e[0m=build-image\n\e[36mINFO\e[0m[2024-04-17T07:11:26Z]
          Executing test rhcos-9-build-test-metal      \n{\"component\":\"entrypoint\",\"file\":\"k8s.io/test-infra/prow/entrypoint/run.go:169\",\"func\":\"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess\",\"level\":\"error\",\"msg\":\"Process
          did not finish before 3h0m0s timeout\",\"severity\":\"error\",\"time\":\"2024-04-17T10:07:22Z\"}\n\e[36mINFO\e[0m[2024-04-17T10:07:22Z]
          Received signal.                              \e[36msignal\e[0m=interrupt\n\e[36mINFO\e[0m[2024-04-17T10:07:22Z]
          error: Process interrupted with signal interrupt, cancelling execution...
          \n\e[36mINFO\e[0m[2024-04-17T10:07:22Z] cleanup: Deleting test pod rhcos-9-build-test-metal
          \n\e[36mINFO\e[0m[2024-04-17T10:07:22Z] Ran for 3h0m0s                               \n\e[31mERRO\e[0m[2024-04-17T10:07:22Z]
          Some steps failed:                           \n\e[31mERRO\e[0m[2024-04-17T10:07:22Z]
          \n  * could not run steps: execution cancelled\n  * could not run steps:
          step rhcos-9-build-test-metal failed: test \"rhcos-9-build-test-metal\"
          failed: could not watch pod: context canceled \n\e[36mINFO\e[0m[2024-04-17T10:07:22Z]
          Reporting job state 'failed' with reason 'executing_graph:interrupted' \n{\"component\":\"entrypoint\",\"file\":\"k8s.io/test-infra/prow/entrypoint/run.go:264\",\"func\":\"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate\",\"level\":\"error\",\"msg\":\"Process
          gracefully exited before 1h0m0s grace period\",\"severity\":\"error\",\"time\":\"2024-04-17T10:07:23Z\"}\n{\"component\":\"entrypoint\",\"error\":\"process
          timed out\",\"file\":\"k8s.io/test-infra/prow/entrypoint/run.go:84\",\"func\":\"k8s.io/test-infra/prow/entrypoint.Options.internalRun\",\"level\":\"error\",\"msg\":\"Error
          executing test process\",\"severity\":\"error\",\"time\":\"2024-04-17T10:07:23Z\"}\n"
        reason: Error
        startedAt: "2024-04-17T07:07:22Z"
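
As a quick sanity check on the status above, the container's run time can be computed from the `startedAt` and `finishedAt` fields and compared with the 3h0m0s timeout reported in the message. A minimal Python sketch, where `run_duration` is a hypothetical helper name:

```python
from datetime import datetime, timedelta

# Timestamp layout used by the startedAt/finishedAt fields above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def run_duration(started_at, finished_at):
    """Return the elapsed time between two timestamps in the format above."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(finished_at, TIMESTAMP_FORMAT)
    return end - start


duration = run_duration("2024-04-17T07:07:22Z", "2024-04-17T10:07:23Z")
timeout = timedelta(hours=3)

print(duration)             # 3:00:01
print(duration >= timeout)  # True
```

The container ran for the full three hours plus one second, which is consistent with the entrypoint killing the process at its timeout rather than the test failing on its own.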

I keep wondering when and where the "Found existing build" message is created. I can't find it.

@jbtrystram
Contributor Author

I keep wondering when and where the "Found existing build" message is created. I can't find it.

It's because during those 3 hours I was building images and doing some testing. Once the sleep 3h was done, the tests carried on and found the artifacts I had built manually.

Contributor

@c4rt0 c4rt0 left a comment

That makes sense. Another one, which might sound crazy:

Why not try 4h, as in here? Could this work?

edit: scratch the above.

This one still /lgtm

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2024
@jbtrystram jbtrystram force-pushed the remove-iscsi-denylist branch 2 times, most recently from 7ce4044 to f6db60d Compare April 22, 2024 19:16
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2024
@jbtrystram
Contributor Author

/retest

@jbtrystram jbtrystram force-pushed the remove-iscsi-denylist branch 2 times, most recently from d40cef5 to e8f6fed Compare April 22, 2024 21:47
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2024
The multi-arch tests are skipped in kola code and
the x86 tests have been working for some time now.

However, the iscsi test fails in prow unexpectedly
so skip it here so it can run in the RHCOS jenkins
pipeline at least.
See openshift#1492
See coreos/coreos-assembler#3705
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2024
Contributor

@c4rt0 c4rt0 left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2024
@jbtrystram
Contributor Author

jbtrystram commented Apr 23, 2024

/retitle COS-2748: denylist: enable iscsi iso-offline-install-iscsi.bios

@openshift-ci openshift-ci bot changed the title denylist: enable iscsi iso-offline-install-iscsi.bios COS-2748: denylist: enable iscsi iso-offline-install-iscsi.bios Apr 23, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 23, 2024
@openshift-ci-robot

openshift-ci-robot commented Apr 23, 2024

@jbtrystram: This pull request references COS-2748 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

The multi-arch tests are skipped in kola code and
the x86 tests have been working for some time now.
See coreos/coreos-assembler#3705

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Member

@travier travier left a comment

/lgtm

Contributor

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: c4rt0, jbtrystram, travier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [c4rt0,jbtrystram,travier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Apr 23, 2024

@jbtrystram: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 975baa5 into openshift:master Apr 23, 2024
7 checks passed
@@ -23,5 +23,7 @@
osversion:
- c9s

- pattern: iso-offline-install-iscsi.bios
tracker: https://github.com/coreos/fedora-coreos-tracker/issues/1638
# This test is failing only in prow, so it's skipped by prow
Member

Minor: a bit awkward to have a commented-out denylist entry. I think it'd be clearer to instead add the link to #1492 as a comment just above where we --denylist-test it in ci/prow-entrypoint.sh since that's where the denying actually happens?

Contributor Author

This was meant as a reminder that we skip this test in Prow. kola-denylist.yaml gets more eyes than ci/prow-entrypoint.sh, I suppose.
I'm fine with changing it if you have strong opinions :)

Member

Not a strong opinion either. :)
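
For readers unfamiliar with what a denylist entry does, here is a rough Python sketch of pattern-based test skipping. It is not kola's implementation. The `pattern` and `osversion` keys mirror the ones visible in the diff above, while the in-memory layout, the `example.test.*` entry, the `rhel-9.4` version string and glob-style matching are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of denylist entries.
DENYLIST = [
    {"pattern": "iso-offline-install-iscsi.bios"},
    {"pattern": "example.test.*", "osversion": ["c9s"]},
]


def is_denied(test_name, osversion, denylist):
    """True if any entry that applies to this OS version matches the test name."""
    for entry in denylist:
        versions = entry.get("osversion")
        # An entry restricted to other OS versions does not apply here.
        if versions is not None and osversion not in versions:
            continue
        if fnmatchcase(test_name, entry["pattern"]):
            return True
    return False


print(is_denied("iso-offline-install-iscsi.bios", "rhel-9.4", DENYLIST))
print(is_denied("example.test.foo", "rhel-9.4", DENYLIST))
```

Under this model, an entry that is commented out in the YAML file has no effect at all, which is why the skip has to live wherever `--denylist-test` is actually passed.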

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

7 participants