Bump to Fedora 40 #3785

Merged
merged 2 commits into from
May 1, 2024

Conversation

jlebon
Member

@jlebon jlebon commented Apr 25, 2024

Some of our upstream CIs (ostree, rpm-ostree) require cosa and FCOS to be on the same release. Ideally we'd fix that, but there are details there and we want to move cosa anyway.

@jlebon
Member Author

jlebon commented Apr 25, 2024

Didn't test this at all. Let's see what CI says.

@jlebon
Member Author

jlebon commented Apr 25, 2024

openshift/release PR: openshift/release#51370

@jlebon
Member Author

jlebon commented Apr 25, 2024

(Testing locally as well in parallel now.)

Let's also push a release and add a Quay.io tag before merging this.

@dustymabe
Member

> Let's also push a release and add a Quay.io tag before merging this.

Agreed. Ideally we build the next stable with at least a similar base to the one testing was done with.

@jlebon
Member Author

jlebon commented Apr 25, 2024

Prow needs openshift/release#51370.

@jlebon
Member Author

jlebon commented Apr 25, 2024

/retest

jmarrero
jmarrero previously approved these changes Apr 25, 2024
Member

@jmarrero jmarrero left a comment

lgtm

@travier
Member

travier commented Apr 26, 2024

/test ci/prow/images
/test ci/prow/rhcos

openshift-ci bot commented Apr 26, 2024

@travier: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test images
  • /test rhcos

Use /test all to run all jobs.

In response to this:

/test ci/prow/images
/test ci/prow/rhcos

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@travier
Member

travier commented Apr 26, 2024

/test images
/test rhcos

@jlebon
Member Author

jlebon commented Apr 26, 2024

CoreOS CI hanging at the cosa fetch --strict step. Possibly something going wrong with supermin. Prow is timing out, likely because of the same issue but for some reason we're not getting any logs there.

@jlebon
Member Author

jlebon commented Apr 29, 2024

Seems related to virtio-serial writes from the guest side sometimes hanging for some reason. (I.e. writes to /dev/virtio-ports/cosa-cmdout.)

@jlebon
Member Author

jlebon commented Apr 29, 2024

> CoreOS CI hanging at the cosa fetch --strict step.

OK, latest commit seems to have fixed it! Looked a bit through git log v8.1.3..v8.2.2 in QEMU to see if anything obvious pops out but didn't see anything.

@dustymabe
Member

Since we have to run CI again, maybe let's update tests/containers/tang/Containerfile too.

dustymabe
dustymabe previously approved these changes Apr 30, 2024
@jlebon
Member Author

jlebon commented Apr 30, 2024

OK weird, debugging in the pod, it looks like Prow is still hitting the same hanging issue that I thought 7857488 (#3785) fixed. And even more fun, I can't get this hang to reproduce when running manually in the pod. So I think there's a race somewhere and the commit just made it less likely.

Anyway, this now sounds like possibly some bug when combining virtio-serial and stdio. I think I'll just rework this to use a regular serial device instead of virtio-serial since that's obviously way more battle-tested.

@jlebon jlebon force-pushed the pr/f40-rebase branch 2 times, most recently from d20b066 to f124fa9 Compare May 1, 2024 15:53
@jlebon
Member Author

jlebon commented May 1, 2024

OK, ran out of cycles trying to debug this. I've ended up having to essentially revert 4eb19f4, which is unfortunate. But at least it passes CI in both Prow and CoreOS CI.

> I think I'll just rework this to use a regular serial device instead of virtio-serial

The problem with this is that it doesn't work on all arches. E.g. on aarch64, adding another --serial doesn't create a /dev/ttyAMA1 device.

@jlebon
Member Author

jlebon commented May 1, 2024

Have some work to try to create a minimal/self-contained reproducer to file a bug, but it's proving trickier than expected.

jlebon added 2 commits May 1, 2024 11:59
Some of our upstream CIs (ostree, rpm-ostree) require cosa and FCOS to
be on the same release. Ideally we'd fix that but there's details there
and we want to move cosa anyway.

This is more or less a revert of 4eb19f4.

It seems like QEMU v8.2.2 (in Fedora 40) is hitting issues when
combining virtio-serial ports and the stdio character device. When the
guest writes to the virtio-serial port, it sometimes hangs.

We can look at reverting this patch if it works again in a future
version.
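
For readers unfamiliar with the setup being described, here is a rough sketch of how a named virtio-serial port can be attached to QEMU's stdio character device. It only builds and prints the arguments; it does not launch QEMU, and it is not claimed to match the exact command line cosa uses. The port name `cosa-cmdout` comes from the guest path mentioned earlier in the thread; the chardev id `cmdout0` and the function name are made up for illustration.

```shell
# Sketch only: print (one per line) QEMU arguments that wire a named
# virtio-serial port to QEMU's own stdin/stdout. Nothing is executed.
# "cmdout0" is a hypothetical chardev id chosen for this example.
build_cmdout_args() {
    cmdout_port_name=$1
    cmdout_chardev_id=cmdout0
    printf '%s\n' \
        -device virtio-serial \
        -chardev "stdio,id=${cmdout_chardev_id}" \
        -device "virtserialport,chardev=${cmdout_chardev_id},name=${cmdout_port_name}"
}

# Inside the guest, a port named this way shows up as
# /dev/virtio-ports/cosa-cmdout, which is where the hanging writes happen.
build_cmdout_args cosa-cmdout
```

With this wiring, anything the guest writes to the port lands on QEMU's stdout, which is why the state of QEMU's stdin and stdout on the host side can plausibly matter.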
@jlebon
Member Author

jlebon commented May 1, 2024

Since CI already passed on this, let's just merge it in to unbreak CI and get to any other fallout faster.

@jlebon jlebon merged commit 79b15c8 into coreos:main May 1, 2024
2 of 5 checks passed
@jlebon jlebon deleted the pr/f40-rebase branch May 1, 2024 16:07
@@ -842,6 +845,9 @@ EOF
fi
rc="$(cat "${rc_file}")"

# cleanup tail before nuking dir containing file it's following
kill "$tail_pid"
Member Author

I think there's a potential race here where tail could be killed before it has actually finished printing all the output, even though qemu already exited. A simple fix is to just e.g. sleep 1 or whatever but ughhh. Really wish we could go back to the virtio-serial approach.

jlebon added a commit to jlebon/coreos-assembler that referenced this pull request May 6, 2024
I can't reproduce this locally, but I have a suspicion that `tail` can
exit too quickly in some circumstances, causing truncated output:

openshift/os#1498 (comment)
coreos#3785 (comment)

Rather than having an unconditional `sleep`, let's make it easier to
test that theory by having an env var we can use to make it optional.
Then we'll test that in CI.

Mid-term, I'd like to revert 79b15c8 soon so we can go back to
virtio-serial which is just so much cleaner.
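
A minimal sketch of the idea in the commit message above: an environment variable that optionally delays killing `tail` so it has a chance to drain the file. The variable name `DEMO_TAIL_GRACE_SECS`, the helper names, and the file names are hypothetical; the thread does not give the name actually used in cmdlib.sh.

```shell
# Kill a background tail, optionally waiting first so it can flush.
# DEMO_TAIL_GRACE_SECS is a made-up name; unset or 0 means no delay.
stop_tail() {
    stop_tail_pid=$1
    stop_tail_grace=${DEMO_TAIL_GRACE_SECS:-0}
    if [ "$stop_tail_grace" -gt 0 ]; then
        sleep "$stop_tail_grace"
    fi
    kill "$stop_tail_pid" 2>/dev/null || true
    wait "$stop_tail_pid" 2>/dev/null || true
}

# Self-contained demo: follow a log file, append to it, then stop tail.
demo_tail_grace() {
    demo_dir=$(mktemp -d)
    : > "$demo_dir/cmd.log"
    # -n +1 makes tail print the file from its first line.
    tail -n +1 -F "$demo_dir/cmd.log" > "$demo_dir/seen.log" 2>/dev/null &
    demo_tail_pid=$!
    printf 'first\nlast\n' >> "$demo_dir/cmd.log"
    stop_tail "$demo_tail_pid"
    cat "$demo_dir/seen.log"
    rm -rf "$demo_dir"
}
```

Running `DEMO_TAIL_GRACE_SECS=2 demo_tail_grace` gives `tail` time to catch up before it is killed; with the variable unset, the output may or may not be truncated, which is the suspected race. A fixed delay only makes truncation less likely, it does not rule it out.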
jlebon added a commit that referenced this pull request May 6, 2024
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request May 7, 2024
This is a follow-up to 79b15c8 ("cmdlib.sh: go back to using `tail -F`
for command output") which was subsequently reverted.

To summarize, it seems like in QEMU v8.2 (in f40), the guest would
sometimes hang when writing over virtio-serial if the device is hooked
up to QEMU's stdio.

In testing, removing the `<&-` hack to close QEMU's stdin fixed it for
CoreOS CI but not Prow:

coreos#3785 (comment)

I think I've narrowed it down to CoreOS CI (i.e. Jenkins) allocating a
tty and Prow not. When stdin is not a tty, QEMU immediately gets
EOF if it tries to read anything. I'm not sure exactly what happens, but
I think the virtio-serial hang is linked to this (even though there's no
userspace code in the guest trying to read from the virtio-serial port).

Work around this by explicitly feeding `/dev/zero` to QEMU's stdin.
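
To make the stdin difference concrete, here is a small sketch that does not involve QEMU at all. It uses `/dev/null` as a stand-in for a non-tty stdin that hits EOF right away (an assumption; the thread does not say what Prow actually connects to stdin) and compares it with `/dev/zero`, which never reports EOF. All function names are hypothetical.

```shell
# Report whether the current stdin is a terminal.
stdin_kind() {
    if [ -t 0 ]; then
        echo tty
    else
        echo not-a-tty
    fi
}

# Count how many bytes (at most 4) can be read from stdin before EOF.
readable_bytes() {
    head -c 4 | wc -c | tr -d ' '
}

# Run any command with an endless stream of NUL bytes on stdin, which is
# the shape of the workaround described above.
with_zero_stdin() {
    "$@" < /dev/zero
}

echo "stdin here: $(stdin_kind)"
echo "bytes readable from /dev/null: $(readable_bytes < /dev/null)"   # 0: immediate EOF
echo "bytes readable from /dev/zero: $(with_zero_stdin readable_bytes)"   # 4: never EOF
```

A wrapper like `with_zero_stdin` would sit in front of the QEMU invocation, so a reader on stdin never sees EOF regardless of whether the CI system allocated a tty.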
jlebon added a commit that referenced this pull request May 8, 2024
dustymabe pushed a commit to dustymabe/coreos-assembler that referenced this pull request Sep 27, 2024
(cherry picked from commit bb60451)
dustymabe pushed a commit that referenced this pull request Sep 27, 2024
(cherry picked from commit bb60451)
jlebon added a commit to openshift-cherrypick-robot/coreos-assembler that referenced this pull request Oct 17, 2024
(cherry picked from commit bb60451)
dustymabe pushed a commit that referenced this pull request Oct 17, 2024
(cherry picked from commit bb60451)
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/coreos-assembler that referenced this pull request Oct 17, 2024
(cherry picked from commit bb60451)
4 participants