Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

40ignition-ostree: add autosave-xfs transposefs unit #2320

Merged
merged 7 commits into from
Apr 6, 2023

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Mar 22, 2023

Add a new transposefs unit: autosave-xfs. This unit runs after
ignition-disks and ignition-ostree-growfs, but before the restore
transposefs unit.

If the XFS root was grown, it checks if the allocation group count
(agcount) is within a reasonable amount (128 is chosen here). If
it isn't, it saves the rootfs and reformats the filesystem. The
restore unit will then restore it as usual. In the case of in-place
reprovisioning like LUKS (i.e. where the partition table isn't modified
by the Ignition config), the rootfs is still saved only once.

Ideally, instead of adding a new transposefs unit, we would make it
part of the initial save unit. But at that point, there's no way to
tell whether we should autosave without gazing even more deeply into the
Ignition config. We also don't want to unconditionally save the rootfs
when we may not need it.

Closes: coreos/fedora-coreos-tracker#1183

Comment on lines +140 to +147
# Semi-arbitrarily chosen: this is roughly ~64G currently (based on initial
# ag sizing at build time) which seems like a good rootfs size at which to
# discriminate between "throwaway/short-lived systems" and "long-running
# workload systems". It's not like XFS performance is way worse at 128.
if [ "$agcount" -lt 128 ]; then
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandeen, ended up going with 128 as a threshold for this.

cgwalters
cgwalters previously approved these changes Mar 24, 2023
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a read, looks good.

@@ -28,6 +28,8 @@ install() {
systemd-sysusers \
systemd-tmpfiles \
sort \
xfs_info \
xfs_spaceman \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧑‍🚀

@jlebon
Copy link
Member Author

jlebon commented Apr 5, 2023

Rebased for conflicts now! With FCOS releases freshly out, I think we can get this in.

Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM - some suggestions

Prep for automatic XFS reprovisioning.

The only way to know for sure whether the rootfs should be reprovisioned
is analyzing it after the filesystem was grown. We could do calculations
beforehand, but it'd get complex having to analyze the partition table.

Anyway, the partition growing and e.g. LUKS container resizing need
to happen before automatic reprovisioning and ignition-ostree-growfs
already knows how to do that.
No functional change.
Prep for future patch.
Currently, this code only executes via Ignition reprovisioning the
rootfs, but we're about to add code to reprovision the rootfs outside of
that path. In that case, we don't need to query the Ignition config.
Add a new transposefs unit: `autosave-xfs`. This unit runs after
`ignition-disks` and `ignition-ostree-growfs,` but before the `restore`
transposefs unit.

If the XFS root was grown, it checks if the allocation group count
(agcount) is within a reasonable amount (128 is chosen here). If
it isn't, it saves the rootfs and reformats the filesystem. The
`restore` unit will then restore it as usual. In the case of in-place
reprovisioning like LUKS (i.e. where the partition table isn't modified
by the Ignition config), the rootfs is still saved only once.

Ideally, instead of adding a new transposefs unit, we would make it
part of the initial `save` unit. But at that point, there's no way to
tell whether we should autosave without gazing even more deeply into the
Ignition config. We also don't want to unconditionally save the rootfs
when we may not need it.

Closes: coreos/fedora-coreos-tracker#1183
We weren't checking anywhere in the non-reprovisioning case that we grow
the root filesystem on first boot. Add a trivial test for this.
Prep for adding another LUKS where we want the same checks.
Add a new `ext.config.root-reprovision.autosave-xfs` test that checks
that the logic kicks in on a large enough disk. Add a similar
`ext.config.root-reprovision.luks.autosave-xfs` for the LUKS version
of this.

Sanity-check in other reprovisioning tests that autosave-xfs didn't
kick in.
@jlebon
Copy link
Member Author

jlebon commented Apr 6, 2023

Updated for comments!

@@ -202,6 +244,23 @@ case "${1:-}" in
mkdir "${saved_prep}"
fi
;;
autosave-xfs)
should_autosave=$(should_autosave_rootfs)
Copy link
Member

@dustymabe dustymabe Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so (by design) now the script will exit here if an unexpected error occurs (because of set -euo pipefail) and that is how we'll handle the cases we weren't handling before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, exactly. It's no longer part of an if-statement, so the errexit should be respected.

@@ -202,6 +244,23 @@ case "${1:-}" in
mkdir "${saved_prep}"
fi
;;
autosave-xfs)
should_autosave=$(should_autosave_rootfs)
if [ "${should_autosave}" = "1" ]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if [ "${should_autosave}" = "1" ]; then
if [ "${should_autosave}" == "1" ]; then

It always throws me off that a single = works.

You don't need to change this, I'm just calling it out.

Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cgwalters
Copy link
Member

Downstream bug https://issues.redhat.com/browse/OCPBUGS-16157

dustymabe added a commit to dustymabe/fedora-coreos-pipeline that referenced this pull request Jul 27, 2023
This was added to force a reprovision of the root filesystem on a
particular instance we were using for a RHCOS builder, but it's
no longer needed. Now when a system has a disk > 100G the root
filesystem will get reprovisioned automatically by the work done in
coreos/fedora-coreos-config#2320
dustymabe added a commit to coreos/fedora-coreos-pipeline that referenced this pull request Jul 28, 2023
This was added to force a reprovision of the root filesystem on a
particular instance we were using for a RHCOS builder, but it's
no longer needed. Now when a system has a disk > 100G the root
filesystem will get reprovisioned automatically by the work done in
coreos/fedora-coreos-config#2320
@cgwalters
Copy link
Member

Further fallout from this in https://issues.redhat.com/browse/OCPBUGS-16724

cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 24, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

First, rework the reprovision threshold to operate in terms of
disk size, which is much easier to explain and debug than
allocation group count.  (Which to be clear, *is* the real problem,
but disk size is a good enough proxy for this)

Then, bump the reprovision threshold to 1TiB.  This is comfortably
about the default OpenShift node root disk sizes, and returns
us to the prior status quo.
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 24, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

First, rework the reprovision threshold to operate in terms of
disk size, which is much easier to explain and debug than
allocation group count.  (Which to be clear, *is* the real problem,
but disk size is a good enough proxy for this)

Then, bump the reprovision threshold to 1TiB.  This is comfortably
about the default OpenShift node root disk sizes, and returns
us to the prior status quo.
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 24, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.
@cgwalters
Copy link
Member

Followup PR in #2565

cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 25, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 28, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 28, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 28, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 28, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Aug 29, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
cgwalters added a commit that referenced this pull request Aug 29, 2023
The change in #2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
dustymabe pushed a commit to dustymabe/fedora-coreos-config that referenced this pull request Aug 30, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable

(cherry picked from commit 4faba4f)
dustymabe pushed a commit that referenced this pull request Aug 30, 2023
The change in #2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable

(cherry picked from commit 4faba4f)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
The change in coreos#2320
has been very problematic for OpenShift because our default node
configuration is *always* over the threshold, and that causes
significant latency on instance provisioning.

Experimentally bumping to 400 allocation groups, which is about 700GiB.
This is comfortably about the default OpenShift node root disk sizes,
and returns us to the prior status quo.

While we're here, rework the logging a bit so that we *always*
log the `agcount` for debugging purposes.

Also:

- Only log to stdout for normal conditions
- Include the name of the systemd unit in the test description
  so we can cross-reference
- tests: Hoist the expected agcount of 4 to a common variable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CoreOS autoinstall creates huge number of XFS allocation groups
3 participants