
ZFS as a module #3107

Merged: 2 commits merged into lf-edge:master on Mar 31, 2023

Conversation

@rouming
Contributor

rouming commented Mar 17, 2023

This partially reverts commit 17dd29b.

With this revert we build ZFS as a module again. Why?
The Linux kernel is licensed under the GPL, but OpenZFS is licensed
under the CDDL. These two licenses are incompatible, so we compile
both (the kernel and ZFS) and ship them separately.

This patch breaks collection of kernel dumps when the /persist
volume is formatted as ZFS. We still jump into the kdump kernel and
output the dmesg to the console, but no dump or logs are stored in
the /persist volume. This will be fixed in a future release of
EVE-OS.

cc: @deitch
cc: @eriknordmark

@eriknordmark
Contributor

eriknordmark left a comment

One question about storage-init, but I'm kicking off the tests in parallel.

@eriknordmark eriknordmark requested a review from rene March 17, 2023 21:03
@rene
Contributor

rene commented Mar 20, 2023

@@ -34,4 +34,6 @@ zfs_set_default_parameters() {
     set_module_parameter zfs zfs_smoothing_write 5
 }
 
+# Load ZFS and set parameters
+modprobe zfs

Now that the ZFS module is loadable again, I'm wondering whether it wouldn't make sense to pass all the parameters at load time instead of using /sys/module, which was only needed because the module used to be built-in...

@rouming
Contributor Author

What is the difference? All the options only start acting when you do the mount. From the kernel's perspective I do not see a big difference in when the static option variables get updated: at modprobe time or just after.
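
For illustration, a minimal sketch of the two approaches being compared, assuming the standard /sys/module/<name>/parameters layout; the parameter name is taken from the diff above:

# Load the module, then write the parameter through sysfs
# (roughly what the existing set_module_parameter helper does)
modprobe zfs
echo 5 > /sys/module/zfs/parameters/zfs_smoothing_write

# Or pass the parameter at load time instead
modprobe zfs zfs_smoothing_write=5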

@rene
Contributor

Well, there isn't much difference in terms of execution, but I don't see any reason to keep two functions (set_module_parameter and zfs_set_default_parameters) that were created exclusively because the module was built-in, when we can just load the module and set the parameters at once. By the way, the right way to load the module and pass these parameters is actually to create a zfs.conf file in /etc/modprobe.d using the format options <module_name> <parameter(s)>; we do this, for instance, in /hostfs/etc/modprobe.d/kms.conf. And to load the zfs module, you can add the module name to /hostfs/etc/modules, unless I'm missing something that justifies this script for the module loading...
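
As a reference, a minimal sketch of what such a configuration could look like; the parameter is taken from the diff above, and the exact file contents are illustrative rather than the final EVE-OS configuration:

# /hostfs/etc/modprobe.d/zfs.conf
options zfs zfs_smoothing_write=5

# /hostfs/etc/modules (one module name per line; append the module to load)
zfs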

@rouming
Contributor Author

So I made an attempt, but it is not that simple with our container model. The description is here: #3107 (comment). I revoke the patch and leave it as it is supposed to be: just a revert in order to fix the licensing problem. I assume we need to revisit this later and fix how we call modprobe so it is done the proper way.

@eriknordmark
Contributor

eriknordmark left a comment

Looks like the riscv build is failing for some reason.

@rouming
Contributor Author

rouming commented Mar 21, 2023

Looks like the riscv build is failing for some reason.

It could be this line: diff -cw .config .config.new, which fails silently. I will take a look.
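
For illustration, a minimal sketch of the suspected failure mode; the surrounding script is hypothetical and only shows why such a diff call can break a build without an obvious error message:

# diff exits with status 1 when .config and .config.new differ
set -e
diff -cw .config .config.new        # a difference aborts the script right here
# common guard when a difference is acceptable and only informational:
diff -cw .config .config.new || true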

@rouming
Contributor Author

rouming commented Mar 21, 2023

Difference to the previous version:

  • Fixed the riscv64 build by adding ZLIB_DEFLATE=m
  • Rebased on latest master

@eriknordmark
Contributor

eriknordmark left a comment

LGTM

@rouming
Contributor Author

rouming commented Mar 22, 2023

Difference to the previous version:

  • Got rid of the custom mechanism for setting parameters of the built-in zfs module through sysfs, in favor of a standard modprobe.d configuration file.

@rouming
Contributor Author

rouming commented Mar 22, 2023

Difference to the previous version:

  • Had to revoke the patch that does the modprobe of zfs instead of the custom script. The problem is that we call modprobe from a container (surprise, surprise) which does not see the original rootfs, and the -C option does not work (the modprobe version is not the GNU one). Fixing this requires time and more changes, which is not the intention of this PR (it is supposed to be a quick revert fix). cc: @rene
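
For illustration of the constraint described above (the paths are assumptions modelled on the /hostfs layout mentioned earlier in this thread, not verified against the actual container image):

# The module options live in the host rootfs (e.g. /hostfs/etc/modprobe.d/zfs.conf),
# but modprobe inside the container only reads its own /etc/modprobe.d.
# Pointing it at the host config would need -C, which this modprobe lacks:
modprobe -C /hostfs/etc/modprobe.d zfs    # -C not supported (not the GNU/kmod modprobe)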

@eriknordmark
Contributor

eriknordmark left a comment

Run eden

@rene
Contributor

rene commented Mar 22, 2023

Difference to the previous version:

* Had to revoke the patch that does the modprobe of zfs instead of the custom script. The problem is that we call modprobe from a container (surprise, surprise) which does not see the original rootfs, and the -C option does not work (the modprobe version is not the GNU one). Fixing this requires time and more changes, which is not the intention of this PR (it is supposed to be a quick revert fix). cc: @rene

Ok, thanks @rouming. Let's discuss it and create a proper issue to handle modprobe in a better (and unified) way.

LGTM.

@eriknordmark
Contributor

eriknordmark left a comment

@rouming I've re-run the eden tests about 7 times and I don't think they have passed for any of the zfs variants at any point.
(But it does work if I use the rootfs to update a device already running with a ZFS /persist.)
So these failures need some investigation to make sure installation and onboarding work correctly with a ZFS /persist.

@rouming
Contributor Author

rouming commented Mar 27, 2023

Difference to the previous version:

  • Rebased on latest master

@eriknordmark according to my understanding the profile server test cases fail for some reason, for example:

time: 2023-03-24T14:24:55.433386041Z out: 	appName local-manager state changed to RUNNING
time: 2023-03-24T14:24:55.433409642Z out: 	appName app-profile-1 state changed to HALTED
time: 2023-03-24T14:24:55.433413542Z out: 	appName app-profile-2 state changed to HALTED
time: 2023-03-24T14:24:55.433416543Z out: 	appName app-profile-1-2 state changed to HALTED
time: 2023-03-24T14:29:04.224650206Z out: 	appName app-profile-1 state changed to BOOTING
time: 2023-03-24T14:29:09.227216468Z out: 	appName app-profile-1 state changed to RUNNING
time: 2023-03-24T14:29:50.265640796Z out: 	appName app-profile-2 state changed to BOOTING
time: 2023-03-24T14:29:54.269975876Z out: 	appName app-profile-2 state changed to RUNNING

so some of the VMs are running, which means that the error is not ZFS related (I assume that if /persist were read-only, could not be mounted, had errors, etc., we could not boot and run VMs at all), but maybe it is a memory/time issue? I will check more precisely.

Update: also

2023-03-24T12:32:57.9577308Z [pool_name:"persist" storage_type:STORAGE_TYPE_INFO_ZFS zfs_version:"zfs-kmod-2.1.2-1" current_raid:STORAGE_RAID_TYPE_NORAID compression_ratio:1 zpool_size:3489660928 storage_state:STORAGE_STATUS_ONLINE disks:{disk_name:{name:"/dev/sda9" serial:"QM00001"} status:STORAGE_STATUS_ONLINE state:"No error."} pool_status_msg:"OK"]

no errors, all good

@eriknordmark
Contributor

no errors, all good

Seems like you're correct that this PR doesn't make it any worse. But glancing at the eden results for other PRs, it seems like 100% of the eden runs for zfs have been failing for a week or two (unrelated to this PR). I have no idea whether this is an issue with flaky test(s) which are somehow more brittle when using zfs compared to ext4, or whether there is an EVE-OS issue affecting ZFS. It feels scary to continue accepting PRs until we know which case it is. So someone needs to root-cause these failures before we proceed with any PRs.

@eriknordmark
Contributor

eriknordmark left a comment

Kick off tests again

rouming added 2 commits March 31, 2023 17:10
Don't do that! The kdump kernel is not for re-formatting the FS, but
for attempting to collect a dump. Even if the FS is corrupted we don't
re-format it. Even if the ZFS modules are not loaded we don't re-format
it.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
This partially reverts commit 17dd29b.

With this revert we build ZFS as a module again. Why?
The Linux kernel is licensed under the GPL, but OpenZFS is licensed
under the CDDL. These two licenses are incompatible, so we compile
both (the kernel and ZFS) and ship them separately.

This patch breaks collection of kernel dumps when the /persist
volume is formatted as ZFS. We still jump into the kdump kernel and
output the dmesg to the console, but no dump or logs are stored in
the /persist volume. This will be fixed in a future release of
EVE-OS.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
@rouming rouming merged commit 37039cd into lf-edge:master Mar 31, 2023
@eriknordmark
Contributor

@rouming note that this never passed any of the Eden tests with zfs (I think 2x(6+8)=28 tries all failed). So I don't know what the impact will be on either product quality or our ability to use the eden tests to verify product quality. Can we please prioritize getting the failing tests fixed? I really hate driving blind.

@rouming
Contributor Author

rouming commented Apr 4, 2023

@eriknordmark I'm running the following locally on the latest master:

./eden config set default --key=eve.disks --value=4
./eden config set default --key=eve.disk --value=4096
./eden config set default --key=eve.ram --value=4096
./eden config set default --key=eve.cpu --value=4
./eden config set default --key eve.tag --value ${EVE_TAG}
./eden config set default --key eve.log-level --value debug

./eden setup -v debug --grub-options='set_global dom0_extra_args "$dom0_extra_args eve_install_zfs_with_raid_level "'
EDEN_TEST_STOP=n ./eden test ./tests/workflow -v debug

(copied from the zfs run in GitHub Actions)

All tests pass successfully. From what I see on GitHub, the failures all have a random nature, e.g. waiting for a VM state until the timeout expires. My simple assumption is that zfs is greedy for resources, and since we don't control the testing environment these failures are just a matter of bad luck. 

@dautovri Can we have more resources on GitHub Actions and somehow guarantee dedicated hardware for our tests?

@dautovri
Member

dautovri commented Apr 4, 2023

@rouming I will create a support ticket to increase the size of the GH runners, plus we can try to set up our own self-hosted GH runners.

@rouming
Contributor Author

rouming commented Apr 4, 2023

@dautovri Thanks, Ruslan! That is definitely what we need. Let's start with increasing the number of runners, and then move to self-hosted GH runners.
