Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTS: acl/off/posixmode never passes? #16030

Closed
rrevans opened this issue Mar 26, 2024 · 2 comments · Fixed by #16922
Closed

ZTS: acl/off/posixmode never passes? #16030

rrevans opened this issue Mar 26, 2024 · 2 comments · Fixed by #16922
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rrevans
Copy link
Contributor

rrevans commented Mar 26, 2024

I would like help debugging why I'm unable to make acl/off/posixmode pass when running ZTS locally.

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 39
Kernel Version 6.6.8-200.fc39.x86_64
Architecture x86-64
OpenZFS Version latest (8cd8ccc), DEBUG

Describe the problem you're observing

ZTS test tests/functional/acl/off/posixmode.ksh always fails on my setup.

$ ./scripts/zfs-tests.sh -vx -r tmp.run

--- Cleanup ---
Removing pool(s):     
Removing all dm(s):   
Removing loopback(s): 
Removing files(s):    

--- Configuration ---
Runfiles:        tmp.run
STF_TOOLS:       /home/vm37user/zfs/tests/test-runner
STF_SUITE:       /home/vm37user/zfs/tests/zfs-tests
STF_PATH:        /home/vm37user/zfs/tests/zfs-tests/bin
FILEDIR:         /var/tmp
FILES:           /var/tmp/file-vdev0 /var/tmp/file-vdev1 /var/tmp/file-vdev2
LOOPBACKS:       /dev/loop0 /dev/loop1 /dev/loop2
DISKS:           loop0 loop1 loop2
NUM_DISKS:       3
FILESIZE:        4G
ITERATIONS:      1
TAGS:            functional
STACK_TRACER:    no
Keep pool(s):    rpool
Missing util(s): 

/home/vm37user/zfs/tests/test-runner/bin/test-runner.py    -c "tmp.run" -T "functional" -i "/home/vm37user/zfs/tests/zfs-tests" -I "1"
Test: /home/vm37user/zfs/tests/zfs-tests/tests/functional/acl/off/setup (run as root) [00:01] [PASS]
Test: /home/vm37user/zfs/tests/zfs-tests/tests/functional/acl/off/posixmode (run as root) [00:00] [FAIL]
Test: /home/vm37user/zfs/tests/zfs-tests/tests/functional/acl/off/cleanup (run as root) [00:01] [PASS]

Results Summary
PASS	   2
FAIL	   1

Running Time:	00:00:02
Percent passed:	66.7%
Log directory:	/var/tmp/test_results/20240326T080030

Tests with results other than PASS that are expected:

Tests with result of PASS that are unexpected:

Tests with results other than PASS that are unexpected:
    FAIL acl/off/posixmode (expected PASS)

this is the runfile I'm using:

[DEFAULT]
pre = setup
quiet = False
pre_user = root
user = root
timeout = 600
post_user = root
post = cleanup
failsafe_user = root
failsafe = callbacks/zfs_failsafe
outputdir = /var/tmp/test_results
tags = ['functional']

[tests/functional/acl/off]
tests = ['posixmode']
tags = ['functional', 'acl']

The test fails in the tmpfs precheck part of the test:

$ cat /var/tmp/test_results/20240326T080030/off/posixmode/stdout
ASSERTION: Verify POSIX mode bits function correctly
SUCCESS: mount -t tmpfs tmp /var/tmp/tmp.sJ7oJMw6FP
SUCCESS: chmod 777 /var/tmp/tmp.sJ7oJMw6FP
SUCCESS: mkdir /var/tmp/tmp.sJ7oJMw6FP/dir
SUCCESS: chown :root /var/tmp/tmp.sJ7oJMw6FP/dir
SUCCESS: chmod 007 /var/tmp/tmp.sJ7oJMw6FP/dir
SUCCESS: touch /var/tmp/tmp.sJ7oJMw6FP/dir/file
SUCCESS: chown :root /var/tmp/tmp.sJ7oJMw6FP/dir/file
total 0
d------rwx. 2 root root 60 Mar 26 08:00 .
drwxrwxrwx. 3 root root 60 Mar 26 08:00 ..
-rw-r--r--. 1 root root  0 Mar 26 08:00 file
SUCCESS: ls -la /var/tmp/tmp.sJ7oJMw6FP/dir
SUCCESS: rm /var/tmp/tmp.sJ7oJMw6FP/dir/file
SUCCESS: touch /var/tmp/tmp.sJ7oJMw6FP/dir/file
SUCCESS: chown :root /var/tmp/tmp.sJ7oJMw6FP/dir/file
NOTE: user: staff2
NOTE: cmd: rm /var/tmp/tmp.sJ7oJMw6FP/dir/file
NOTE: out: 
NOTE: err: rm: cannot remove '/var/tmp/tmp.sJ7oJMw6FP/dir/file': Permission denied
ERROR: user_run staff2 rm /var/tmp/tmp.sJ7oJMw6FP/dir/file exited 1

It seems that on my setup for some reason g=--- disallows staff2, but I don't understand what's going on or why this would pass on other systems but not Fedora (SELinux? Something different about tmpfs in Linux 6.6?)

Describe how to reproduce the problem

Using the runfile above:

$ cat >/tmp.run
... the runfile contents ...
$ scripts/zfs-tests.sh -vx -r tmp.run

Include any warning/errors/backtraces from the system logs

Nothing of note

@rrevans rrevans added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 26, 2024
@mcmilk
Copy link
Contributor

mcmilk commented Apr 19, 2024

The qemu testings on Fedora 39 seems to run the acl/off/posixmode within the distributions default settings. I have not touched selinux and so on.

But the testings on Fedora 39 are running out of time, while Fedora 38 runs all tests in ~4h ...
I will add Fedora 40 also...

@rrevans
Copy link
Contributor Author

rrevans commented Jan 3, 2025

Some progress:

TL;DR my script uses parallel build (make -j32) which causes test path binary symlinks to have the wrong user. Then the scripts create the user in group root which causes the test to fail.

The root cause is that scripts/Makefile.am does not list any dependencies for installing symlinks in the test path, so this step can run before the build is actually done.

zfs/scripts/Makefile.am

Lines 81 to 83 in d35f9f2

ALL_LOCAL += scripts-all-local
scripts-all-local: %D%/common.sh
-SCRIPT_COMMON=$< $(srcdir)/%D%/zfs-tests.sh -c

A subsequent sudo make install step in my script re-runs this and installs the missing files as root.

1. The test fails for me because staff2 is in group root

as found out by adding this patch to the tests:

diff --git a/tests/zfs-tests/include/commands.cfg b/tests/zfs-tests/include/commands.cfg
--- a/tests/zfs-tests/include/commands.cfg
+++ b/tests/zfs-tests/include/commands.cfg
@@ -40,6 +40,7 @@ export SYSTEM_FILES_COMMON='awk
     getent
     getfacl
     grep
+    groups
     gunzip
     gzip
     head
diff --git a/tests/zfs-tests/tests/functional/acl/off/posixmode.ksh b/tests/zfs-tests/tests/functional/acl/off/posixmode.ksh
--- a/tests/zfs-tests/tests/functional/acl/off/posixmode.ksh
+++ b/tests/zfs-tests/tests/functional/acl/off/posixmode.ksh
@@ -83,6 +83,7 @@ function test_posix_mode # base
 
        log_must touch $file
        log_must chown :$wheel $file
+       log_must user_run $other groups
        log_must user_run $other rm $file
 
        # file owned by user

then the test produces test_results/current/off/posixmode/stdout:

NOTE: user: staff2
NOTE: cmd: groups
NOTE: out: zfsgrp root
NOTE: err: 
SUCCESS: user_run staff2 groups

The file is created with mode 007, and tmpfs is correctly denying access based on staff2 being in group root. The test would probably pass if staff2 were only in group zfsgrp.

2. Why is user staff2 in group root?

Looking into how users are created, I found this:

# Add new users to the same group and the command line utils.
# This allows them to be run out of the original users home
# directory as long as it permissioned to be group readable.
cmd_group=$(stat --format="%G" $(command -v zfs))

The script is adding the user to the group that owns the zfs file in the test setup bin directory (part of the constrained path) to ensure that the users can run test binaries.

And sure enough for my setup that's owned by root for some reason:

$ ls -l tests/zfs-tests/bin/zfs
lrwxrwxrwx. 1 root root 22 Jan  3 09:08 tests/zfs-tests/bin/zfs -> /home/vm37user/zfs/zfs

3. Why is zfs binary in the constrained path owned by root?

I'm doing approximately this script to create a clean VM image before running tests:

set -ex
make distclean || true
./autogen.sh
./configure --enable-debug --enable-debuginfo
make -j32
sudo ./scripts/zfs.sh
sudo make install

make distclean removes all links in tests/zfs-tests/bin and after this script tests/zfs-tests/bin/zfs is owned by root, so this is a clean reproducer.

Looking very closely:

  • The make -j32 step creates most files in tests/zfs-tests/bin but not tests/zfs-tests/bin/zfs.
  • Then sudo make install creates the zfs link, but it ends up owned by root.

If the make step is replaced with make -j32 all all-am, then the zfs link is created owned by the build user, and the test passes as expected. The test also passes if make is called without any -j flag (without all-am target).

Digging into the Makefile:

  • the constrained path is created in the scripts-all-local target which runs scripts/zfs-tests.sh -c
  • this is run from all-local target by automake in the all-am target created by autogen.sh
  • the automake all-recursive target runs all-am after doing make all in each subdirectory
  • the all-am target depends on all-local
  • make without any target implies all which builds all-recursive

The culprit appears to building with parallelism.

make -j32 builds the all-am targets in parallel. Makefile contains:

all-am: Makefile $(PROGRAMS) $(LTLIBRARIES) $(SCRIPTS) $(MANS) $(DATA) \
                $(HEADERS) zfs_config.h all-local

Once the build winds down far enough to start linking binaries, the available parallelism considers other targets including all-local. This then installs the constrained path before the zfs binaries are built.

In my script sudo make install also does makes all-am before installing. But by this point the binaries now exist, so scripts/zfs-tests.sh -c creates the symlink but owned by root.

Staring very closely at the make -j32 output:

  CC [M]  /home/vm37user/zfs/module/lua/lfunc.o
ln -s -f zfs contrib/bash_completion.d/zpool
SCRIPT_COMMON=scripts/common.sh ./scripts/zfs-tests.sh -c
  CC       contrib/pam_zfs_key/pam_zfs_key_la-pam_zfs_key.lo
  CC [M]  /home/vm37user/zfs/module/lua/lgc.o
cd contrib/pyzfs && /usr/bin/python3.12 setup.py -q  egg_info -e . build
  CCLD     libicp.la
  CCLD     libunicode.la
  CCLD     libspl_assert.la
... 377 lines omitted ... 
  CCLD     mount.zfs
  CCLD     zfs_ids_to_path
  CCLD     zfs
  CCLD     zinject
  CCLD     zpool
  CCLD     zed

The rules for scripts-all-local do not depend on any binary targets, so make believes all of its dependencies are satisfied too early when using parallel builds.

rrevans added a commit to rrevans/zfs that referenced this issue Jan 3, 2025
The explicit dependencies force make to build all outputs before
running the script to populate `tests/zfs-tests/bin`. Really this only
needs to depend on targets in `tests/zfs-tests/include/commands.cfg`
but waiting for all primary targets is more straightforward.

Without this, parallel make may install symlinks before all outputs
are built. This is problematic if also doing `sudo make install` as
some files end up owned by root. This confuses ZTS into adding
non-root test users to the root group causing tests to break.

Fixes openzfs#16030

Signed-off-by: Robert Evans <evansr@google.com>
rrevans added a commit to rrevans/zfs that referenced this issue Jan 3, 2025
This updates the Makefile to be more correct for parallel make.

Fixes openzfs#16030

Signed-off-by: Robert Evans <evansr@google.com>
behlendorf pushed a commit that referenced this issue Jan 4, 2025
This updates the Makefile to be more correct for parallel make.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #16030
Closes #16922
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jan 4, 2025
This updates the Makefile to be more correct for parallel make.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Robert Evans <evansr@google.com>
Closes openzfs#16030
Closes openzfs#16922
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Jan 26, 2025
This updates the Makefile to be more correct for parallel make.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Robert Evans <evansr@google.com>
Closes openzfs#16030
Closes openzfs#16922
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants