add support for --filter-rpath-sanity-libs to skip RPATH sanity check for designated libraries #4119

Merged
merged 32 commits into from
Dec 22, 2022

Conversation

Contributor

@casparvl casparvl commented Nov 2, 2022

fixes #4095

This PR allows a (configurable) list of libraries to be ignored during the RPATH sanity check. This is needed, for example, when cross-compiling an executable that links against libcuda.so.1. NVIDIA supports cross-compilation by letting you compile against a stub library provided by the CUDA toolkit. The stub library only provides the symbols, not the implementation of the functions that should be used at runtime. For example, <prefix>/CUDA/11.7.0/lib/stubs/libcuda.so.1 contains all the symbols of libcuda.so.1, but no implementations. It should therefore never be used at runtime: the libcuda.so.1 that comes with the driver installation is the one to use at runtime.

That's also why we don't RPATH anything that's in a stubs directory (see e.g. #2683 and this snippet of the easybuild-framework code).

The issue is that the sanity_check_rpath() step will still try to unset LD_LIBRARY_PATH and see if all libraries are found correctly. On nodes that have the CUDA driver installed, that works fine, as libcuda.so.1 is picked up from a default location (/usr/lib64/libcuda.so.1). However, on CPU nodes, this fails. The fact that libcuda.so.1 is not found on CPU nodes when cross compiling is entirely legitimate though.

A point of discussion is what EasyBuild's default behaviour should be, i.e. which libraries it should accept by default as 'not found' in the sanity_check_rpath() step. In my opinion, it makes sense to include libcuda.so, libcuda.so.1, libnvidia-ml.so and libnvidia-ml.so.1 here, as these are the libraries for which the CUDA toolkit only provides stubs, so they are supposed to be provided by the driver. For some other libraries (e.g. libcublas.so), there are stubs in the stubs directory, but also real ones in the <prefix>/CUDA/11.7.0/lib/ directory. Thus, I assume one would want those RPATH-ed against the latter location.
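
To make the idea concrete, the filtering could look roughly like the sketch below. This is only an illustration, not the code in easyblock.py: the function name missing_libs, the constant names and the regular expression are assumptions made for the example; only the four library names come from the proposal above.

```python
import re

# Hypothetical default list, using the four libraries proposed in this PR description
DEFAULT_FILTERED_LIBS = ['libcuda.so', 'libcuda.so.1', 'libnvidia-ml.so', 'libnvidia-ml.so.1']

# ldd prints a line like "\tlibfoo.so.1 => not found" for each library it cannot resolve
NOT_FOUND_REGEX = re.compile(r'^\s*(\S+)\s+=>\s+not found', re.M)


def missing_libs(ldd_output, filtered_libs=DEFAULT_FILTERED_LIBS):
    """Return the libraries ldd could not resolve, leaving out the filtered ones."""
    return [lib for lib in NOT_FOUND_REGEX.findall(ldd_output) if lib not in filtered_libs]


# illustrative ldd output, as it might look on a CPU node without the CUDA driver
sample = (
    "\tlinux-vdso.so.1 (0x00007ffeaf974000)\n"
    "\tlibcuda.so.1 => not found\n"
    "\tlibtoy.so => not found\n"
)
print(missing_libs(sample))  # ['libtoy.so']
```

With this shape, a missing libcuda.so.1 no longer fails the check, while any other unresolved library still does.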

Caspar van Leeuwen and others added 3 commits November 1, 2022 18:36
…ty check configurable via a command line argument. Also, use only one regular expression to immediately capture the name of the library that is missing. This simplifies the check to see if it is present in the list of ignorable libraries.
… using --rpath-filter and 2) checks if the rpath sanity check passes if that same library isn't only filtered while setting RPATH (with --rpath-filter), but also ignored when checking in the RPATH sanity check (with --filter-rpath-sanity-libs)
@casparvl
Contributor Author

casparvl commented Nov 2, 2022

For completeness: this is an example of what the drivers look like

ls -al /usr/lib64/libnvidia-ml.so*
lrwxrwxrwx. 1 root root      25 Apr 27  2022 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.515.43.04
lrwxrwxrwx. 1 root root      25 Apr 27  2022 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.515.43.04
-rwxr-xr-x. 1 root root 1683960 Apr 26  2022 /usr/lib64/libnvidia-ml.so.515.43.04

i.e. here, the *.so.1 and *.so are softlinks to the versioned-library

@boegel boegel changed the title Skip rpath check for designated libraries Skip RPATH sanity check for designated libraries Nov 9, 2022
@boegel boegel changed the title Skip RPATH sanity check for designated libraries add support for --filter-rpath-sanity-libs to skip RPATH sanity check for designated libraries Nov 9, 2022
Caspar van Leeuwen added 3 commits November 14, 2022 12:18
…ed in a prefix that contains libtoy. Thus, make it filter the pattern .*ltoy.* so that this always matches, regardless of ltoy.so's prefix
…d libtoy.so. There must be another reason why the test suite was failing to filter this from the linkage before...
@casparvl
Contributor Author

casparvl commented Nov 15, 2022

I'm puzzled by the CI failures of the unit tests... I added this to the test_toy_rpath test:

        # test sanity error when --rpath-filter is used to filter a required library
        toy_ec_txt = read_file(os.path.join(test_ecs, 't', 'toy', 'toy-0.0.eb'))
        toy_ec_txt += "\ndependencies = [('libtoy', '0.0', '', SYSTEM)]"
        toy_ec_txt += "\nbuildopts = '-ltoy'"
        toy_ec = os.path.join(self.test_prefix, 'toy.eb')
        write_file(toy_ec, toy_ec_txt)
        error_pattern = r"Sanity check failed\: Library libtoy\.so not found"
        self.assertErrorRegex(EasyBuildError, error_pattern, self.test_toy_build, ec_file=toy_ec,
                              extra_args=['--rpath', '--experimental', '--rpath-filter=.*libtoy.*'],
                              raise_error=True, verbose=False)

What this is meant to do is:

  • Add libtoy as a dependency
  • Run with --rpath, but then filter libtoy from being RPATH-ed
  • Fail on the sanity_check_rpath, as libtoy "failed" to be RPATH-ed (on purpose)
  • Check that this produces the expected failure message

On my local machine, I run python3 -m test.framework.toy_build rpath (using the skip_rpath_check branch, of course) and it works fine. I've even checked it out on another machine, just to make sure I didn't somehow forget to check in a change.

Yet, in the EasyBuild CI, this produces a failure. If I understand it correctly, this failure tells me that the last line in the above addition did not actually produce an error at all. That puzzles me: how can a --rpath --rpath-filter=.*libtoy.* build not fail the sanity_check_rpath? The only way I can imagine is if libtoy.so is also available somewhere in the default search path, but I have no idea how to check that (if I had the full build logs, I could...)

…this change should only be temporary, to diagnose the issue
@casparvl
Contributor Author

casparvl commented Nov 15, 2022

Hm, for some reason it doesn't seem to pick up on

toy_ec_txt += "\nbuildopts = '-ltoy'"

I guess, since the ldd in the RPATH sanity check shows (see raw log of this CI run):

2022-11-15T09:58:27.4634375Z == 2022-11-15 09:58:26,617 run.py:682 DEBUG Using default regular expression: (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)
2022-11-15T09:58:27.4634882Z == 2022-11-15 09:58:26,617 run.py:215 DEBUG run_cmd: running cmd ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/toy/0.0/bin/toy (in /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/toy/0.0)
2022-11-15T09:58:27.4635239Z == 2022-11-15 09:58:26,618 run.py:234 INFO running cmd: ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/toy/0.0/bin/toy 
2022-11-15T09:58:27.4635652Z == 2022-11-15 09:58:26,662 run.py:648 DEBUG cmd "ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/toy/0.0/bin/toy" exited with exit code 0 and output:
2022-11-15T09:58:27.4635828Z 	linux-vdso.so.1 (0x00007ffeaf974000)
2022-11-15T09:58:27.4636134Z 	libc.so.6 => /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libc.so.6 (0x00007f2b6e347000)
2022-11-15T09:58:27.4636338Z 	/lib64/ld-linux-x86-64.so.2 (0x00007f2b6e540000)

Yet, this looks as intended, which suggests it is picked up:

2022-11-15T09:58:27.4431459Z == 2022-11-15 09:58:26,204 run.py:215 DEBUG run_cmd: running cmd  gcc toy.c -o toy -ltoy (in /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpslLuej/toy/0.0/system-system/toy-0.0)
2022-11-15T09:58:27.4431705Z == 2022-11-15 09:58:26,204 run.py:234 INFO running cmd:  gcc toy.c -o toy -ltoy 
2022-11-15T09:58:27.4432009Z == 2022-11-15 09:58:26,329 run.py:648 DEBUG cmd " gcc toy.c -o toy -ltoy" exited with exit code 0 and output:

@casparvl
Contributor Author

casparvl commented Nov 15, 2022

Ok, this confuses me:

2022-11-15T09:59:20.0294216Z == 2022-11-15 09:58:25,242 run.py:215 DEBUG run_cmd: running cmd ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/libtoy/0.0/bin/toy (in /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/libtoy/0.0)
2022-11-15T09:59:20.0294621Z == 2022-11-15 09:58:25,242 run.py:234 INFO running cmd: ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/libtoy/0.0/bin/toy 
2022-11-15T09:59:20.0295026Z == 2022-11-15 09:58:25,287 run.py:648 DEBUG cmd "ldd /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/libtoy/0.0/bin/toy" exited with exit code 0 and output:
2022-11-15T09:59:20.0295246Z 	linux-vdso.so.1 (0x00007ffe7bba5000)
2022-11-15T09:59:20.0295723Z 	libtoy.so => /tmp/eb-b2lZfe/eb-JYJdDP/eb-DiAzuk/tmpW7VnBk/software/libtoy/0.0/lib/libtoy.so (0x00007fd74f57e000)
2022-11-15T09:59:20.0296201Z 	libc.so.6 => /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libc.so.6 (0x00007fd74f38c000)

It says libtoy is found here, even though the LD_LIBRARY_PATH should be unset for the RPATH sanity check...? I'll have a close look at this section of the log, see if somehow the LD_LIBRARY_PATH is set to something we don't expect...

EDIT: ah, so there is a toy binary that is installed as part of libtoy-0.0.eb, and that's the one above. It's linked to libtoy, and it will find its .so without problems, because $ORIGIN/../lib is always added to RPATH. So, ignore this comment, it has nothing to do with the actual issue.

@casparvl
Contributor Author

Found the issue: Ubuntu passes --as-needed by default to the linker (see https://wiki.ubuntu.com/ToolChain/CompilerFlags#A-Wl.2C--as-needed ). You can see this when you run

echo "int main(void) {}" | gcc $(dpkg-buildflags --get LDFLAGS) -o /dev/null -v -x c - &> /dev/stdout| grep collect

on an Ubuntu system. Most systems will end with a --no-as-needed flag, so that any -lmylib passed by the user on the command line effectively defaults to --no-as-needed behavior. Ubuntu systems don't: they end with a --as-needed flag, which thus applies as the default to any of the -l arguments passed by the user. Annoying, because the ld manual states that "--no-as-needed restores the default behavior", implicitly saying that --no-as-needed is the default.

Of course, I can pass the -Wl,--no-as-needed flag myself in the unit test, but a more robust fix is probably to just use an executable that actually uses a symbol from libtoy, so that even with --as-needed it is still linked.
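
To confirm whether --as-needed dropped the library, one can look at the NEEDED entries in the dynamic section of the binary. Below is a minimal sketch, assuming readelf is available on the system; the helper names (read_dynamic_section, needed_libs) and the sample output are made up for illustration and are not part of the test suite.

```python
import re
import subprocess

# `readelf -d` lists each linked shared library as a NEEDED entry, e.g.:
#  0x0000000000000001 (NEEDED)             Shared library: [libtoy.so]
NEEDED_REGEX = re.compile(r'\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]')


def read_dynamic_section(path):
    """Return the output of `readelf -d` for the given binary (hypothetical helper)."""
    res = subprocess.run(['readelf', '-d', path], capture_output=True, text=True, check=True)
    return res.stdout


def needed_libs(readelf_output):
    """Extract the names of all NEEDED libraries from `readelf -d` output."""
    return NEEDED_REGEX.findall(readelf_output)


# illustrative output for a binary linked with --as-needed that uses no libtoy symbol
sample = """
Dynamic section at offset 0x2e20 contains 24 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""
print('libtoy.so' in needed_libs(sample))  # False: -ltoy was dropped by the linker
```

On a real binary, needed_libs(read_dynamic_section('/path/to/toy')) would show whether libtoy.so ended up in the link.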

Caspar van Leeuwen added 6 commits November 16, 2022 14:40
… This makes sure that regardless of whether --as-needed is passed or not, the library will always be linked by the linker. For now, we still print the full output of the build, so that we can check (once) if it all makes sense now. If it does, I'll do one more commit to limit the test to only the check that parses the error for a certain pattern
…e driver as default libs to be filtered from the RPATH sanity check
akesandgren
akesandgren previously approved these changes Nov 17, 2022
Contributor

@akesandgren akesandgren left a comment

LGTM

@boegel boegel added this to the 4.x milestone Nov 17, 2022
@easybuilders easybuilders deleted a comment from boegelbot Nov 17, 2022
@easybuilders easybuilders deleted a comment from casparvl Nov 17, 2022
@easybuilders easybuilders deleted a comment from boegelbot Nov 17, 2022
@easybuilders easybuilders deleted a comment from boegelbot Nov 17, 2022
@easybuilders easybuilders deleted a comment from boegelbot Nov 17, 2022
error_pattern = r"Sanity check failed\: Library libtoy\.so not found"
self.assertErrorRegex(EasyBuildError, error_pattern, self.test_toy_build, ec_file=toy_ec,
extra_args=['--rpath', '--experimental', '--rpath-filter=.*libtoy.*'],
name='toy2', raise_error=True, verbose=False)
Member

Also run again without --rpath-filter, for completeness' sake?
It's generally a good idea to make sure that the test setup is done fully correctly...

Suggested change
name='toy2', raise_error=True, verbose=False)
args = ['--rpath', '--experimental', '--rpath-filter=.*libtoy.*']
self.assertErrorRegex(EasyBuildError, error_pattern, self.test_toy_build, ec_file=toy_ec,
extra_args=args, name='toy2', raise_error=True, verbose=False)
# works fine if --rpath-filter is not used (since then libtoy is RPATH'ed)
self.test_toy_build(ec_file=toy_ec, name='toy2', extra_args=['--rpath', '--experimental'], raise_error=True)

You will need to clean up the install dir to make sure the next part of the test still works as expected.

Contributor Author

Why do we need to clean? A rebuild will clean the installdir anyway, won't it?

Member

Yes, indeed

# not rpath-ed. Then, we use --filter-rpath-sanity-libs to make sure the RPATH sanity check ignores
# the fact that libtoy.so is not found. Thus, this build should complete successfully
args = ['--rpath', '--experimental', '--rpath-filter=.*libtoy.*', '--filter-rpath-sanity-libs=libtoy.so']
self.test_toy_build(ec_file=toy_ec, name='toy2', extra_args=args, raise_error=True)
Member

Maybe here we should also do a check on the output of readelf -d on the binary, and check the RPATH section with a regex (to make sure it doesn't include libtoy)?
That mainly makes sense if my suggestion above to also run once without --rpath-filter is applied.

Contributor Author

Hm, good idea, but I need some help with this. How do I retrieve the installdir in a test? And what is the appropriate way to run readelf -d in a test, do we use the run_cmd function for that? (if so, I need to find where that is defined again... :))
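
One possible shape for such a check, as a rough sketch only: it calls readelf through subprocess rather than through EasyBuild's own command-running helpers, and the names rpath_entries and read_dynamic_section as well as the sample paths are hypothetical, chosen for the example.

```python
import re
import subprocess

# `readelf -d` reports the run-time search path as an RPATH or RUNPATH entry, e.g.:
#  0x000000000000000f (RPATH)              Library rpath: [/some/dir/lib:$ORIGIN/../lib]
RPATH_REGEX = re.compile(r'\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath):\s+\[([^\]]*)\]')


def read_dynamic_section(path):
    """Return the output of `readelf -d` for the given binary (hypothetical helper)."""
    res = subprocess.run(['readelf', '-d', path], capture_output=True, text=True, check=True)
    return res.stdout


def rpath_entries(readelf_output):
    """Return the individual directories listed in RPATH/RUNPATH entries."""
    entries = []
    for paths in RPATH_REGEX.findall(readelf_output):
        entries.extend(p for p in paths.split(':') if p)
    return entries


# illustrative output: libtoy.so is linked (NEEDED), but its directory is not RPATH-ed
sample = """
 0x0000000000000001 (NEEDED)             Shared library: [libtoy.so]
 0x000000000000000f (RPATH)              Library rpath: [/tmp/example/software/toy/0.0/lib:$ORIGIN/../lib]
"""
assert not any('libtoy' in path for path in rpath_entries(sample))
```

In a test, the same assertion could be applied to the output for the installed binary, once the path to the installation directory is known.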

Caspar van Leeuwen added 2 commits November 18, 2022 19:06
…lter_rpath_sanity_libs, make default RPATH filter as list and have a comment explaining the default
Member

@boegel boegel left a comment

lgtm

@boegel boegel merged commit 1d55137 into easybuilders:develop Dec 22, 2022

Successfully merging this pull request may close these issues.

sanity_check_rpath should not check for libcuda.so.1
3 participants