
ASSERT and CRASH with two clients: regression in 8.0 #4501

Closed
derekbruening opened this issue Oct 29, 2020 · 24 comments · Fixed by #6976

@derekbruening
Contributor

$ cd /work/dr/releases/rel-7.1.0/DynamoRIO-Linux-7.1.0-1
$ bin64/drrun -client samples/bin64/libbbcount.so 0 "" -client samples/bin64/libbbsize.so 1 "" -- ls
Client bbcount is running
Client bbsize is running
ACKNOWLEDGEMENTS  bin32  bin64	cmake  docs  drmemory  dynamorio  ext  include	lib32  lib64  License.txt  logs  README  samples  tools
Number of basic blocks seen: 4694
               Maximum size: 36 instructions
               Average size:   4.9 instructions

Instrumentation results:
    166909 basic block executions
       891 basic blocks needed flag saving
      3803 basic blocks did not

But:

$ cd /work/dr/releases/rel-8.0.0/DynamoRIO-Linux-8.0.0-1
$ bin64/drrun -client samples/bin64/libbbcount.so 0 "" -client samples/bin64/libbbsize.so 1 "" -- ls
<Application /usr/bin/ls (552079).  DynamoRIO internal crash at PC 0x000000007113730d.  Please report this at http://dynamorio.org/issues/.  Program aborted.
Received SIGSEGV at unknown pc 0x000000007113730d in thread 552079
Base: 0x0000000071000000
Registers:eax=0x00007fe32d851860 ebx=0x00007fe32d85cc80 ecx=0x00007ffe34cfca1f edx=0x00007fe3ae08d8aa
	esi=0x00007fe3ae08b6e8 edi=0x0000000000000000 esp=0x00007ffe34cfca10 ebp=0x0000000000000000
	r8 =0x000000000000118e r9 =0x0000000000000007 r10=0x0000000000000043 r11=0x0000000000217a08
	r12=0x00007fe3ae08d8aa r13=0x00007fe32d851860 r14=0x00007ffe34cfca1f r15=0x0000000000000000
	eflags=0x0000000000010206
version 8.0.0, build 1
-no_dynamic_options -client_lib '/work/dr/releases/rel-7.1.0/DynamoRIO-Linux-7.1.0-1/rel-8.0.0/DynamoRIO-Linux-8.0.0-1/samples/bin64/libbbcount.so;0;;/work/dr/releases/rel-7.1.0/DynamoRIO-Linux-7.1.0-1/rel-8.0.0/DynamoRIO-Linux-8.0.0-1/samples/bin64/libbbsize.so;1;' -code_api -stack_size 56K -signal_stack_size 32K -max_e
0x00000000713d849d 0x6974706f5f63696d>

Debug build reports this assert:

  #6  0x00000000710dc095 in d_r_internal_error (file=0x713983c8 "/home/travis/build/DynamoRIO/dynamorio/core/unix/module_elf.c", line=1131, 
    expr=0x7139919b "pd != NULL && name != NULL") at /home/travis/build/DynamoRIO/dynamorio/core/utils.c:176
#7  0x00000000712f7445 in module_lookup_symbol (sym=0x7fffb47889a8, pd=0x0) at /home/travis/build/DynamoRIO/dynamorio/core/unix/module_elf.c:1131
#8  0x00000000712f8616 in module_relocate_symbol (rel=0x7fffb478ab08, pd=0x7fff33f5cf30, is_rela=true)
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/module_elf.c:1523
#9  0x00000000712f888a in module_relocate_rela (modbase=0x7fffb4788000 "\177ELF\002\001\001", pd=0x7fff33f5cf30, start=0x7fffb478aaf0, 
    end=0x7fffb478b330) at /home/travis/build/DynamoRIO/dynamorio/core/unix/module_elf.c:1595
#10 0x00000000712edee6 in privload_relocate_os_privmod_data (opd=0x7fff33f5cf30, mod_base=0x7fffb4788000 "\177ELF\002\001\001")
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:1125
#11 0x00000000712ee0ae in privload_relocate_mod (mod=0x7fff33f5ccf8) at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:1159
#12 0x00000000712ec614 in privload_process_imports (mod=0x7fff33f5ccf8) at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:587
#13 0x00000000711d8fe8 in privload_load_process (privmod=0x7fff33f5ccf8) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:827
#14 0x00000000711d889f in privload_load (filename=0x7fffffffb710 "/home/bruening/dr/releases/DynamoRIO-Linux-8.0.0-1/ext/lib64/debug/libdrmgr.so", 
    dependent=0x7fff33f5c798, client=true) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:695
#15 0x00000000712ecbc9 in privload_locate_and_load (impname=0x7fffb457cbcf "libdrmgr.so", dependent=0x7fff33f5c798, reachable=true)
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:682
#16 0x00000000712ec5b3 in privload_process_imports (mod=0x7fff33f5c798) at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:567
#17 0x00000000711d8fe8 in privload_load_process (privmod=0x7fff33f5c798) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:827
#18 0x00000000711d889f in privload_load (filename=0x7fffffffba20 "/home/bruening/dr/releases/DynamoRIO-Linux-8.0.0-1/ext/lib64/debug/libdrreg.so", 
    dependent=0x7fff33f5c238, client=true) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:695
#19 0x00000000712ecbc9 in privload_locate_and_load (impname=0x7fffb436a855 "libdrreg.so", dependent=0x7fff33f5c238, reachable=true)
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:682
#20 0x00000000712ec5b3 in privload_process_imports (mod=0x7fff33f5c238) at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:567
#21 0x00000000711d8fe8 in privload_load_process (privmod=0x7fff33f5c238) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:827
#22 0x00000000711d889f in privload_load (filename=0x7fffffffbd30 "/home/bruening/dr/releases/DynamoRIO-Linux-8.0.0-1/ext/lib64/debug/libdrx.so", 
    dependent=0x7fff33f4a308, client=true) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:695
#23 0x00000000712ecbc9 in privload_locate_and_load (impname=0x7fffb411d839 "libdrx.so", dependent=0x7fff33f4a308, reachable=true)
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:682
#24 0x00000000712ec5b3 in privload_process_imports (mod=0x7fff33f4a308) at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:567
#25 0x00000000711d8fe8 in privload_load_process (privmod=0x7fff33f4a308) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:827
#26 0x00000000711d7169 in privload_process_early_mods () at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:141
#27 0x00000000711d7346 in loader_init_epilogue (dcontext=0x7fff33f56f40) at /home/travis/build/DynamoRIO/dynamorio/core/loader_shared.c:211
#28 0x0000000071052844 in dynamorio_app_init () at /home/travis/build/DynamoRIO/dynamorio/core/dynamo.c:638
#29 0x00000000712efc4d in privload_early_inject (sp=0x7fffffffdd10, 
    old_libdr_base=0x7ffff797f000 <error: Cannot access memory at address 0x7ffff797f000>, old_libdr_size=6815744)
    at /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:2012

Along with fixing this, we need to add a test of multiple clients.
We have tests of registering multiple clients but apparently not of running multiple
clients, which is a little surprising. OTOH this is a rarely used feature these days:
instead we refactor clients into libraries for explicit coordination.

@AssadHashmi
Contributor

OTOH this is a rarely used feature these days:
instead we refactor clients into libraries for explicit coordination.

Can you elaborate on this method please?

AIUI the multiple clients feature is the best (only?) method for the following use-case: a proprietary/closed-source client running in conjunction with an open-source client.

@derekbruening
Contributor Author

OTOH this is a rarely used feature these days:
instead we refactor clients into libraries for explicit coordination.

Can you elaborate on this method please?

AIUI the multiple clients feature is the best (only?) method for the following use-case: a proprietary/closed-source client running in conjunction with an open-source client.

Mixing closed-source and open-source should work just as well with libraries. The library interface can easily be a binary interface with the implementation a black box.

If the two clients are truly self-contained and separate and the output from one won't be used by the other, running them as separate top-level clients can make sense. But when trying to combine features into a single tool, libraries seem the way to go, allowing clean programmatic control over the internals.

Running completely separate clients/tools does seem to happen quite a bit for things like drcachesim address trace analysis, where different analyses run at once. But there we generally do offline analysis, and the analysis tools are not DR clients: they have none of the concerns, such as scratch registers or resource isolation, that can make DR clients difficult to combine or run simultaneously. But, at least in our usage, it seems rare to run two separate DR clients at once: do you end up doing that?

For combining features, a library interface with explicit API calls generally works better than having separate top-level clients. Separating out concerns, isolating functionality, and creating modular units makes these libraries more re-usable. E.g., where possible it seems better to have a library function that takes in a scratch register, rather than a function that goes and obtains its own scratch registers: pushing all responsibility for which scratches to use up to the top-level client avoids conflicts and confusion over the scratch strategy. Similarly, decisions about logging and error handling are best pushed up to a single top-level client.
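
To make the pattern concrete, here is a minimal sketch of such a library routine (x86-only; the countlib_* name is hypothetical, and the caller owns the scratch-register and flag-preservation strategy, e.g. via drreg):

#include "dr_api.h"

/* Hypothetical library routine in the style described above: it takes
 * the scratch register from the caller instead of reserving its own.
 * The caller is also responsible for preserving arithmetic flags
 * around the inserted inc (e.g. via drreg). */
static void
countlib_insert_inc(void *drcontext, instrlist_t *ilist, instr_t *where,
                    reg_id_t scratch, uint64 *counter)
{
    /* Load &counter into the caller-provided scratch register. */
    instrlist_insert_mov_immed_ptrsz(drcontext, (ptr_int_t)counter,
                                     opnd_create_reg(scratch),
                                     ilist, where, NULL, NULL);
    /* Increment the 64-bit counter in memory. */
    instrlist_meta_preinsert(ilist, where,
                             INSTR_CREATE_inc(drcontext,
                                              OPND_CREATE_MEM64(scratch, 0)));
}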

Thus, instead of just having a drcov code coverage client that is run at the top level alongside others, we have a drcovlib code coverage library that can be added to any client (and is used by drmemory for fuzzing features). Instead of just a drsyscall or drstrace client, we have a drsyscall extension library.

@AssadHashmi
Contributor

Thanks for the detailed answer @derekbruening.

But, at least in our usage, it seems rare to run two separate DR clients at once: do you end up doing that?

Yes, frequently. A typical case is a user running their instrumentation client with our closed-source emulation client.
It's not immediately obvious to me how we could easily get around this requirement without running two separate DR clients at once.

@derekbruening
Contributor Author

But, at least in our usage, it seems rare to run two separate DR clients at once: do you end up doing that?

Yes, frequently. A typical case is a user running their instrumentation client with our closed-source emulation client.
It's not immediately obvious to me how we could easily get around this requirement without running two separate DR clients at once.

I assume your emulation client has a simple interaction with other clients: it only has an app2app transformation and nothing else. I also assume it could alternatively be a library that a user's instrumentation client would link with, instead of being a separate client.
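
As an illustration of that alternative, a rough sketch of the library shape (emul_app2app_event and its library are hypothetical; drmgr_register_bb_app2app_event is the real drmgr API):

#include "dr_api.h"
#include "drmgr.h"

/* Hypothetical entry point exported by the emulation library. */
dr_emit_flags_t
emul_app2app_event(void *drcontext, void *tag, instrlist_t *bb,
                   bool for_trace, bool translating);

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    drmgr_init();
    /* The single top-level client registers the emulation pass itself,
     * making its ordering relative to the client's own instrumentation
     * explicit. */
    if (!drmgr_register_bb_app2app_event(emul_app2app_event, NULL))
        DR_ASSERT(false);
}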

@yury-khrustalev
Contributor

I would like to add another use case for using several top-level clients: one is a compulsory emulation client which is necessary for the application to run; the other is an optional client which a user can add for additional features (e.g. memory tracing, instruction counting, performance analysis, etc.).

It might be easier to develop separate optional clients to enable different types of analysis (and also to allow users to develop their own clients following a set of restrictions built on top of those for all DR clients).

Yes, they can all be linked against the emulation library, but one may still want to use several such clients together, in which case each of them being linked against the emulation library creates a problem for it (the library would have to handle multiple requests).

I think supporting several clients is a very useful feature. We can still use the libraries approach, but when more convenient we can allow users to select which clients they are loading.

@yury-khrustalev
Contributor

Debug build reports this assert:

@derekbruening, could you please explain how you obtained this log? It looks very much like a call trace. I have always struggled with debug assert reports because they point to a place in the code but don't produce a Java-like stack trace. Thank you.

@derekbruening
Contributor Author

Debug build reports this assert:

@derekbruening, could you please explain how you obtained this log? It looks very much like a call trace. I have always struggled with debug assert reports because they point to a place in the code but don't produce a Java-like stack trace. Thank you.

Yes, it is just a backtrace in gdb. Likely you are hitting one or both of two problems in gdb: 1) gdb does not do a good job of walking the callstack when the current $pc is not in a known function, such as when it's in generated code, even when a simple frame pointer walk would work; 2) DR and its libraries are not loaded by ld.so, so gdb does not find the symbols. For 1) see https://github.com/DynamoRIO/dynamorio/wiki/Debugging#call-stacks; for 2) see https://github.com/DynamoRIO/dynamorio/wiki/Debugging#loading-client-symbols. I also have a python gdb helper that automates symbols for DR itself; I should check it in.

@derekbruening
Contributor Author

There's also libdynamorio.so-gdb.py but it has been broken for a long time: #2100.

@derekbruening
Contributor Author

I think supporting several clients is a very useful feature. We can still use the libraries approach, but when more convenient we can allow users to select which clients they are loading.

Sure, SGTM. Let's get a test added to avoid future regressions.

@derekbruening
Contributor Author

I also have a python gdb helper that automates symbols for DR itself; I should check it in.

PR #4505. Really the drsymload should be integrated into libdynamorio.so-gdb.py: part of reviving that script which is #2100.

@AssadHashmi
Contributor

I assume your emulation client has a simple interaction with other clients: it only has an app2app transformation and nothing else. I also assume it could alternatively be a library that a user's instrumentation client would link with, instead of being a separate client.

The main interaction between clients is via drmgr_is_emulation_start(), drmgr_is_emulation_end() and drmgr_get_emulated_instr_data().

The distinct separation at runtime of a collection of instrumentation clients from a collection of emulation clients is convenient, not least because we can have more than one emulation client to run with an instrumentation client, with the selection of which emulation client to load based on hardware capability.
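
For reference, the consuming side of that interaction looks roughly like this (a sketch only, with registration of other passes and error handling omitted):

#include "dr_api.h"
#include "drmgr.h"

static dr_emit_flags_t
event_insertion(void *drcontext, void *tag, instrlist_t *bb, instr_t *instr,
                bool for_trace, bool translating, void *user_data)
{
    if (drmgr_is_emulation_start(instr)) {
        /* Instrument the original (emulated) instruction, not the
         * replacement sequence that follows the marker. */
        emulated_instr_t emulated;
        emulated.size = sizeof(emulated);
        if (drmgr_get_emulated_instr_data(instr, &emulated)) {
            /* e.g. record emulated.pc or inspect emulated.instr */
        }
        return DR_EMIT_DEFAULT;
    }
    if (drmgr_is_emulation_end(instr))
        return DR_EMIT_DEFAULT;
    /* ... instrument non-emulated instructions as usual ... */
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    drmgr_init();
    drmgr_register_bb_instrumentation_event(NULL, event_insertion, NULL);
}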

@yury-khrustalev
Contributor

I'm trying to get to the bottom of it (still in progress). It looks like os_privmod_data is not set for the first client mentioned on the command line (apparently privload_create_os_privmod_data is not called for it).

@derekbruening
Contributor Author

One approach would be a git bisect between release_7.1.0 and release_8.0.0-1.

@yury-khrustalev
Contributor

yury-khrustalev commented Nov 5, 2020

Found: 9293e7a is the first bad commit.

bin64/drrun -client api/bin/libbbcount.so 0 "" -client api/bin/libbbsize.so 1 "" -- ls

results in

<Starting application /usr/bin/ls (22649)>
<Initial options = -no_dynamic_options -client_lib 'build/api/bin/libbbcount.so;0;;build/api/bin/libbbsize.so;1;' -code_api -stack_size 64K -signal_stack_size 64K -max_elide_jmp 0 -max_elide_call 0 -vmm_block_size 64K -initial_heap_unit_size 192K -initial_heap_nonpers_size 192K -initial_global_heap_unit_size 192K -heap_commit_increment 64K -cache_commit_increment     64K -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Application /usr/bin/ls (22649).  Internal Error: DynamoRIO debug check failure: core/unix/module_elf.c:1120 pd != NULL && name != NULL
(Error occurred @0 frags)
version 7.91.18282, custom build
-no_dynamic_options -client_lib 'build/api/bin/libbbcount.so;0;;build/api/bin/libbbsize.so;1;' -code_api -stack_size 64K     -signal_stack_size 64K -max_elide_jmp 0 -max_elide_call 0 -vmm_block_size 64K -initial_heap_unit_size 192K -initial_hea
0x0000ffffdddb6b70 0x0000ffffb3f1991c
0x0000ffffdddb6d10 0x0000ffffb417fd44
0x0000ffffdddb6f40 0x0000ffffb4180fdc
0x0000ffffdddb7070 0x0000ffffb4181268
0x0000ffffdddb70e0 0x0000ffffb4175c14
0x0000ffffdddb7120 0x0000ffffb4175de4
0x0000ffffdddb7140 0x0000ffffb41743d8
0x0000ffffdddb7170 0x0000ffffb4020b88
0x0000ffffdddb71d0 0x0000ffffb40203b8
0x0000ffffdddb71f0 0x0000ffffb41749a4
0x0000ffffdddb7260 0x0000ffffb4174364
0x0000ffffdddb7490 0x0000ffffb4020b88
0x0000ffffdddb74f0 0x0000ffffb401e868
0x0000ffffdddb7520 0x0000ffffb401eab4
0x0000ffffdddb7760 0x0000ffffb3e8937c
build/api/bin/libbbsize.so=0x000000004c8e0000
build/ext/lib64/debug/libdrmgr.so=0x0000000073000000
build/api/bin/libbbcount.so=0x0000000072000000>

@derekbruening, I would be grateful if you could look at it because the diff is quite complex. Maybe you could point me to some place in the diff? Thank you.

@yury-khrustalev
Contributor

When privload_process_imports is called for the first time (for the last client on the command line; let's say it's libfoo.so) from privload_load_process, it ends up walking through the dependencies of libfoo.so (which could be libdrreg.so and libdrmgr.so). This leads to an access to the os_privmod_data of the first client on the command line, for which this field has not yet been created; it is only created later, during the next iteration of the for loop in privload_process_early_mods.

It happens because during module_lookup_symbol, which is transitively called from privload_process_imports (see stack trace above), we do:

mod = privload_first_module();

but of course this first module has not been fully initialised yet. In fact, only one of all the modules has been initialised by this point.

Somehow we should call privload_create_os_privmod_data for all modules before importing modules, though I'm not sure how to do it.

@yury-khrustalev
Contributor

yury-khrustalev commented Apr 21, 2022

I just tried 9.0.19097 and it seems this issue is still present, coming from the same place in core/unix/module_elf.c. Now one has to use -client options to introduce several clients, and it fails with

<Application /home/user/devel/some/build/hello (17190).  DynamoRIO internal crash at PC 0x0000ffff8b47cfc4.  Please report this at http://dynamorio.org/issues/.  Program aborted.
Received SIGSEGV at unknown pc 0x0000ffff8b47cfc4 in thread 17190
Base: 0x0000ffff8b300000
Registers:	eflags=0x0000000020000000
version 9.0.0, custom build
-no_dynamic_options -client_lib '/home/user/devel/some/build/libone/libone.so;0;;/home/user/devel/some/build/libtwo/libtwo.so;1;' -client_lib64 '/home/user/devel/some/build/libone/libone.so;0;;/home/user/devel/some/build/libtwo/libtwo.so;1;' -code_api -stack_size 64K -signal_stack_size 64K -max_el
0x0000ffffcdfd76a0 0x0000ffff8b47e5d8
0x0000ffffcdfd77b0 0x0000ffff8b477d90
0x0000ffffcdfd7820 0x0000ffff8b478bf8
0x0000ffffcdfd7840 0x0000ffff8b36d7f8
0x0000ffffcdfd7880 0x0000ffff8b477a60
0x0000ffffcdfd78e0 0x0000ffff8b478b8c
0x0000ffffcdfd7b40 0x0000ffff8b36d9b0
0x0000ffffcdfd7b90 0x0000ffff8b31ba84
0x0000ffffcdfd7de0 0x0000ffff8b479bcc
0x0000ffffcdfd7e20 0x0000ffff8b461568>

In debug build:

<Starting application /home/user/devel/some/debug/hello (18794)>
<Initial options = -no_dynamic_options -client_lib '/home/user/devel/some/debug/libone/libone.so;0;;/home/user/devel/some/debug/libtwo/libtwo.so;1;' -client_lib64 '/home/user/devel/some/debug/libone/libone.so;0;;/home/user/devel/some/debug/libtwo/libtwo.so;1;' -code_api -stack_size 64K -signal_stack_size 64K -max_elide_jmp 0 -max_elide_call 0 -vmm_block_size 64K -initial_heap_unit_size 64K -initial_heap_nonpers_size 64K -initial_global_heap_unit_size 512K -max_heap_unit_size 4M -heap_commit_increment 64K -cache_commit_increment 64K -cache_bb_unit_init 64K -cache_bb_unit_max 64K -cache_bb_unit_quadruple 64K -cache_trace_unit_init 64K -cache_trace_unit_max 64K -cache_trace_unit_quadruple 64K -cache_shared_bb_unit_init 512K -cache_shared_bb_unit_max 512K -cache_shared_bb_unit_quadruple 512K -cache_shared_trace_unit_init 512K -cache_shared_trace_unit_max 512K -cache_shared_trace_unit_quadruple 512K -cache_bb_unit_upgrade 64K -cache_trace_unit_upgrade 64K -cache_shared_bb_unit_upgrade 512K -cache_shared_trace_unit_upgrade 512K -early_inject -emulate_brk -no_inline_ignored_syscalls -no_per_thread_guard_pages -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Application /home/user/devel/some/debug/hello (18794).  Internal Error: DynamoRIO debug check failure: /home/user/devel/some/dynamorio/core/unix/module_elf.c:1130 pd != NULL && name != NULL
(Error occurred @0 frags in tid 18794)
version 9.0.0, custom build
-no_dynamic_options -client_lib '/home/user/devel/some/debug/libone/libone.so;0;;/home/user/devel/some/debug/libtwo/libtwo.so;1;' -client_lib64 '/home/user/devel/some/debug/libone/libone.so;0;;/home/user/devel/some/debug/libtwo/libtwo.so;1;' -code_api -stack_size 64K -signal_stack_size 64K -max_el
0x0000ffffd6118980 0x0000ffffa07edc64
0x0000ffffd6118b20 0x0000ffffa0ad7804
0x0000ffffd6118d60 0x0000ffffa0ad8c20
0x0000ffffd6118e90 0x0000ffffa0ad8f94
0x0000ffffd6118f00 0x0000ffffa0accf84
0x0000ffffd6118f40 0x0000ffffa0acd258
0x0000ffffd6118f70 0x0000ffffa0acb4c8
0x0000ffffd6118fa0 0x0000ffffa08f90ec
0x0000ffffd6119000 0x0000ffffa08f8908
0x0000ffffd6119020 0x0000ffffa0acbc1c
0x0000ffffd6119090 0x0000ffffa0acb410
0x0000ffffd61192c0 0x0000ffffa08f90ec
0x0000ffffd6119320 0x0000ffffa08f6d6c
0x0000ffffd6119350 0x0000ffffa08f6fec
0x0000ffffd6119590 0x0000ffffa075c434
/home/user/devel/some/debug/dynamorio/lib64/debug/libdynamorio.so=0x0000ffffa0730000
/home/user/devel/some/debug/libtwo/libtwo.so=0x0000000043990000
/home/user/devel/some/debug/dynamorio/ext/lib64/debug/libdrmgr.so=0x0000000073000000
/home/user/devel/some/debug/libone/libone.so=0x0000000043950000>

@yury-khrustalev
Contributor

This is related to #3850: when looking up symbols, DR tries to obtain os_privmod_data, and it might not yet be initialised (actually, all clients except the first one will not be initialised); this causes a null-pointer dereference.

@derekbruening, would it be correct to skip uninitialised modules when mod->is_client is true? Clients should be leaves of the dependency tree and therefore would never provide any symbols for any other modules. To this end, could we always skip all client modules except the current one (for which module_lookup_symbol is called and for which the lookup will already have been done before the loop over all modules)?

I have tried this fix on x86 and Arm64 and it seems to be working.
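
In code, the skip would look roughly like this (a minimal sketch assuming the module-list walk in module_lookup_symbol; the mod->is_client field and privload_first_module/privload_next_module helpers are as referenced in this thread):

mod = privload_first_module();
while (mod != NULL) {
    /* Clients are leaves of the dependency tree: they never export
     * symbols to other modules, and all but one may still have
     * os_privmod_data == NULL at this point. */
    if (mod->is_client) {
        mod = privload_next_module(mod);
        continue;
    }
    /* ... normal symbol lookup in mod ... */
    mod = privload_next_module(mod);
}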

@derekbruening
Contributor Author

When privload_process_imports is called for the first time (for the last client on the command line; let's say it's libfoo.so) from privload_load_process, it ends up walking through the dependencies of libfoo.so (which could be libdrreg.so and libdrmgr.so). This leads to an access to the os_privmod_data of the first client on the command line, for which this field has not yet been created; it is only created later, during the next iteration of the for loop in privload_process_early_mods.

It happens because during module_lookup_symbol, which is transitively called from privload_process_imports (see stack trace above), we do:

mod = privload_first_module();

but of course this first module has not been fully initialised yet. In fact, only one of all the modules has been initialised by this point.

Somehow we should call privload_create_os_privmod_data for all modules before importing modules, though I'm not sure how to do it.

Thank you for this explanation: this makes it clear.

It looks like privload_load_process() calls privload_add_areas() before processing imports or relocs, and privload_add_areas() is what calls privload_create_os_privmod_data(). So would a solution be to have privload_process_early_mods() first do a pass through the early libs and call privload_add_areas() for each, and then a separate pass to call privload_load_process(), and have privload_load_process() skip the call to privload_add_areas() if privmod->os_privmod_data != NULL?
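
Roughly, the two passes would look like this (an illustrative sketch only; the function names are the ones discussed above, and the real loop in privload_process_early_mods() iterates over the early libs rather than all modules):

/* Pass 1: set up os_privmod_data for every early lib up front. */
for (mod = privload_first_module(); mod != NULL;
     mod = privload_next_module(mod))
    privload_add_areas(mod);
/* Pass 2: process imports and relocations; symbol lookup can now use
 * any module, since its os_privmod_data already exists. */
for (mod = privload_first_module(); mod != NULL;
     mod = privload_next_module(mod))
    privload_load_process(mod); /* would skip privload_add_areas() when
                                 * os_privmod_data != NULL */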

Please also add a regression test to the test suite with two clients to ensure this use case doesn't break in the future.

@yury-khrustalev
Contributor

yury-khrustalev commented May 24, 2022

@derekbruening, yes, this solution works too. But there is another issue: when, at the end, the libraries are unloaded and DR calls privload_call_lib_func(opd->fini_array[i]); from privload_call_entry(... DLL_PROCESS_EXIT), there is a segfault in the second client library's __do_global_dtors_aux. Both clients work fine alone, and either library fails as described when put last on the drrun command line. So it's not the destructor code's fault but rather something in the context. Any help here would be greatly appreciated.

Note: this behaviour can only be reproduced in a release build, because in debug mode the part of the code that is responsible for calling the destructors is not used.

It appears that by the time this second destructor is called, the PLT entry __cxa_finalize@plt that is invoked from it is corrupted, and execution jumps into the middle of libdynamorio.so.

@yury-khrustalev
Contributor

The above segfault happens in glibc's __cxa_finalize, at its first instruction, which writes to the stack. It happens for the second client library when its fini_array entries are being called (this results in __do_global_dtors_aux calling glibc's __cxa_finalize).

00000000000352c8 <__cxa_finalize>:
   352c8: fd 7b ba a9   stp     x29, x30, [sp, #-96]!
   ...

The call stack looks like this (the top frame is shown incorrectly due to a glitch in GDB -- it in fact refers to __cxa_finalize+0, which can be seen from analysing the relocations):

#0  0x0000fffff7e93758 in encode_opndsgen_0dc0c000_401f0fff (pc=0x0, instr=0xa7025203d206772, enc=1629513082, di=0x69665f6178635f5f) at build/dynamorio/opnd_encode_funcs.h:2695
#1  0x0000ffffb3fa19c8 in __do_global_dtors_aux ()
#2  0x00000000711dac98 in privload_call_lib_func (func=0xffffb3fa1998 <__do_global_dtors_aux>) at dynamorio/core/unix/loader.c:1024
#3  0x00000000711da068 in privload_call_entry (dcontext=0xffffffffffffffff, privmod=0xfffdb4006368, reason=2) at dynamorio/core/unix/loader.c:660
#4  0x000000007109c134 in privload_call_entry_if_not_yet (dcontext=0xffffffffffffffff, privmod=0xfffdb4006368, reason=2) at dynamorio/core/loader_shared.c:121
#5  0x000000007109d890 in privload_unload (privmod=0xfffdb4006368) at dynamorio/core/loader_shared.c:737
#6  0x000000007109c594 in loader_exit () at dynamorio/core/loader_shared.c:248
#7  0x0000000071040c3c in dynamo_process_exit () at dynamorio/core/dynamo.c:1575
#8  0x00000000711bd0c4 in cat_done_saving_dstack ()

@yury-khrustalev
Contributor

yury-khrustalev commented May 31, 2022

@derekbruening, yes, this solution works too.

Correction: It seems like the proposed solution results in a memory leak. The solution I originally used (skipping clients in the while loop in module_lookup_symbol) doesn't have this issue.

I didn't notice it at first because the fini_array segfault only appears in a non-debug build, while only a debug build enables checks for memory leaks.

@egrimley-arm
Contributor

How did this work before 9293e7a?

@egrimley-arm
Contributor

Some more specific questions:

Apparently the main purpose of 9293e7a was to add TLS handling to the Windows loader. Was the patch intended to have no effect (in some sense) on the way the loader works on Linux?

Multiple clients used to work on Linux. Did they ever work on Windows?

Do multiple clients now/still work on Windows?

If 9293e7a did not add any new feature to Linux then presumably the previous code demonstrates that it is possible to have multiple clients on Linux (though perhaps something has been added since then that is incompatible with the way the loader used to work). Is there a good (fundamental) reason why multiple clients on Windows would be hard to implement, something to do with the way TLS and the calling of initialisers work on Windows, for example?

Would it be acceptable to have multiple clients working only on Linux?

Would it be acceptable to separate out the Linux and Windows loaders a bit more, at the cost of increasing the total quantity of code, if that makes it easier for developers, who may not have access to both systems, to more easily improve the implementation on one without breaking it on the other?

Thanks for any answers. I'm still not able to see the wood for the trees in the private loader.

@derekbruening
Contributor Author

I would vote for not adding platform-specific divergence to the shared private loader code, as it would just increase the complexity and maintenance burden of what is already complex. It sounded like @yury-khrustalev had a solution by modifying module_lookup_symbol.

egrimley-arm added a commit that referenced this issue Sep 10, 2024
Skip other client modules, because some will not be initialised and
clients should be leaves of the dependency tree and not provide
symbols for other modules.

Also add a test that two clients can be loaded without a crash.

Fixes #4501

Change-Id: I9eb4838b349e06094653491c669ab133e92c048c