Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbol pollution for pr8132 #8262

Closed
wants to merge 1 commit into from

Conversation

markalle
Copy link
Contributor

@markalle markalle commented Dec 1, 2020

There's an incoming PR
#8132
that makes more MCAs get built into the main library, and that brings more potential symbol name pollution.

My testcase
open-mpi/ompi-tests-public#3
identified a bunch of symbols without recognized prefixes when I built 8132, so this PR adds a bunch of prefixes and/or makes symbols static if they didn't look like they were used anywhere.

I think 2/3 of these changes are in treematch, but they span other files too.

@rhc54
Copy link
Contributor

rhc54 commented Dec 1, 2020

@bosilca Last I recall, "treematch" is an import from another code project, yes? I seem to recall that we are not supposed to make changes here for it, and there is no symbol "leakage" issue since it isn't an OMPI code.

@bosilca
Copy link
Member

bosilca commented Dec 2, 2020

Yes, treematch is an import, we should avoid or minimize the changes that might put our version at odds with the original.

ompi/mca/coll/adapt/coll_adapt_ibcast.c Show resolved Hide resolved
ompi/mca/coll/han/coll_han.h Outdated Show resolved Hide resolved
ompi/mca/coll/han/coll_han_module.c Outdated Show resolved Hide resolved
@@ -45,7 +45,7 @@ BEGIN_C_DECLS
extern int mca_common_monitoring_output_stream_id;
extern int mca_common_monitoring_enabled;
extern int mca_common_monitoring_current_state;
extern opal_hash_table_t *common_monitoring_translation_ht;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another solution would have been to move the mca_common_monitoring_get_world_rank function into the .c file (it is too large to be inlined anyway), and then this symbol could be restricted to the scope of the .c file (and can therefore be completely removed from the header file).

@@ -179,14 +179,14 @@ void ADIOI_GPFS_ReadStridedColl(ADIO_File fd, void *buf, int count,
/* One-sided aggregation needs the amount of data per rank as well because the difference in
* starting and ending offsets for 1 byte is 0 the same as 0 bytes so it cannot be distiguished.
*/
if ((gpfsmpio_read_aggmethod == 1) || (gpfsmpio_read_aggmethod == 2)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This provides little conveniency while increasing the burden on the maintainer of the ROMIO integration. I would carefully assess the cost.

@@ -37,13 +37,13 @@ long bglocklessmpio_f_type;
int gpfsmpio_bg_nagg_pset;
int gpfsmpio_pthreadio;
int gpfsmpio_p2pcontig;
int gpfsmpio_write_aggmethod;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing that some of these symbols have now a different prefix than the rest.

@@ -47,7 +47,7 @@ struct mca_mtl_request_t;
* These are called internally by the library when the send
* is completed from its perspective.
*/
extern void (*send_completion_callbacks[])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this array is necessary anymore.

@@ -2,7 +2,7 @@
#include <stdio.h>
#include "IntConstantInitializedVector.h"

int intCIV_isInitialized(int_CIVector * v, int i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated by @rhc54 this is imported code that should mostly be left as is. How is this symbol even spilling out ? It seems to be used only in this file and it is not declared as extern in the header !?

@@ -183,13 +183,13 @@ static int mca_vprotocol_pessimist_component_finalize(void)
int mca_vprotocol_pessimist_enable(bool enable) {
if(enable) {
int ret;
if((ret = vprotocol_pessimist_sender_based_init(_mmap_file_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't understand why this symbol is globally visible ?

@markalle
Copy link
Contributor Author

markalle commented Dec 2, 2020

Thanks for the review, if treematch isn't to be modified, then I think PR8132 has got to leave it out of libmpi.so, since it's by far the largest offender in terms of globally exported symbols with no reasonable prefix in front of them. I'll back out some of these changes, especially treematch and romio and see where the test stands then. I don't really object to allowing "gpfsmpio_" for example as one of our acceptable prefixes. Let me see what remains after reverting some of these.

@rhc54
Copy link
Contributor

rhc54 commented Dec 2, 2020

if treematch isn't to be modified, then I think PR8132 has got to leave it out of libmpi.so,

Why? Someone is linking against the "treematch" library (even if it is thru us), so of course the "treematch" symbols are exposed. Your criteria seems somewhat arbitrary here - "gpfsmpio" is okay because it is an IBM symbol? Or is there some other reason why that prefix should be considered okay?

Perhaps the better approach would be for the community to decide what symbols represent an issue and which don't, and then generate a PR to deal with them? I don't recall such a discussion taking place - perhaps I missed it?

@markalle
Copy link
Contributor Author

markalle commented Dec 2, 2020

If treematch was only exporting names like tm_compute_mapping I wouldn't object, but it's also exporting choose ,create_work, exchange, get_time, and update_val for example. Those are too generic a names to be requiring an application writer to know they have to avoid. If the users's app has its own create_work the likely failure is ungraceful with OMPI's treematch calling into the app's redefinition of create_work.

I have another PR opened for a testcase which is my proposal for what exported symbol names should be allowed/disallowed.
open-mpi/ompi-tests-public#3
Generally I would say any uncommon prefix is acceptable, as the point is to not collide with things an app or another library is using.

If we're just being practical, then I'd allow a pretty long list of acceptable prefixes. If we're trying to be really clean and easy to document I'd shorten the list more.

@rhc54
Copy link
Contributor

rhc54 commented Dec 2, 2020

You're missing my point, Mark - my point is that this PR is making rather arbitrary delineations between what is and isn't an acceptable prefix. This is something the community needs to discuss in one of its meetings, and not on a low-bandwidth PR. It needs to factor in how we are handling 3rd-party code bases in general. We understand how to deal with our own internal code - but treematch and ROMIO aren't in that category.

@awlauria
Copy link
Contributor

awlauria commented Dec 2, 2020

If treematch was only exporting names like tm_compute_mapping I wouldn't object, but it's also exporting choose ,create_work, exchange, get_time, and update_val for example. Those are too generic a names to be requiring an application writer to know they have to avoid. If the users's app has its own create_work the likely failure is ungraceful with OMPI's treematch calling into the app's redefinition of create_work.

I agree that those types of symbols are too generic to be exporting, and should be cleaned up. Maybe we can target those such symbols and minimize the pr?

@rhc54
Copy link
Contributor

rhc54 commented Dec 2, 2020

I don't disagree with that assessment - but I would defer further work until the community can figure out how they want such things addressed. Again, this is a 3rd-party package - it isn't part of the OMPI code base, and we aren't free to edit it at will.

Let's put it on the schedule for next Tues and see what we can figure out.

@bosilca
Copy link
Member

bosilca commented Dec 2, 2020

Why not taking a drastically different approach, and allow the linker to only export the MPI API ? As an example libtool has -export-symbols to provide such a capability.

@rhc54
Copy link
Contributor

rhc54 commented Dec 2, 2020

Interesting idea - how would it impact the dll's? I thought one reason symbols are remaining visible is so that the components can access symbols in the core library? I suppose if we are going to "slurp" all the components into the core lib, then maybe all that goes away and it won't really matter - but that would (I should think) mean that all components have to be absorbed.

@markalle
Copy link
Contributor Author

markalle commented Dec 2, 2020

I understand the objection to changing external code, that's why I said I intend to remove all the treematch and romio changes from this PR. The part I was objecting to with my example symbols is it sounded like you were saying since we're linking against treematch that a user should expect all the treematch symbols to be exposed and users should thus know not to conflict with them.

@boslica how does -export-symbols that work with symbols used internally but across multiple .c files? If the exports can be controlled that strongly that would be great.

@rhc54
Copy link
Contributor

rhc54 commented Dec 3, 2020

since we're linking against treematch that a user should expect all the treematch symbols to be exposed

It is a point of concern, I agree - a user should expect to see 3rd-party symbols, but may not realize the full roster of packages being pulled in by OMPI. We've run into this before, particularly with the conflict between libev and libevent. Not sure how one goes about fully advertising to users "here are ALL the symbols you might see, depending upon which plugins are active".

I don't know the right answer, frankly. Historically, our approach has been pretty simple:

  • properly prefix the internal variables
  • post bug fixes and symbol corrections to the upstream 3rd-party packages, and then update our "copy" when they get committed (and usually released, depending on the delay)

Lately, we have been moving away from embedding 3rd-party packages. I'd expect to no longer see libevent, hwloc, or PMIx pretty in v6. PRRTE might stick around for another major release, though it seems to be getting picked up by downstream packagers and may no longer require it after v5.

This is why I'm advocating for a discussion. If we are going to "un-embed" all 3rd-party packages, then many of these problems go away. It should at least be factored into our near-term approach.

@markalle
Copy link
Contributor Author

markalle commented Dec 3, 2020

I'm not sure about the portability of this, but I at least like the functionality of the GCC 4.0 stream's -fvisibility=hidden and __attribute__((__visibility__("default"))) to do what @boslica said. Within a shared library it lets the symbols span multiple .c files with no problem. However it doesn't let components access internal symbols back in the core library unless they were exported.

Eg I made a play model with:
libfoo.so with foo() as its exported API call, but also an internal util_from_libfoo() function.
libbar.so with bar() that prints a mesg and also calls util_from_libfoo()
main.c calls foo() which dlopen/dlsyms to call bar()
So libfoo.so is kind of like libmpi.so, and libbar.so is kind of like an MCA.so trying to use symbols back in libmpi.so.

Anyway the above only worked if util_from_libfoo() was exported. Not too surprising, but I wanted to confirm since I wasn't sure

@bosilca
Copy link
Member

bosilca commented Dec 3, 2020

That's exactly what our *_DECLSPEC is supposed to achieve, but apparently it does not work for static libraries, at least not the way we expect/want it to work.

Symbol name pollution fix: adding "ompi_" or similar
prefixes to a bunch of symbols, or making some symbols
static if they were very isolated.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle markalle force-pushed the symbol_pollution_for_pr8132 branch from 8d75629 to 83bf160 Compare December 3, 2020 21:41
@markalle
Copy link
Contributor Author

markalle commented Dec 3, 2020

Right now any function that's not static is globally visible, so if it spans multiple .c files and thus can't be static, it's out there for potential conflict with an app. So for a lot of those symbols above where you were asking how it's getting accessed from outside its module -- it wasn't really outside the module, it just was used in two different .c files within the module and therefore isn't static, and at least with our current default build that means it's globally visible.

I kind of like an approach that's more explicit and just exporting MPI_*, but the low-tech solution is to just put prefixes on anything that's meant to be internal. Even though prefixes are a simplistic solution, they work pretty well.

I just repushed with treematch and romio removed. For now it still has the same approach in adapt and vprotocol etc (only adding prefixes or making things static). I just wanted to get rid of the external package changes to at least not muddy the waters with those.

@gpaulsen
Copy link
Member

We are discussing this in context of #8132.

@jsquyres
Copy link
Member

There's a few issues at play here:

  • We need non-MPI_* symbols exported from libmpi (and friends) so that DSO components can find/use those symbols at run time.
  • We have apparently gotten quite sloppy about properly prefixing non-static symbols throughout the core and components.
    • We may have been relying on OMPI_DECLSPEC too much (e.g., it doesn't help in the case of static builds).
  • Data point: as of Jan 2021 (i.e., Open MPI 4.0.x, 4.1.x, and master), our DSOs are loaded into the public scope.
  • One open question that I do not remember offhand: if component A links against libfoo.so and is opened in a private scope, and component B links against libfoo.so and is also opened in a private scope, do we effectively have one copy or two copies of libfoo.so?

@markalle
Copy link
Contributor Author

@jsquyres about how many libfoo.so, I believe it's still one. I made a toy program to see empirically and that's how it went anyway. And you can have some messes in the symbol binding there too, like suppose

  • libfoo.so defines foo()
  • componentA.so is linked against libfoo.so but also defines its own foo()
  • componentB.so is linked against libfoo.so and calls foo()
    Component B is really going to expect its foo() to come from the libfoo.so it's linked against, but it doesn't. It calls the foo() over in component A which nobody's directly linked against.

If you just have this situation:

  • component A defines whatever symbols it wants
  • main library dlopen's components
    I don't know of any way that generic symbols defined in component A can step on or be stepped on by anything else. And that's where the MCAs are today, pre-8132 anyway

But as soon as libmpi.so or even componentA.so is linked against a libsomething.so that has generic symbols, then collisions become possible

@gpaulsen
Copy link
Member

gpaulsen commented Apr 5, 2021

@markalle Is this still an issue on master?

@markalle
Copy link
Contributor Author

markalle commented Aug 15, 2022

Yes, I just built on master to see what the current state of the world is and now that PR #8132 is in, libmpi.so is indeed filled with global symbols with generic names that can too-easily conflict with user apps.

This specific PR I think inspired a handful of fixes into treematch, but many conflicts remain.

How to see the problem:
start with a regular build of OMPI main (checkout and autogen/configure/make/make install)

*** For a simple by-hand first pass to see approximately what's there:
nm install_dir/lib/libmpi.so | grep ' [TWBCDVR] ' | grep -v -e ' MPI_' -e ' mpi_' -e ' PMPI_' -e ' MPIX_' -e ' MPL_' -e ' ADIOI_' -e ' ADIO_' -e ' ompi_' -e ' OMPI_' -e ' mca_' | less

This shows all the global symbols, and removes some basics like MPI_Foo, ADIOI_Foo, ompi_foo, that should be considered acceptable as exports.

Just the first few lines of global symbol name pollution:

000000000020c2f0 T adapt_module_cached_topology
00000000003e0db8 D adapt_topology_cache_item_t_class
000000000031d210 T add_edge_3
000000000031c3f0 T add_to_bucket
000000000032a5f0 T add_to_list
0000000000244ac0 T ad_get_env_vars
000000000032c9c0 T adjacency_asc
0000000000329530 T adjacency_dsc
000000000032a0c0 T aggregate_obj_weight
000000000031ad70 T algo
000000000031df30 T allocate_vertex
000000000031b430 T allocate_vertex2
00000000002e9a10 T append_frag_to_ordered_list
00000000003e08c0 D available_components
000000000031b130 T balancing

*** making a testcase that fails, to demonstrate why pollution is a problem

I think any global export lacking some kind of prefix should be considered a bug. Here's an example of why they matter in case it's not clear why symbol name pollution causes bugs (ie, I'm not raising an issue about aesthetics here, it's about failing code).

If I browse the list of functions being exported and pick "append_frag_to_ordered_list" there's no reason a user writing an Application wouldn't potentially use that exact function name, and if they do, maybe creating an app that looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
static void
myprint(char *mesg)
{
    char str[4096];
    printf("*** %s -- in App's function\n", mesg);
    fflush(stdout);
    sprintf(str,
        "gdb -q -batch -ex bt -p %d < "
        "/dev/null 2>&1 | grep -v "
        "  -e '^Missing separate debuginfo for ' "
        "  -e '^Warning:.*No such file or directory' "
        "  -e 'Inferior .*process .*detached' "
        "  -e ' zypper install -C ' | "
        "sed -e 's/^/[p%d] /'", getpid(), getpid());
    system(str);
    fflush(stdout);
}
void append_frag_to_ordered_list() { myprint("append_frag_to_ordered_list"); }
int
main(int argc, char *argv[])
{
    int   myrank, nranks;
    char  myhost[MPI_MAX_PROCESSOR_NAME];
    int   len;
    char  *sbuf, *rbuf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(myhost, &len);
    sbuf = malloc(10000000);
    rbuf = malloc(10000000);
    MPI_Bcast(sbuf, 100, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 1000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 10000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 100000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 1000000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(sbuf, 10000000, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Sendrecv(sbuf, 10, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 10, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 1000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 1000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 100000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 100000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sbuf, 10000000, MPI_CHAR, (myrank+1)%nranks, 99,
                 rbuf, 10000000, MPI_CHAR, (myrank+nranks-1)%nranks, 99,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    printf("past finalize at rank %d/%d (on %s)\n", myrank, nranks, myhost);
    exit(0);
}

And run with something like

export OPAL_PREFIX=/path/to/ompinstall
export LD_LIBRARY_PATH="${OPAL_PREFIX}/lib"
$OPAL_PREFIX/bin/mpicc -o x simplempi.c
hosts=hostA:1,hostB:1
$OPAL_PREFIX/bin/mpirun \
    --host $hosts \
    --mca coll ^hcoll \
    --mca pml ob1 --mca btl tcp,self \
    ./x

The stack trace comes up as deep in OMPI code where suddenly it thinks it's calling OMPI's "append_frag_to_ordered_list" but it's actually calling the user Application's redefinition.

[p295482] #3  0x0000000010000dc8 in myprint ()
[p295482] #4  0x0000000010000e34 in append_frag_to_ordered_list ()
[p295482] #5  0x000020000036b368 in mca_pml_ob1_recv_frag_match () from /smpi_dev/markalle/Projects/OMPIDev.symbol/install/lib/libmpi.so.0
[p295482] #6  0x0000200000d2a658 in mca_btl_tcp_endpoint_recv_handler () from /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libopen-pal.so.0
[p295482] #7  0x000020000112a58c in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
[p295482] #8  0x000020000112b020 in event_base_loop () from /lib64/libevent_core-2.1.so.6
[p295482] #9  0x0000200000c736b8 in opal_progress_events () from /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libopen-pal.so.0
[p295482] #10 0x0000200000c73838 in opal_progress () from /u/markalle/space/Projects/OMPIDev.symbol/install/lib/libopen-pal.so.0
[p295482] #11 0x000020000013d070 in ompi_request_default_wait () from /smpi_dev/markalle/Projects/OMPIDev.symbol/install/lib/libmpi.so.0
[p295482] #12 0x00002000001a36e8 in PMPI_Sendrecv () from /smpi_dev/markalle/Projects/OMPIDev.symbol/install/lib/libmpi.so.0
[p295482] #13 0x00000000100010f4 in main ()

There are multiple ways to solve it, but I think the simplest is make sure all the globally exported symbols have some prefix like "ompi_" in front of them, and also have a testcase running that looks for un-prefixed exports. I checked in a testcase that I guess isn't actually being run at open-mpi/ompi-tests-public#3 that would error about the state of today's OMPI master

@jsquyres
Copy link
Member

@open-mpi/ompi-gatekeeper-5-0-x The issues discussed on this PR have implications for the v5.0.0 release. Bottom line: since components are now, by default, slurped into the base libraries, we are seeing a lot of symbol pollution. Technically, this has likely been around for a long time, but it may not have been noticed since the default was to build components as DSOs. This should probably be resolved, or at least documented.

@jsquyres
Copy link
Member

See also #10708.

@awlauria
Copy link
Contributor

awlauria commented Oct 24, 2022

Closing this - thanks for opening #10708. @markalle if you see anything missing from this PR in the efforts to clean up these symbols from @drwootton please port them over from this pr.

Also: I fixed the labels to correctly show the target branch.

@awlauria awlauria closed this Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants