-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/psm3: Sometimes doesn't clean up stale shm regions #9893
Comments
We have noticed this in our internal MPICH tests. |
If a job fails or is killed, the provider has no chance to cleanup shared memory. I am not aware of any issues with cleanup for normally completed jobs. For failed jobs, cleanup such as follows is required and is typically integrated into job schedule post job cleanup: rm -rf /dev/shm/psm3_shm.* /dev/shm/sem.psm3_nic_affinity* /dev/shm/ If you are seeing a lack of cleanup for jobs which completed normally, and fully shutdown all OFI interfaces, please provide more details. I do not see such issues with OpenMPI nor Intel MPI |
@ToddRimmer in prov/shm there is a signal handler that cleans up the /dev/shm region in the event of a failure signal or abrupt failure. https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_signal.h. We don't do this manual cleanup for psm3 and it looks like mpich/argonne doesn't either. From what I can tell it looks like the "normal completion" path is okay. Can a signal handler like in prov/shm be implemented to solve this failure/interrupt path issue? |
In our experience, a common job failure mode is for middleware or some other layer to call abort(), which results in a segfault signal and no guaranteed clean way for provider to cleanup as process memory status is suspect. So, our recommendation has been to place the cleanup commands in the job scheduler epilog script, the commands are provided in a script so documented in our Intel Ethernet Fabric Suite Package. This approach has worked well in the field for over a decade with PSM3 and its predecessors. Can explore signal handler for some other signals, but given the need to address the common abort() case, the epilog approach is simple and covers all cases. |
@ToddRimmer: Does abort() send a sigsegv (signal segmentation fault)? I thought it sends sigabrt (signal abort) to the running process. In either case a signal handler would be able to interrupt both of these signals (or any signal except sigkill [signal kill]) and cleanup the /dev/shm/psm3* files before completing the signal. This is already done and working for the shm provider so maybe we can add it as an enhancement for psm3? I think having to add an extra cleanup outside of the code, that the code can and should handle, doesn't make sense for the long run. I can add your workaround solution of the: |
exploring. We do have an existing signal handler, can you try an experiment with PSM3_BACKTRACE=1 In this case our signal handler gets installed and it will call exit for SIGTERM and SIGINT. We do have an atexit() function installed to do the cleanup, so I suspect in your case, exit is not being called. Granted, exit() is defined as "not safe" for use in a signal handler, however for SIGINT and SIGTERM is it typically safe enough. If this addresses the issue, you have a workaround and I can be confident that some layer above won't override our signal handler an could enable it in general (I believe in the past we had such situations, so were only able to do the atexit() approach). |
PSM3 sometimes doesn't clean up stale shm regions. After a test is done you can still see some psm3.xxx.xxx shm regions in /dev/shm
Run a test several times. Then check /dev/shm and notice there are open shm regions that aren't cleaned up
It is expected that tests clean up all of their resources on completion.
Fabtests don't appear to report any errors and just leak the stale shm regions.
Environment:
Rhel 8.6, 8.7, and 9.3, psm3
The text was updated successfully, but these errors were encountered: