
Porting payu to non-NCI machinery #323

Open
ChrisC28 opened this issue Mar 25, 2022 · 12 comments · May be fixed by #326

@ChrisC28

There has been some discussion in the context of a MOM6 project about the potential for porting Payu to non-NCI hardware.

I would be interested in getting payu up and running on the Pawsey HPC system. This system uses Cray architecture with Slurm as the scheduler.

I notice that there appears to be some support for Slurm in payu, as there is a Slurm scheduler class. However, I'm unsure of the process for porting payu to another system.

@aidanheerdegen
Collaborator

Initial work to port to pawsey is here

#326

@aidanheerdegen
Collaborator

Thanks @aidanheerdegen, I appear to have got the double_gyre case working.
Regarding payu @Pawsey, it seems to be working "well enough" for now. The only thing I'd add is that it would be useful to put the work directory on /scratch for more realistic runs. We can deal with additional support for Pawsey when and if it arises, and I may be able to get some support from CSIRO for that.
I'm going to make an attempt to get eac_10 working today. I'm almost certain I'll run into an issue I can't solve, so thanks in advance for the continuing help! It's been quite a while (~9 years) since I used payu or MOM.

It is simple to change the laboratory location: just set shortpath in config.yaml:

shortpath: /scratch/pawsey0410/

It seems Pawsey are pretty enthusiastic about purging /scratch, so it would make sense to keep your executables and input dirs on /group, use full paths to them, and use some sort of syncing to copy the data to /group. That way the laboratory can be deleted and reinstated pretty much automatically.

There are examples of auto-syncing scripts in the COSIMA experiment repos, e.g.

https://github.com/COSIMA/1deg_era5_iaf/blob/master/sync_data.sh

This is invoked with an option in config.yaml:

https://github.com/COSIMA/1deg_era5_iaf/blob/master/config.yaml#L79
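For reference, the relevant part of such a config looks roughly like the fragment below. This is a sketch, not copied from the linked file: I'm assuming payu's postscript option is what hooks the sync script in, so check the linked config.yaml for the exact key.

```yaml
# Sketch of hooking a sync script into a payu run (key name assumed;
# verify against the linked COSIMA config.yaml)
postscript: sync_data.sh
```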

@ChrisC28
Author

Recently I noticed some oddities with project accounting on Pawsey: essentially, my project wasn't being debited.

It turns out that the Slurm argument equivalent to PBS's -P argument is -A (for Account). As far as I can tell from rummaging around in the code, the Slurm scheduler does not pass a project argument.

A single line of code should fix the problem:

pbs_flags.append('-A {project}'.format(project=pbs_project))

However, the relevant code on Pawsey is read-only.
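To show where that one-liner slots in, here is a hedged, standalone sketch of how a scheduler class typically assembles submission flags from config values. The function name, config keys, and surrounding structure are illustrative assumptions, not payu's actual code; only the -A line mirrors the fix above.

```python
# Illustrative sketch only: payu's real Slurm scheduler differs in names
# and structure. The point is that the project/account must be appended
# as a Slurm -A flag, or the job is not debited to the right project.

def slurm_flags(config):
    """Build a list of sbatch flags from a payu-style config dict."""
    flags = []

    queue = config.get('queue')
    if queue:
        flags.append('-p {queue}'.format(queue=queue))

    walltime = config.get('walltime')
    if walltime:
        flags.append('-t {walltime}'.format(walltime=walltime))

    # The fix: Slurm's equivalent of PBS's -P is -A (account), so the
    # project set in config.yaml has to be passed through here.
    project = config.get('project')
    if project:
        flags.append('-A {project}'.format(project=project))

    return flags

print(slurm_flags({'queue': 'work', 'walltime': '02:00:00',
                   'project': 'pawsey0410'}))
```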

@aidanheerdegen
Collaborator

Sorry Chris, this slipped through the cracks. Feel free to ping me again if it looks like I've forgotten.

I have updated the payu version on magnus with this change.

The modified payu code is in this PR #326

I can step you through the process of building your own conda environment with this modified code if that is useful.

aidanheerdegen linked a pull request on Jul 27, 2023 that will close this issue
@reillyja

Hi - I've just got a quick clarification question about tracking changes before I summarise the latest Setonix issues.

Just a simple workflow question, as my GitHub skills are still in their infancy:

  • So we clone the payu GitHub repo; I've cloned this to $MYSOFTWARE.
  • Then when we do a pip install ., the Python libraries are copied to the $MYSOFTWARE/setonix/lib path and the executables are copied to the $MYSOFTWARE/setonix/bin path.
  • When we make changes to the Python scripts, we do this in the lib path, which doesn't feed back to the git repository in $MYSOFTWARE.
    Firstly, is this the correct way of importing/editing the scripts? And if so, do we just need to copy the Python packages back to the git repository to push the changes up to the GitHub fork?

Thanks!

@angus-g
Collaborator

angus-g commented Sep 20, 2023

Is this the correct way of importing/editing the scripts?

No, you should use pip install -e ., which means that only a link is installed into the lib path. When you edit the contents of the repository, that's reflected in the module you import.

@reillyja

reillyja commented Sep 20, 2023

Firstly - after cloning my forked payu repo and using pip install -e ., the working directory for editing the scripts was assigned to $MYSOFTWARE/conda_install/lib/python3.10/site-packages/payu-1.0.19-py3.10.egg/ (i.e. instead of the default $MYSOFTWARE/setonix/python/lib/python3.10/site-packages/payu/ directory when using pip install .). Any ideas why that would be?

Nonetheless, I've made a couple of edits to envmod.py based on Dale Roberts' comments in this Hive post, and also just commented out a couple of lines in slurm.py. Other than that, it's identical to the current master branch.

The error I'm getting now comes out of the mom6.err file as:
/scratch/pawsey0410/jreilly/mom6/work/eac_sthpac-forced_v3/MOM6-SIS2: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory

  • I looked at where MOM6-SIS2 expects libnetcdf.so.19 and it has a path extending from /software/setonix/2022.11/..., whilst this file also exists in /software/setonix/2023.08/....

I tried making the system software path in the envmod.py file more specific, directing it to /software/setonix/2022.11/ instead of just /software/setonix/; however, that caused other issues for software required from the 2023.08 path.

I'm starting to think it might just be easier to recompile the model so that everything points to the 2023.08 software directories. Any other comments on this?

My forked repo is at https://github.com/reillyja/payu btw.

@angus-g
Collaborator

angus-g commented Sep 20, 2023

the working directory for editing the scripts was assigned to [...] Any ideas why that would be?

Did you have a conda environment activated?

I'm starting to think it might just be easier to recompile the model so that everything points to the 2023.08 software directories. Any other comments on this?

Per the September Pawsey update, the 2022.11 environment is no longer supported due to other changes. It would be significantly easier to rebuild the model than to fiddle with all the required path changes.

@dsroberts

Hi @reillyja, these changes look fine for what you're trying to do. Did you end up getting the missing shared library issue resolved? To properly backport these changes, there would need to be one or more new config options to take into account Lmod and possibly the Cray environment as well. The module unload step would have to go for Cray systems, but might be left in for other Lmod+Spack systems.

Actually, I just had a thought. If core_modules (https://github.com/reillyja/payu/blob/master/payu/experiment.py#L37C34-L37C34) can be moved to a config option, you could populate it with the standard list of modules loaded on login on Setonix, rather than skipping the module unload steps entirely. This is safer, I think, as submitting jobs on Setonix works like having qsub -V specified on Gadi, meaning all modules you've loaded in your current session are carried through to the job. I'm not a fan of this; it leads to very inconsistent environments between jobs.
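The suggestion above can be sketched in a few lines. Everything here is hypothetical: the function name, the config key, and the fallback list are illustrative stand-ins, not payu's actual API or the real hardcoded list in experiment.py.

```python
# Sketch of reading core_modules from config.yaml with a hardcoded
# fallback, per the suggestion above. Names and defaults are hypothetical,
# not payu's actual code.

DEFAULT_CORE_MODULES = ['python', 'payu']  # hypothetical fallback list

def get_core_modules(config):
    """Return the modules to keep loaded across a module purge.

    On Setonix, a user could populate core_modules in config.yaml with
    the standard set of modules loaded at login, so that everything else
    can still be unloaded for a consistent job environment.
    """
    return config.get('core_modules', DEFAULT_CORE_MODULES)

# Example: a Setonix user lists their login modules in config.yaml
config = {'core_modules': ['craype', 'PrgEnv-gnu', 'cray-mpich']}
print(get_core_modules(config))
```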

@ChrisC28
Author

Hi all,

Thanks for your help in this matter.

Noting that at least one issue appears to be related to the compilation of the model, does this issue belong here? A port of payu to Setonix (and Slurm more generally) would be extremely welcome, but may be unrelated to the issues we are discussing here.

Should I open another issue/Hive discussion regarding model recompilation? It's currently compiled with gfortran, but I'd like to at least test a version using the Cray compiler suite (we had issues compiling FMS in the past with the Cray compilers, and I might need some help there).

@aidanheerdegen
Collaborator

aidanheerdegen commented Sep 22, 2023

We had a meeting with @reillyja and @ChrisC28 and managed to get MOM6+SIS2 compiled under the updated environment, as suggested by @angus-g:

#323 (comment)

It required some modifications to the FMS CMake config. FMS built OK using Angus' build config

https://github.com/angus-g/mom6-cmake

but when we tried to use this compiled library in the MOM6 build, it complained about some non-existent build directories in FMS.

Removing references to mosaic2/include and column_diagnostics/include from the CMakeLists.txt and re-running cmake in the FMS build dir solved the issue:

https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L363
https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L369
https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L380

@VanuatuN

Hi @ChrisC28 @reillyja @dsroberts,

I'm currently trying to run ACCESS-OM2 with Slurm on the Leonardo supercomputer in Italy (CINECA). After a month of struggling, all executables are compiled.

Did you manage to run outside of Gadi with Slurm?

Any help and comments are much appreciated; I would be very grateful for any advice.

Thanks
Natalia
