Update ocean readme with new compass environment #462

Merged: 2 commits merged into MPAS-Dev:ocean/develop on Mar 9, 2020


xylar
Collaborator

@xylar xylar commented Mar 3, 2020

This is based on the metapackage from MPAS-Tools (https://github.com/MPAS-Dev/MPAS-Tools/tree/master/compass)

@xylar
Collaborator Author

xylar commented Mar 3, 2020

@mark-petersen, if you can include this with your compass-only merges, that would be great! If you can do your testing with this new compass_0.1.0 environment, that would also be very good. compass_py3.7 should be considered deprecated and will be removed soonish.

@xylar
Collaborator Author

xylar commented Mar 4, 2020

Update: the compass_0.1.0 environment is deployed on cori (chmodding/chowning right now).

@xylar
Collaborator Author

xylar commented Mar 6, 2020

I just deployed compass_0.1.1 because of the bug fix to the mesh conversion tools (MPAS-Dev/MPAS-Tools#296). So I've updated the README once again to reflect this.

@xylar
Collaborator Author

xylar commented Mar 7, 2020

In my testing of e3sm_coupling, I discovered that there are some missing tools in compass. I added them and updated to v0.1.2 in MPAS-Dev/MPAS-Tools#298

@mark-petersen
Contributor

@xylar I merged into ocean/develop locally and tested the QU240 init using COMPASS on both grizzly and badger, with:

module unload python; source /usr/projects/climate/SHARED_CLIMATE/anaconda_envs/load_latest_compass.sh

The first base_mesh step runs on a log-in node, but fails on a compute node on both grizzly and badger with:

(compass_0.1.2) ba117:base_mesh$ ./run.py
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=11 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=11 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor
Traceback (most recent call last):
  File "./run.py", line 16, in <module>
    subprocess.check_call(['python', '-m', 'jigsaw_to_MPAS.build_mesh'])
  File "/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.2/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', '-m', 'jigsaw_to_MPAS.build_mesh']' returned non-zero exit status 15.

Same thing with the straight python command from the new package:

(compass_0.1.2) ba117:base_mesh$ pwd
/lustre/scratch4/turquoise/mpeterse/runs/nightly/ocean_model_200309_PR462/ocean/global_ocean/QU240/init/base_mesh
(compass_0.1.2) ba117:base_mesh$ which python
/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.2/bin/python
(compass_0.1.2) ba117:base_mesh$ python jigsaw_to_MPAS/build_mesh.py
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=11 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=11 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor

It looks like some interaction between python and MPI, but I don't understand it. I created another case with the current head of ocean/develop (without this PR) and the same thing happens, so it must be the new compass_0.1.2 package. On a log-in node this base_mesh step is successful.
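
Since the step succeeds on a log-in node but fails on a compute node, one thing that could be compared is the launcher-related environment on each. The following is a minimal diagnostic sketch, not a confirmed diagnosis: the prefix list is an assumption (`PMI_` is taken from the error text above; `PMIX_`, `SLURM_` and `OMPI_` are prefixes commonly set by job launchers), and both function names are hypothetical.

```python
import os

# Assumed prefixes: PMI_ appears in the error text; the others are commonly
# set by job launchers and MPI runtimes. Adjust for the system at hand.
LAUNCHER_PREFIXES = ("PMI_", "PMIX_", "SLURM_", "OMPI_")


def launcher_variables(environ=None, prefixes=LAUNCHER_PREFIXES):
    """Return the launcher-related variables found in an environment mapping."""
    if environ is None:
        environ = os.environ
    return {name: value for name, value in sorted(environ.items())
            if name.startswith(prefixes)}


def environment_diff(login_env, compute_env):
    """Return {name: (login_value, compute_value)} for variables that differ."""
    names = set(login_env) | set(compute_env)
    return {name: (login_env.get(name), compute_env.get(name))
            for name in sorted(names)
            if login_env.get(name) != compute_env.get(name)}


if __name__ == "__main__":
    # Run once on a log-in node and once on a compute node, then compare.
    for name, value in launcher_variables().items():
        print(f"{name}={value}")
```

Running the script in both places and comparing the two listings would show which launcher variables exist only inside the job allocation.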

@xylar
Collaborator Author

xylar commented Mar 9, 2020

Thanks @mark-petersen. I agree, this clearly isn't anything to do with this PR itself. I'll experiment with this and make sure I can reproduce it and then hopefully find a fix.

@xylar
Collaborator Author

xylar commented Mar 9, 2020

I can definitely reproduce the problem :-( Working on which package is causing it. Not just MPI. Doesn't seem to have to do with loading the system MPI either.

@xylar
Collaborator Author

xylar commented Mar 9, 2020

@mark-petersen, I was able to get things to run just fine if I call mpirun, e.g.:

mpirun -np 1 ./setup_testcase.py ...

The following also works:

srun -n 1 ./setup_testcase.py ...

The issue is anything that uses the libnetcdf package that was built for MPI, including the netcdf4 python package, which is used by pretty much everything.

That's obviously not a solution, just an observation.
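
To see which packages in an environment were built against MPI, the build strings printed by `conda list` can be inspected. The sketch below assumes the conda-forge convention of build strings starting with `nompi_` or `mpi_<implementation>_`; the function names and the default package list are hypothetical and not part of compass.

```python
import subprocess


def mpi_variant(build_string):
    """Classify a conda build string as 'nompi', an MPI flavor, or 'unspecified'.

    Assumes build strings follow the 'nompi_...' / 'mpi_<flavor>_...' convention.
    """
    if build_string.startswith("nompi_"):
        return "nompi"
    if build_string.startswith("mpi_"):
        parts = build_string.split("_")
        return parts[1] if len(parts) > 2 else "unspecified"
    return "unspecified"


def mpi_variants(conda_list_text, packages=("libnetcdf", "hdf5", "netcdf4")):
    """Map each package of interest to its MPI variant, given `conda list` output."""
    variants = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip comment/header lines and anything without name, version, build.
        if line.startswith("#") or len(fields) < 3:
            continue
        name, _version, build = fields[:3]
        if name in packages:
            variants[name] = mpi_variant(build)
    return variants


if __name__ == "__main__":
    # Lists the packages of the currently active conda environment.
    result = subprocess.run(["conda", "list"], capture_output=True,
                            text=True, check=True)
    for name, variant in mpi_variants(result.stdout).items():
        print(f"{name}: {variant}")
```

A result such as `libnetcdf: mpich` would indicate an MPI build, while `nompi` would indicate a serial build.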

@xylar
Collaborator Author

xylar commented Mar 9, 2020

I wonder if the change isn't with the compass environment but something on LANL IC?

@mark-petersen
Contributor

Hmm.. I can't get that to work with the QU240 init:

output

(compass_0.1.2) gr0407:base_mesh$ pwd
/lustre/scratch4/turquoise/mpeterse/runs/nightly/ocean_model_200309_PR462/ocean/global_ocean/QU240/init/base_mesh
(compass_0.1.2) gr0407:base_mesh$ module list
Currently Loaded Modules:
  1) git/2.21.0   2) gcc/5.3.0   3) openmpi/1.10.5   4) netcdf/4.4.1   5) parallel-netcdf/1.5.0   6) pio/1.7.2
(compass_0.1.2) gr0407:base_mesh$ which python
/usr/projects/climate/SHARED_CLIMATE/anaconda_envs/base/envs/compass_0.1.2/bin/python

(compass_0.1.2) gr0407:base_mesh$ mpirun -np 1 ./run.py
App launch reported: 1 (out of 1) daemons - 1 (out of 1) procs
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:

(compass_0.1.2) gr0407:base_mesh$ mpirun -n 1 ./jigsaw_to_MPAS/build_mesh.py
App launch reported: 1 (out of 1) daemons - 1 (out of 1) procs
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:

(compass_0.1.2) gr0407:base_mesh$ mpirun -n 1 python ./jigsaw_to_MPAS/build_mesh.py
App launch reported: 1 (out of 1) daemons - 1 (out of 1) procs
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:

(compass_0.1.2) gr0407:base_mesh$ srun -n 1 ./run.py
[cli_0]: write_line error; fd=11 buf=:cmd=init pmi_version=1 pmi_subversion=1
:

I'm not sure how that differs from your success above. Did you run setup_testcase.py and then successfully run the test case as well?

@xylar
Collaborator Author

xylar commented Mar 9, 2020

@mark-petersen, no, I just ran setup_testcase.py because that was the first thing that gave me the same error as you saw. I didn't have time to do the full test case and I don't right now. I'll try again in a while and get back to you if I come up with a more robust solution.

@xylar
Collaborator Author

xylar commented Mar 9, 2020

@mark-petersen, I agree, this "fix" is not working in more general cases.

@xylar
Collaborator Author

xylar commented Mar 9, 2020

@mark-petersen, per our phone call, I created a compass_0.1.2_serial environment and made load_latest_compass.sh point to that instead. Please try that out and let me know how it goes. If you try to run ESMF_RegridWeightGen with multiple processors, it will break (because it will try to do the same thing n times rather than actually working in parallel), so beware of that.
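
A script that launches ESMF_RegridWeightGen could guard against that multi-processor case. The following is only a sketch under stated assumptions: `regrid_command` is a hypothetical helper, the `_serial` suffix test mirrors the environment name above, `CONDA_DEFAULT_ENV` is the variable conda sets to the active environment name, and `srun -n` is an assumed launcher.

```python
import os


def regrid_command(ntasks, args, env_name=None, launcher=("srun", "-n")):
    """Build an ESMF_RegridWeightGen command line (hypothetical helper).

    Refuses ntasks > 1 when the active environment name ends in '_serial',
    since a serial build would repeat the same work once per task.
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    if env_name is None:
        env_name = os.environ.get("CONDA_DEFAULT_ENV", "")
    serial = env_name.endswith("_serial")
    if serial and ntasks > 1:
        raise RuntimeError(
            f"environment {env_name!r} is serial; "
            f"refusing to run ESMF_RegridWeightGen with {ntasks} tasks")
    command = ["ESMF_RegridWeightGen", *args]
    if serial:
        # Serial build: run the executable directly, without a launcher.
        return command
    # Otherwise prefix the (assumed) launcher and the task count.
    return [*launcher, str(ntasks), *command]
```

The returned list can be passed to `subprocess.check_call`, in the same way `run.py` invokes `jigsaw_to_MPAS.build_mesh` in the traceback above.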

@mark-petersen
Contributor

Yes, that worked great. compass_0.1.2_serial passed nightly regression suite on a compute node. Thanks, I can proceed with PRs now, except those involving the E3SM mapping files.

@mark-petersen mark-petersen self-requested a review March 9, 2020 21:26
@mark-petersen mark-petersen self-assigned this Mar 9, 2020
Contributor

@mark-petersen mark-petersen left a comment

Passes nightly regression suite in serial version. See comments about parallel version problems, but that is unrelated to this particular compass package version.

@mark-petersen mark-petersen merged commit 53f2505 into MPAS-Dev:ocean/develop Mar 9, 2020
@xylar xylar deleted the update_compass_readme branch March 9, 2020 21:43
ashwathsv pushed a commit to ashwathsv/MPAS-Model that referenced this pull request Jul 21, 2020
3 participants