List of supported platforms for the release #67

Closed
climbfuji opened this issue Jan 23, 2020 · 104 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@climbfuji
Collaborator

Here is a link to a spreadsheet that lists the supported platforms/compilers and who has access to these:

https://docs.google.com/spreadsheets/d/122uasMhD8aF_s6jUpy7t_Io5JSNGYOmBddtT6kh0dSQ/edit#gid=0

Please use this spreadsheet to indicate the readiness for testing and who is testing on which platform (maybe also indicate success/failures - this has not been set up yet in the spreadsheet). We can also use this GitHub issue to report successes/failures.

@rsdunlapiv
Collaborator

In discussions with @ligiabernardet on the documentation, we had come to the conclusion that there can be confusion over what is meant by a "supported" platform. With respect to CIME, we are thinking of this in two categories:

  • supported means that the platform has the basic prerequisites to run the whole application, but it does not mean that machine-specific files have been set up for CIME
  • preconfigured means that CIME has been set up already with machine-specific files, and so the app should work out-of-the-box with no porting steps required

There is also the idea of:

  • tested platform is a specific machine that members of the release team have actually tested in some way, e.g., run the CIME regression test suite, or run the model regression tests

Clearing up the terminology will be important. So if a user from a university says that they have a Linux cluster with Intel 19, Intel MPI, and the dependent libraries, then that would be a supported platform in the sense that if there is an issue with installation we would expect to help them. It would not, however, be a tested platform, since no one from the release team has worked on that machine.

A Microsoft Windows desktop is not a supported or tested platform.
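
To make the "supported" category concrete, here is a minimal sketch of a prerequisite check. The function name and the list of tools are assumptions for illustration, not an official UFS or CIME requirement list.

```shell
# Hypothetical helper: report which basic build prerequisites are missing.
# The body runs in a subshell, so its variables do not leak into the caller.
check_prereqs() (
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the tool can be found on PATH
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing:+$missing }$tool"
        fi
    done
    if [ -z "$missing" ]; then
        echo "all prerequisites found"
    else
        echo "missing: $missing"
        return 1
    fi
)

# Example call; the tool names are assumptions (compiler, MPI wrapper, build tools)
check_prereqs gfortran mpicc cmake make || echo "platform still needs installation work"
```

A platform that passes a check like this would at best be "supported" in the sense above; it says nothing about whether CIME machine files exist (preconfigured) or whether anyone on the release team ran tests there (tested).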

@arunchawla-NOAA added the documentation label Jan 29, 2020
@ligiabernardet
Collaborator

Can someone please add a column to this spreadsheet and indicate which platforms are preconfigured?

@climbfuji
Collaborator Author

I will add the column but the entries for many will be TBD - will depend on how far we get in the next weeks.

@ligiabernardet
Collaborator

Given the definition of preconfigured platform "preconfigured means that CIME has been set up already with machine-specific files, and so the app should work out-of-the-box with no porting steps required", I do not understand why the spreadsheet mentions certain OS as "in progress" wrt preconfiguring. How can MacOS Catalina ever be preconfigured? A user's Mac laptop will not be preconfigured upon purchase; the user will have to configure it.
Does the spreadsheet need to be modified so that only specific machines with actual names (e.g., Hera, Cheyenne etc.) can be preconfigured?

@rsdunlapiv
Collaborator

I agree - a MacOS laptop is never preconfigured - you always have to go through the process of installing NCEPlibs and setting up CIME on that laptop. (Unless the configuration was so standard that CIME would always just work out of the box on MacOS - but that seems very unlikely.)

@climbfuji
Collaborator Author

Yes, we need two different categories - preconfigured and supported.

@mvertens
Collaborator

mvertens commented Feb 3, 2020 via email

@jedwards4b
Collaborator

I think that we can create a CIME port that will work on macOS or Linux. The user would need to create inputdata and output directories and set environment variables for them, along with the set of environment variables in NOAA-EMC/NCEPLIBS#30. However, as of this post I am still not able to build NCEPLIBS on a Mac.
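
A rough sketch of what that user-side setup could look like. UFS_INPUT and the ufs-inputdata subdirectory appear later in this thread; UFS_SCRATCH and the default location under $HOME are placeholder assumptions, and the variables from NOAA-EMC/NCEPLIBS#30 are not reproduced here.

```shell
# Sketch: create input and output locations and export variables for them.
# UFS_SCRATCH is an assumed name for the output location.
setup_ufs_dirs() {
    ufs_base="${1:-$HOME/ufs}"
    export UFS_INPUT="$ufs_base/input"
    export UFS_SCRATCH="$ufs_base/scratch"
    # mkdir -p is safe to repeat: existing directories are left untouched
    mkdir -p "$UFS_INPUT/ufs-inputdata" "$UFS_SCRATCH"
}
```

Calling `setup_ufs_dirs /some/writable/path` once per shell session would be enough for this sketch; putting the exports in a shell profile is an alternative.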

@climbfuji
Collaborator Author

I think you should be able to use the two repos I sent you earlier, the macos rpath update was made (thanks, Kyle). If you don't want to use the NCEPLIBS-external, then you need to install the dependencies by yourself. But I assume (and it would be good if someone tested it) that the problem with NetCDF from NCEPLIBS-external has something to do with my machine setup. Let me know if you need further instructions. Thanks!

@jedwards4b
Collaborator

@climbfuji I am not yet able to use the two repos you indicated and am having problems well before getting to the netcdf build. The latest problem seems to be in the hdf5 build:

/Users/jedwards/src/NCEPLIBS-external/build/hdf5/src/hdf5-build/CMakeFiles/CheckIncludeFiles/C_HAVE_QUADMATH.c:2:10: fatal error: 'quadmath.h' file not found
#include <quadmath.h>
         ^~~~~~~~~~~~
1 error generated.

@climbfuji
Collaborator Author

This is macOS, right? So, the question is which way you chose to install your prerequisites: compiler, MPI library. Your error could be related to using homebrew's mpi library, which uses the default Apple gcc (which is clang with much less functionality than LLVM's clang). Continues below the list ...

  1. Use homebrew to install gcc@9, then set environment variables CC=gcc-9, FC=gfortran-9, CXX=g++-9.
  2. Download and compile openmpi-4.0.2 or mpich-3.3.1 manually with those compilers, install to a place outside of homebrew to not mess it up. Add the mpi bin directory to PATH and the MPI lib directory to LD_LIBRARY_PATH.
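
The two steps above could be scripted roughly as follows. This is a sketch under assumptions: the MPI install prefix is a placeholder, and the install and build commands are shown only as comments, not executed.

```shell
# Step 1 (run once, not executed here): brew install gcc@9
use_homebrew_gcc9() {
    export CC=gcc-9
    export FC=gfortran-9
    export CXX=g++-9
}

# Step 2: after building MPI with those compilers into a prefix outside
# Homebrew, expose it to the shell. A typical Open MPI build would be:
#   ./configure --prefix="$mpi_prefix" && make && make install
use_mpi_prefix() {
    mpi_prefix="$1"
    export PATH="$mpi_prefix/bin:$PATH"
    # Append the old value only if it was non-empty, to avoid a trailing colon
    export LD_LIBRARY_PATH="$mpi_prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, `use_homebrew_gcc9; use_mpi_prefix "$HOME/opt/openmpi-4.0.2"` (path hypothetical) would set up a shell for the NCEPLIBS-external build.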

Another note I found in my install instructions for Mojave:

# Fix missing header files in /usr/include for macOS Mojave - doesn't exist on Catalina?
open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg
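
Since the package is specific to Mojave, the step could be guarded by a version check. A minimal sketch, assuming the version string comes from `sw_vers -productVersion`; the function names are made up for illustration.

```shell
# Succeeds only for macOS 10.14.x (Mojave) version strings.
needs_sdk_headers() {
    case "$1" in
        10.14|10.14.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Opens the header package installer only on Mojave; does nothing elsewhere.
install_sdk_headers_if_needed() {
    if command -v sw_vers >/dev/null 2>&1 &&
       needs_sdk_headers "$(sw_vers -productVersion)"; then
        open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg
    fi
}
```

On Catalina or Linux, `install_sdk_headers_if_needed` is a no-op, which matches the note that the package does not seem to exist there.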

@jedwards4b
Collaborator

Yesterday I was using a second Mac, not the one I first attempted to install on, and attempted to follow your instructions to the letter. I humbly submit: if I can't do it, it's not ready for public consumption.

@climbfuji
Collaborator Author

Please check if you did the last part, which was not included in my instructions (somehow missed that, because my new/test system uses Catalina and not Mojave).

# Fix missing header files in /usr/include for macOS Mojave - doesn't exist on Catalina?
open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

@jedwards4b
Collaborator

jedwards4b commented Feb 4, 2020 via email

@climbfuji
Collaborator Author

They are not documented yet because we haven't finalized the instructions. You can either choose to wait until we have tested this out entirely, or you will have to accept that you are a testing buddy who will run into trouble in order to help point out flaws in the process (and I very much appreciate having people to find issues in the process). Knowing that we have to add this fix for the missing header files on Mojave systems is one example of this. I am sorry, but I just can't work faster.

There are some notes in the README.md of NCEPLIBS-external, at https://github.com/NOAA-EMC/NCEPLIBS-external/tree/master or (this is where I will make the updates today) https://github.com/climbfuji/NCEPLIBS-external/tree/esmf_make_remove_curl_add_wgrib2. It would be super helpful if you could try the "fix missing header files" step and see if this solves the problem you ran into, and note anything you find that is not working in the issues on my fork (while working with my branches).

If you prefer to test the libraries on a better supported platform, please use a generic linux box with the gnu compilers (or intel compilers) in the meanwhile. Thank you for your patience and help!

@jedwards4b
Collaborator

@climbfuji I don't expect a finished set of instructions - I'm just suggesting a shared doc that lists the steps and that we can modify as we go along. I think we need to get all of the instructions in one place.

@climbfuji
Collaborator Author

I think we can work directly on the place where this should go, I started putting the instructions for Catalina there:

https://github.com/NOAA-EMC/NCEPLIBS-external/wiki

Formatting can be improved, but this should get us started.

@jedwards4b
Collaborator

jedwards4b commented Feb 4, 2020 via email

@climbfuji
Collaborator Author

Now you have write permissions.

@rsdunlapiv
Collaborator

@climbfuji and @jedwards4b what is the status of the MacOS build, instructions, and CIME testing?

@jedwards4b
Collaborator

I am now able to run the chgres_cube on the mac but I am getting a dynamic library load error from the model that I haven't been able to figure out:

[1] dyld: Library not loaded: @rpath/libnetcdff.7.dylib
[1]   Referenced from: /Users/jedwards/projects/scratch/SMS_Lh3.C96.GFSv15p2.homebrew_gnu.20200207_135212_sv4w0f/bld/ufs.exe
[1]   Reason: image not found

That @rpath should be /usr/local/ufs-release-v1/lib. What is really confusing is that chgres shows this same problem with the netcdf libraries but runs anyway.
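
For reference, a sketch of the usual macOS tools for looking at this kind of error. `otool` and `install_name_tool` ship with the Xcode command line tools; whether adding an rpath entry actually resolves this particular case is an assumption, so the helper below (a made-up name) only prints the commands for review rather than running them.

```shell
# Print diagnostic and repair commands for a binary that fails to resolve
# an @rpath library. Nothing is executed; the output is meant to be reviewed.
rpath_commands() {
    exe="$1"
    libdir="$2"
    # Libraries the binary links against (look for @rpath/... entries)
    printf 'otool -L %s\n' "$exe"
    # LC_RPATH load commands embedded in the binary, if any
    printf 'otool -l %s | grep -A2 LC_RPATH\n' "$exe"
    # Add one search path so @rpath/libnetcdff.7.dylib can be located
    printf 'install_name_tool -add_rpath %s %s\n' "$libdir" "$exe"
}

rpath_commands bld/ufs.exe /usr/local/ufs-release-v1/lib
```

Comparing the `LC_RPATH` output of ufs.exe with that of chgres_cube.exe might explain why one runs and the other does not.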

@climbfuji
Collaborator Author

Does chgres really show

[1] dyld: Library not loaded: @rpath/libnetcdff.7.dylib
[1]   Referenced from: ... /chgres_cube.exe
[1]   Reason: image not found

?

@jedwards4b
Collaborator

jedwards4b commented Feb 8, 2020 via email

@ceceliadid

@climbfuji I can still ask some GST folks to build the libs, I just don't want it to be the default for them.

@climbfuji
Collaborator Author

The GST as a temporary step is not the problem (for me at least). But the UFS release when it goes out the door on March 6 must require users to build the libraries by themselves on configured but not supported platforms.

@ceceliadid

This is the current definition of level 2 support:
Configurable platforms are platforms where all of the required libraries for building community releases of UFS models and applications are expected to install successfully, but are not available in a central place (it may be a user directory). Applications and models are expected to build and run once the required libraries are built or located.

This implies that it is permissible to have a non-centralized location for the libs. Do you want to change the wording and say they always have to be built?

@jedwards4b
Collaborator

@ceceliadid how about if we document the environment variables that need to be set and provide suggested values for them (that should work) with some wording about these not being official install paths? That way the user doesn't need to build these libraries or download the inputdata unless they choose to do so.

@ceceliadid

@jedwards4b That seems okay to me, but it seems like a more general policy question that others will have to answer.

I tried building the WM on Stampede2 without CIME with the paths that Arun provided (from a few comments back) and it bails out quickly with:

CMake Error at cmake/FindESMF.cmake:12 (file):
file STRINGS file
"/work/01118/tg803972/stampede2/UFS/NCEP_LIBS_ALL.1.0.0.beta01/esmf.mk" cannot be read

Is that a permission problem with the lib dir path? I can see the directory /work/01118/tg803972/stampede2 but I can't see anything in the UFS dir.

@jedwards4b
Collaborator

@ceceliadid
ESMFMKFILE should be set to /work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install/lib64/esmf.mk
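
A small pre-flight check can catch this before CMake does. This is a sketch with a made-up function name; it only tests that ESMFMKFILE is set and points to a file the current user can read.

```shell
# Fail early if the ESMF makefile fragment is unset or unreadable.
check_esmfmkfile() {
    if [ -z "${ESMFMKFILE:-}" ]; then
        echo "ESMFMKFILE is not set" >&2
        return 1
    fi
    if [ ! -r "$ESMFMKFILE" ]; then
        echo "cannot read $ESMFMKFILE" >&2
        return 1
    fi
    echo "ESMFMKFILE ok: $ESMFMKFILE"
}
```

Typical use would be `export ESMFMKFILE=<path to esmf.mk>` followed by `check_esmfmkfile && cmake ..`, so an unreadable path shows up before configuration starts.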

@ceceliadid

@jedwards4b Thanks but that still gives me:
ESMFMKFILE: /work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install/lib64/esmf.mk
CMake Error at cmake/FindESMF.cmake:12 (file):
file STRINGS file
"/work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install/lib64/esmf.mk"
cannot be read.

@climbfuji
Collaborator Author

Same problem here. This is why I am insisting on project spaces and not user directories. Instructions for compiling the libraries on stampede will be available shortly, straightforward and quick.

login2(422)$ ls /work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install/lib64/esmf.mk
ls: cannot access /work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install/lib64/esmf.mk: Permission denied
login2(423)$ whoami
tg854455
login2(424)$ groups
G-81589

@uturuncoglu
Collaborator

I am not sure why you can't access it, because Jim is not in my group and he could access it. Anyway, you could also install the externals using the following information.

# Stampede2
source /opt/apps/lmod/lmod/init/sh
module purge
module load TACC python/2.7.15 intel/18.0.2 cmake/3.10.2
module rm mvapich2
module load impi/18.0.2 pnetcdf/1.11.0
module load netcdf/4.6.2
module load cmake
module load hdf5/1.10.4

# NCEPLIBS-external
git clone --recursive -b master https://github.com/NOAA-EMC/NCEPLIBS-external.git
cd NCEPLIBS-external/
mkdir build-all
cd build-all
NETCDF=$TACC_NETCDF_DIR CC=mpicc FC=mpif90 CXX=mpicxx cmake -DBUILD_MPI=OFF -DBUILD_NETCDF=OFF -DBUILD_PNG=OFF -DBUILD_JASPER=OFF -DMPITYPE=intelmpi -DCMAKE_INSTALL_PREFIX=$PWD/install ..
make -j 4

# NCEPLIBS
git clone --recursive -b ufs-v1.0.0.beta01 https://github.com/NOAA-EMC/NCEPLIBS.git NCEP_LIBS_ALL.1.0.0.beta01
cd NCEP_LIBS_ALL.1.0.0.beta01
mkdir build-all
cd build-all
NETCDF=$TACC_NETCDF_DIR HDF5=$TACC_HDF5_DIR CC=mpicc FC=mpif90 CXX=mpicxx cmake -DMPITYPE=intelmpi -DCMAKE_INSTALL_PREFIX=$PWD/install -DEXTERNAL_LIBS_DIR=/work/01118/tg803972/stampede2/UFS/NCEPLIBS-external/build/install ..
make

@jedwards4b
Collaborator

@ceceliadid Sorry - those directories and files are world readable, I don't understand why you can't read them. But I guess we've validated Dom's viewpoint.

@climbfuji
Collaborator Author

What the sysadmins did on gaea a while ago is that they made the very top-level directories readable to the respective groups only. Nobody realized that. If that's not the case, maybe they put ACL restrictions on one of the directories in the path?
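
One way to find out which directory in the path is the culprit is to walk it component by component. A minimal sketch for absolute paths, with a made-up function name; note that a component hidden by permissions and a component that simply does not exist look the same to this check.

```shell
# Print the first component of an absolute path that the current user cannot
# enter (directory without search permission) or cannot read (anything else,
# including entries that are missing or hidden by a parent's permissions).
first_inaccessible() (
    path="$1"
    current=""
    IFS="/"     # split the path on slashes; local because the body is a subshell
    set -f      # disable globbing while the unquoted path is expanded
    for part in $path; do
        [ -z "$part" ] && continue
        current="$current/$part"
        if [ -d "$current" ]; then
            [ -x "$current" ] || { echo "$current"; return 0; }
        elif [ ! -r "$current" ]; then
            echo "$current"
            return 0
        fi
    done
    echo "no inaccessible component found"
)
```

On Linux, `namei -l <path>` gives a similar per-component view with owners and modes, and `getfacl` on the reported directory may show ACL entries if the plain permission bits look fine.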

@climbfuji
Collaborator Author

See here for the PR containing the setup instructions for Stampede with Intel: NOAA-EMC/NCEPLIBS-external#29

@ceceliadid

I can see all the dirs in /work/01118/tg803972/stampede2 but cannot read into any of them. Anyway, @climbfuji, I guess I'm coming around to needing to build the libs for now. Thanks for the instructions; I'll give it a shot.

@climbfuji
Collaborator Author

Follow-up question: does this also affect the files in DIN_LOC_ROOT? On the generic platforms, the data is downloaded using the ./check_input_data --download command in the absence of a shared DIN_LOC_ROOT location.

@jedwards4b
Collaborator

jedwards4b commented Feb 24, 2020 via email

@climbfuji
Collaborator Author

login2(426)$ ls /work/01118/tg803972/stampede2/UFS/ufs_inputdata
ls: cannot access /work/01118/tg803972/stampede2/UFS/ufs_inputdata: Permission denied

so I fear we need to use the download script there as well ... or get it from HPSS / Niagara?

@climbfuji
Collaborator Author

Ok, so right now this is not working on orion, because it wants me to set DIN_LOC_ROOT. I assume the "transition" to using the download script is not complete (also, the hashes in manage externals have not been updated to include the generic linux pieces for the two cime repositories - sidenote).

login2(495)$  ./cime/scripts/create_newcase --case c96-gfsv15p2-gnu --compset GFSv15p2 --res C96 --workflow ufs-mrweather --machine stampede2-skx
Compset longname is FCST_ufsatm%v15p2_SLND_SICE_SOCN_SROF_SGLC_SWAV
Compset specification file is /scratch/06146/tg854455/ufs-mrweather-app/my-ufs-sandbox/src/model/FV3/cime/cime_config/config_compsets.xml
Automatically adding SESP to compset
Compset forcing is
ATM component is UFSATM Atmosphere with:CCPP physics version 15p2
LND component is Stub land component
ICE component is Stub ice component
OCN component is Stub ocn component
ROF component is Stub river component
GLC component is Stub glacier (land ice) component
WAV component is Stub wave component
ESP component is Stub external system processing (ESP) component
Pes     specification file is /scratch/06146/tg854455/ufs-mrweather-app/my-ufs-sandbox/src/model/FV3/cime/cime_config/config_pes.xml
Compset specific settings: name is RUN_STARTDATE and value is 2019-08-29
Compset specific settings: name is START_TOD and value is 0
Compset specific settings: name is COMP_CLASSES and value is ATM
Compset specific settings: name is CHECK_TIMING and value is FALSE
Machine is stampede2-skx
Pes setting: grid match    is a%C96
Pes setting: compset_match is ufsatm
Pes setting: grid          is a%C96_l%null_oi%null_r%null_g%null_w%null_z%null_m%null
Pes setting: compset       is FCST_ufsatm%v15p2_SLND_SICE_SOCN_SROF_SGLC_SWAV_SESP
Pes setting: tasks       is {'NTASKS_ATM': 108}
Pes setting: threads     is {'NTHRDS_ATM': 1}
Pes setting: rootpe      is {}
Pes setting: pstrid      is {}
Pes other settings: {}
Pes comments: none
 Compset is: FCST_ufsatm%v15p2_SLND_SICE_SOCN_SROF_SGLC_SWAV_SESP
 Grid is: a%C96_l%null_oi%null_r%null_g%null_w%null_z%null_m%null
 Components in compset are: ['ufsatm', 'slnd', 'sice', 'socn', 'srof', 'sglc', 'swav', 'sesp']
Using project from config_machines.xml: TG-ATM190009
No charge_account info available, using value from PROJECT
ufs model version found: 450047f
Batch_system_type is slurm
job is case.chgres USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None
job is case.gfs_post USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None
job is case.st_archive USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None
 Creating Case directory /scratch/06146/tg854455/ufs-mrweather-app/my-ufs-sandbox/c96-gfsv15p2-gnu
login2(496)$ cd c96-gfsv15p2-gnu
login2(497)$ ./case.setup
ERROR: Undefined env var 'DIN_LOC_ROOT'

@jedwards4b
Collaborator

@climbfuji DIN_LOC_ROOT is where the data is OR where the download will put the data if it isn't already there; there is no separate script.

@climbfuji
Collaborator Author

Sure, but I don't need to set this on macos or generic linux. I assume that all supported but not preconfigured platforms would be treated the same way? Only preconfigured platforms would have the data in a central location, predefined as DIN_LOC_ROOT?

@jedwards4b
Collaborator

The issue is that we have some inconsistency - on macos and linux we are defining
<DIN_LOC_ROOT>$ENV{UFS_INPUT}/ufs-inputdata</DIN_LOC_ROOT>

So we should probably define it the same way on Stampede and Orion.
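
From the user's side, that definition means UFS_INPUT has to be set before case.setup runs. A sketch of the equivalent shell logic, with a made-up function name; it only mirrors the XML entry above and does not touch CIME itself.

```shell
# Mirror <DIN_LOC_ROOT>$ENV{UFS_INPUT}/ufs-inputdata</DIN_LOC_ROOT> in shell:
# print the resolved location, or fail if UFS_INPUT is missing.
resolve_din_loc_root() {
    if [ -z "${UFS_INPUT:-}" ]; then
        echo "UFS_INPUT is not set" >&2
        return 1
    fi
    echo "$UFS_INPUT/ufs-inputdata"
}
```

With that in place, something like `export UFS_INPUT=/some/writable/path` and `mkdir -p "$(resolve_din_loc_root)"` before `./case.setup` should avoid the undefined variable error, and `./check_input_data --download` can then fill the directory.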

@climbfuji
Collaborator Author

Yes, I think that would be great. Thanks for adjusting the stampede configuration so quickly yesterday. I am currently working on orion. Tomorrow I'll do gaea (down for maintenance today).

@ceceliadid

@arunchawla-NOAA @jedwards4b @climbfuji I changed the text for level 2 support to this, which I think is more accurate:

Configurable platforms are platforms where all of the required libraries for building community releases of UFS models and applications are expected to install successfully, but are not available in a central place. Applications and models are expected to build and run once the required bundled libraries (NCEPLIBS) and third-party libraries (NCEP-external) are built.

This is on the Supported Platforms page
https://github.com/ufs-community/ufs/wiki/Supported-Platforms-and-Compilers

If that looks okay and the rest of that page looks okay, I think we could close this ticket, since we would have an agreed-upon Supported Platforms listing. Any remaining issues with level 2 supported platforms could be addressed in more specific tickets.

@climbfuji
Collaborator Author

climbfuji commented Feb 25, 2020 via email

@jedwards4b
Collaborator

jedwards4b commented Feb 25, 2020 via email

@ceceliadid

Thanks, made that change and will close.


9 participants