
Write out a job script for suite and test cases during setup #376

Merged: 9 commits into MPAS-Dev:master, Apr 28, 2022

Conversation

@xylar (Collaborator) commented Apr 25, 2022

On supported machines, write out a job script during setup. The job script uses mache to figure out what account, partition, QOS and constraint to use. It defaults to 1 hour wall-clock time (typically, more than enough) and determines the number of nodes based on the geometric mean (often a practical compromise) of the minimum and target number of cores required for a test case or suite.

Users (and developers) can alter the defaults with a config file, as detailed in the updated User's Guide.

@pep8speaks commented Apr 25, 2022

Hello @xylar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-04-27 07:32:19 UTC

@xylar (Collaborator Author) commented Apr 25, 2022

Testing

I am in the process of running a couple of tests on each machine.

I will check off here as I successfully run the ocean pr test suite on each machine:

  • Anvil
  • Badger
  • Chrysalis
  • Compy
  • Cori-Haswell

I will check off here when I have successfully run the MALI full_integration suite on each machine:

  • Anvil
  • Badger
  • Chrysalis
  • Cori-Haswell (errors in validation of 2 tests that we have seen before)

Comment on lines +36 to +42
# as a rule of thumb, let's do the geometric mean between min and target
cores = np.sqrt(target_cores*min_cores)
nodes = int(np.ceil(cores/cores_per_node))
xylar (Collaborator Author):

The geometric mean seems to be a good guess for what folks might want but I'm up for other suggestions for how to handle the node count.
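The snippet above can be made self-contained as follows (a sketch with the missing import and a hypothetical wrapper function added; the numbers in the example are illustrative, not from any real test suite):

```python
import numpy as np

def guess_nodes(target_cores, min_cores, cores_per_node):
    # as a rule of thumb, take the geometric mean between min and target
    cores = np.sqrt(target_cores * min_cores)
    return int(np.ceil(cores / cores_per_node))

# e.g. a suite that targets 512 cores but can run on as few as 128,
# on a machine with 36 cores per node: geometric mean is 256 cores
print(guess_nodes(512, 128, 36))  # prints 8
```

The geometric mean lands between the minimum and the target, so the job is neither starved of cores nor padded out to the full target allocation.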

@xylar xylar self-assigned this Apr 25, 2022
@xylar xylar added enhancement New feature or request framework labels Apr 25, 2022
@xylar (Collaborator Author) commented Apr 25, 2022

The account on Badger (from mache) is e3sm. That account no longer exists but there isn't an obvious default. I think that's a case where a little config file is the best option:

$ cat t22_ocean_time_step.cfg 
[parallel]

account = t22_ocean_time_step
$ compass suite -s -f t22_ocean_time_step.cfg ...

@xylar (Collaborator Author) commented Apr 25, 2022

@sbrus89, @mark-petersen and @matthewhoffman, I'm still finishing my testing on Cori and Compy but this is ready for you to test and/or review when you want.

If any of you feel like this isn't a feature you would use anyway, no need to review it.

@mark-petersen (Collaborator) left a comment

This works great, I love it! It will really simplify my process, as I usually want to run many tests simultaneously and can only run one interactive job. Thanks also for the flag for the job account. I tested on Badger with and without the flag, as follows:

./conda/configure_compass_env.py --conda /usr/projects/climate/mpeterse/miconda3 \
--compiler gnu --env_name job_script
source load_job_script_badger_gnu_mvapich.sh
compass suite -s -c ocean -t nightly -p $EXEDIR -w $WORKDIR -f t22_ocean_time_step.cfg
cd $WORKDIR
sbatch job_script.sh
# wait...
cat compass.o*

On LANL IC machines, let's use t22_ocean_time_step as the default. I own that, and I'm happy to add anyone for COMPASS testing, which is a small change (I think I've already added most ocean users). That is better than e3sm, which does not exist for anyone. When I apply next year, I'll choose a more inclusive account name.

@@ -110,27 +101,9 @@ def write(work_dir, test_cases, mpas_core=None, config_filename=None,
provenance_file.close()


def _get_mpas_git_version(mpas_core, config_filename, mpas_model_path):
Member:

It looks like this section was moved to a different file. This seems somewhat unrelated to the PR description, so just checking this was intentional to include in this PR. If it's just a convenience thing for planned future changes, that's fine with me.

xylar (Collaborator Author):

I can understand that this seems unrelated but it is, in fact, related.

For provenance, it was previously necessary to have config options that come from most sources (defaults, mache, machine-specific, MPAS-core and user) but where a test group and test case has not yet been determined so those are not included.

Then, quite redundantly, I built all this up again for each test case, adding more config options for the test group and test case, and to call the test case's configure() method.

That code was duplicated because it initially seemed trivial but slowly got more complex. For this PR, I needed this "basic" set of config options in a third place (for making job scripts), and it was clearly time to fix the unnecessary duplication. That's what you're seeing here.

I could pull this out into its own PR if you like but it is related.

xylar (Collaborator Author):

I did pull this out into its own PR: #378

Member:

I was fine with it being included in this PR, but pulling it out is fine too.

[parallel]

# account for running diagnostics jobs
account = t22_ocean_time_step
Member:

It makes sense to have the default be the account that the most people have access to. Trevor, Holly, and I won't be able to use this for MALI tests (unless Mark wanted to add us, but I wouldn't expect that to happen just for this reason). It would be convenient if for us we could default to t22_iceocean, but I expect making a core-specific default is not worth the trouble. A different option is that if the account is not included at all in the job script, SLURM will use the user's default account (which for me is set to t22_iceocean). The only downside there is if a new user has not set a default account, but I would suggest it makes sense for users to do that on LANL IC. So I guess my preference would be for account to not be included in the Badger job script.

@xylar , @mark-petersen , what do you think about this issue?

xylar (Collaborator Author):

I agree, no default account is best. I have taken it out (left it blank).

xylar (Collaborator Author):

It would be convenient if for us we could default to t22_iceocean, but I expect making a core-specific default is not worth the trouble.

This would be possible but hard to maintain. There isn't currently a good way to have config options specific to both an MPAS core and a machine. We could add account_ocean and account_landice config options that get used if account is blank. But the t22_* accounts are obviously a moving target, whereas the default account seems easy. If folks haven't set a default, they will just have to weigh the relative pain of providing a config file against setting a default.

Member:

I should have adjusted my comment - by the end of writing it I felt that leaving account blank to make use of a user's default is the best solution. I agree that implementing and maintaining core-specific values sounds like a bad idea.
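Skipping the account directive when it is blank can be handled in the job-script template itself. Here is a rough sketch assuming a Jinja2 template (which the `{%- endif %}` in the actual template suggests compass uses); the template text is illustrative, not the real one:

```python
import jinja2

# A minimal, illustrative job-script template -- not the actual compass one.
template = jinja2.Template(
    '#!/bin/bash\n'
    '#SBATCH --nodes={{ nodes }}\n'
    '{%- if account %}\n'
    '#SBATCH --account={{ account }}\n'
    '{%- endif %}\n'
)

# With account blank, no --account line is emitted, so SLURM falls back
# to the user's default account.
print(template.render(nodes=4, account=''))
```

The `{%-` markers strip the preceding newline, so an empty account leaves no blank line behind in the rendered script.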

{%- endif %}

source load_compass_env.sh
compass run
@matthewhoffman (Member) commented Apr 27, 2022

We could add the name of the pickle file to run here. If you have multiple suites set up in one directory, then as written this will fail. (We often do, so I encountered that error.) Similarly, for a single test case, does it make more sense to have this script appear in the root of the workdir (as happens now), or in the subdir where the pickle file for that test lives?

That said, it's reasonable enough to consider this submission script as a template and expect that a user can/will adjust it for the specific suite/pickle they want to run. If we do insert the pickle name here, then we probably want to put that in the script name as well so that setting up multiple suites doesn't clobber existing scripts and so it's clear that you are submitting a script specific to a particular suite/pickle.

@xylar (Collaborator Author) commented Apr 27, 2022

Oh, I can't believe I didn't think of that. I have added it now. It's pretty obvious that we want that feature. The suite name isn't included in compass run for test case job scripts, but it is part of suite ones.

text = template.render(job_name=job_name, account=account,
nodes=f'{nodes}', wall_time=wall_time, qos=qos,
partition=partition, constraint=constraint)
script_filename = os.path.join(work_dir, 'job_script.sh')
Member:

How about adding 'compass' in the name? And then possibly the suite name or test case number (see my other comment)? E.g. compass_job_script.suite_full_integration.sh or compass_job_script.test_61.sh. If that's too complicated or requires too many changes, it's fine as is.

xylar (Collaborator Author):

I agree that having the suite name makes total sense. I think the suite name will just be custom when you set up test cases with numbers. I don't think test_61_62_63 or anything like that is very helpful. I'm not sure if that information is easily available by the time the job script is being created either.

xylar (Collaborator Author):

How about adding 'compass' in the name?

I did this but I'm not sure why it's important. I would think the compass work directory would implicitly only be for doing stuff with compass.

Member:

Those changes sound good. I agree you can leave out the 'compass' if you prefer - this is all clearly happening within compass!

xylar (Collaborator Author):

I put in compass_job_script. The only slight downside to me is that the log file and the job script start with the same thing so tab completion is more tedious. That's a silly thing to worry about, though.
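The naming scheme settled on above can be sketched as follows (a simplified illustration, not the actual compass code; the helper name is hypothetical):

```python
import os

def job_script_filename(work_dir, suite_name):
    # Include the suite name so that setting up multiple suites in one
    # work directory does not clobber an existing job script.
    return os.path.join(work_dir, f'compass_job_script.{suite_name}.sh')

# Test cases set up individually with "compass setup" form the "custom"
# suite, so their script would be compass_job_script.custom.sh.
print(job_script_filename('/path/to/workdir', 'custom'))
```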

@matthewhoffman (Member) left a comment

@xylar , this is a fantastic quality of life addition, and I was actually thinking about this in the last few days! I made a few comments about some of the details of the implementation. I'm not committed to any of them, but wanted to have a little bit more discussion about expected behavior and conventions before this PR gets finalized.

@xylar xylar force-pushed the add_job_scripts branch 2 times, most recently from 126e85e to 8743f25 Compare April 27, 2022 06:55
@xylar xylar force-pushed the add_job_scripts branch 2 times, most recently from cbc0ecc to 85e79b7 Compare April 27, 2022 07:21
@xylar (Collaborator Author) commented Apr 27, 2022

@matthewhoffman, I believe I've addressed your requested changes. Please give it another look.

I also realized that I had made an error in generating the job scripts for individual tests -- those job scripts were ending up in the base work directory instead of the tests themselves. I fixed this and have now tested that you can run a test case on its own as well.

@xylar xylar removed the request for review from sbrus89 April 27, 2022 07:34
@xylar (Collaborator Author) commented Apr 27, 2022

Taking @sbrus89 off as a reviewer since he's been out sick.

@sbrus89 sbrus89 self-requested a review April 27, 2022 20:59
@sbrus89 (Collaborator) left a comment

@xylar, this worked great for me running the nightly suite on Compy.

@sbrus89 (Collaborator) commented Apr 27, 2022

This also worked for me on a single test case: compass setup -t ocean/soma/32km/default

@xylar (Collaborator Author) commented Apr 27, 2022

Thanks, @sbrus89! @matthewhoffman, could you take another look and update your review if you're happy?

@matthewhoffman (Member) commented

@xylar , I took another look and the new behavior addresses all the issues I brought up. I tried a couple suites and one single test case.

One side question - is it normal to have compass_job_script.custom.sh and custom.pickle appear in the workdir root when setting up a single test? I remember the concept of custom suites, but I've never explored it myself.

Also, if you want to drop compass_ from the job script name, that's fine with me. With the SLURM log files also starting with compass, tab completion is less efficient, but that is certainly a minor consideration.

@sbrus89 (Collaborator) commented Apr 28, 2022

I was wondering about the compass_job_script.custom.sh and custom.pickle as well. Will those always be specific to the most recent single test case setup?

@xylar (Collaborator Author) commented Apr 28, 2022

One side question - is it normal to have compass_job_script.custom.sh and custom.pickle appear in the workdir root when setting up a single test? I remember the concept of custom suites, but I've never explored it myself.

Yes, that's the normal behavior. A test case or several of them set up with compass setup is considered a "custom" suite.

@xylar (Collaborator Author) commented Apr 28, 2022

I was wondering about the compass_job_script.custom.sh and custom.pickle as well. Will those always be specific to the most recent single test case setup?

Yes. In general, you should use different work directories for setting up different sets of test cases (so that the custom suite doesn't get overwritten).

@sbrus89 and @matthewhoffman, are you both using the same work directory for setting up several test cases or suites, one after the other?

@xylar (Collaborator Author) commented Apr 28, 2022

I think I'm going to leave compass_ in the job script name just to save the time of rerunning CI and re-testing. We can clean it up later if we want...

@xylar (Collaborator Author) commented Apr 28, 2022

Thanks @mark-petersen, @matthewhoffman and @sbrus89 for the reviews! I hope this is a useful new feature.

@xylar xylar merged commit 1b98ab4 into MPAS-Dev:master Apr 28, 2022
@xylar xylar deleted the add_job_scripts branch April 28, 2022 15:42