Write out a job script for suite and test cases during setup #376
Conversation
Testing
I am in the process of running a couple of tests on each machine. I will check off here as I successfully run the ocean
I will check off here when I have successfully run the MALI
# as a rule of thumb, let's do the geometric mean between min and target
cores = np.sqrt(target_cores*min_cores)
nodes = int(np.ceil(cores/cores_per_node))
The geometric mean seems to be a good guess for what folks might want but I'm up for other suggestions for how to handle the node count.
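To make the rule of thumb concrete, here is a minimal sketch of the node-count calculation with hypothetical core counts (the function name and example numbers are illustrative, not from compass):

```python
import numpy as np


def estimate_nodes(target_cores, min_cores, cores_per_node):
    """Pick a node count from the geometric mean of the minimum and
    target core counts, rounded up to whole nodes."""
    cores = np.sqrt(target_cores * min_cores)
    return int(np.ceil(cores / cores_per_node))


# hypothetical example: a suite that wants 128 cores but can run on 32,
# on a machine with 36 cores per node
print(estimate_nodes(128, 32, 36))  # geometric mean is 64 cores -> 2 nodes
```

The geometric mean lands between the two extremes on a log scale, so a suite with a very large target doesn't grab far more nodes than its minimum requires.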
The account on Badger (from
@sbrus89, @mark-petersen and @matthewhoffman, I'm still finishing my testing on Cori and Compy but this is ready for you to test and/or review when you want. If any of you feel like this isn't a feature you would use anyway, no need to review it.
This works great, I love it! It will really simplify my process, as I usually want to run many tests simultaneously and can only run one interactive job. Thanks also for the flag for the job account. I tested on Badger with and without the flag, as follows:
./conda/configure_compass_env.py --conda /usr/projects/climate/mpeterse/miconda3 \
--compiler gnu --env_name job_script
source load_job_script_badger_gnu_mvapich.sh
compass suite -s -c ocean -t nightly -p $EXEDIR -w $WORKDIR -f t22_ocean_time_step.cfg
cd $WORKDIR
sbatch job_script.sh
# wait...
cat compass.o*
On LANL IC machines, let's use the t22_ocean_time_step account as the default. I own that, and I'm happy to add anyone for COMPASS testing, which is a small change (I think I've already added most ocean users). That is better than e3sm, which does not exist for anyone. When I apply next year, I'll choose a more inclusive account name.
compass/provenance.py
Outdated
@@ -110,27 +101,9 @@ def write(work_dir, test_cases, mpas_core=None, config_filename=None,
    provenance_file.close()


def _get_mpas_git_version(mpas_core, config_filename, mpas_model_path):
It looks like this section was moved to a different file. This seems somewhat unrelated to the PR description, so just checking this was intentional to include in this PR. If it's just a convenience thing for planned future changes, that's fine with me.
I can understand that this seems unrelated but it is, in fact, related.
For provenance, it was previously necessary to have config options that come from most sources (defaults, mache, machine-specific, MPAS-core and user) but where a test group and test case has not yet been determined, so those are not included.
Then, quite redundantly, I built all this up again for each test case, adding more config options for the test group and test case, and to call the test case's configure() method.
That code was duplicated because it initially seemed trivial but slowly got more complex. For this PR, I needed this "basic" set of config options in a 3rd place (for making job scripts) and it was clearly time to fix the unnecessary duplicate code. That's what you're seeing here.
I could pull this out into its own PR if you like but it is related.
I did pull this out into its own PR: #378
I was fine with it being included in this PR, but pulling it out is fine too.
compass/machines/badger.cfg
Outdated
[parallel]

# account for running diagnostics jobs
account = t22_ocean_time_step
It makes sense to have the default be the account that the most people have access to. Trevor, Holly, and I won't be able to use this for MALI tests (unless Mark wanted to add us, but I wouldn't expect that to happen just for this reason). It would be convenient if for us we could default to t22_iceocean, but I expect making a core-specific default is not worth the trouble. A different option is that if the account is not included at all in the job script, SLURM will use the user's default account (which for me is set to t22_iceocean). The only downside there is if a new user has not set a default account, but I would suggest it makes sense for users to do that on LANL IC. So I guess my preference would be for account to not be included in the Badger job script.
@xylar , @mark-petersen , what do you think about this issue?
I agree, no default account is best. I have taken it out (left it blank).
It would be convenient if for us we could default to t22_iceocean, but I expect making a core-specific default is not worth the trouble.

This would be possible but hard to maintain. There isn't currently a good way to have config options for an MPAS core and machine. We could add account_ocean and account_landice config options that get used if account is blank. But the t22_* accounts are obviously a moving target, whereas the default account seems easy. If folks haven't set a default, they will just have to weigh the relative pain of providing a config file against setting a default.
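A minimal sketch of the fallback logic being discussed, assuming plain dict-like config access (the account_ocean/account_landice option names are hypothetical and were not adopted; compass ultimately just leaves account blank so SLURM uses the user's default):

```python
def choose_account(config, mpas_core):
    """Return the account for the job script: the explicit `account`
    option if set, otherwise a hypothetical per-core fallback, otherwise
    None so the #SBATCH --account line is omitted entirely and SLURM
    falls back to the user's default account."""
    account = config.get('account', '').strip()
    if account:
        return account
    fallback = config.get(f'account_{mpas_core}', '').strip()
    return fallback or None


config = {'account': '', 'account_landice': 't22_iceocean'}
print(choose_account(config, 'landice'))  # t22_iceocean
print(choose_account(config, 'ocean'))    # None -> SLURM default account
```

Returning None (rather than an empty string) matters: an empty `#SBATCH --account=` line would be rejected, whereas omitting the line lets the scheduler apply its own default.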
I should have adjusted my comment - by the end of writing it I felt that leaving account blank to make use of a user's default is the best solution. I agree that implementing and maintaining core-specific values sounds like a bad idea.
compass/job/template.sh
Outdated
{%- endif %}

source load_compass_env.sh
compass run
We could add the name of the pickle file to run here. If you have multiple suites set up in one directory, then as written this will fail. (We often do, so I encountered that error.) Similarly, for a single test case, does it make more sense to have this script appear in the root of the workdir (as happens now), or in the subdir where the pickle file for that test lives?
That said, it's reasonable enough to consider this submission script as a template and expect that a user can/will adjust it for the specific suite/pickle they want to run. If we do insert the pickle name here, then we probably want to put that in the script name as well so that setting up multiple suites doesn't clobber existing scripts and so it's clear that you are submitting a script specific to a particular suite/pickle.
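To illustrate the suggestion, here is a simplified stand-in for rendering the submission script with a specific suite name baked into the `compass run` line. The real compass/job/template.sh is a Jinja2 template; this sketch uses Python's string.Template instead, and the suite name is illustrative:

```python
from string import Template

# simplified stand-in for compass/job/template.sh (really Jinja2);
# the $suite placeholder is the point: each generated script targets
# exactly one suite's pickle file instead of a bare `compass run`
template = Template(
    '#!/bin/bash\n'
    'source load_compass_env.sh\n'
    'compass run $suite\n'
)

script = template.substitute(suite='nightly')
print(script)
```

With the suite name rendered in, a work directory containing several suites no longer produces an ambiguous script that fails at run time.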
Oh, I can't believe I didn't think of that. I have added it now. It's pretty obvious that we want that feature. The suite name isn't included in compass run for test case job scripts but it is part of the suite ones.
compass/job/__init__.py
Outdated
text = template.render(job_name=job_name, account=account,
                       nodes=f'{nodes}', wall_time=wall_time, qos=qos,
                       partition=partition, constraint=constraint)
script_filename = os.path.join(work_dir, 'job_script.sh')
How about adding 'compass' in the name? And then possibly the suite name or test case number (see my other comment)? E.g. compass_job_script.suite_full_integration.sh or compass_job_script.test_61.sh. If that's too complicated or requires too many changes, it's fine as is.
I agree that having the suite name makes total sense. I think the suite name will just be custom when you set up test cases with numbers. I don't think test_61_62_63 or anything like that is very helpful. I'm not sure if that information is easily available by the time the job script is being created either.
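A sketch of the naming scheme being settled on here (the function name and paths are illustrative):

```python
import os


def job_script_filename(work_dir, suite_name=None):
    """Build a job script name that includes the suite name, so setting
    up several suites in one work directory doesn't clobber earlier
    scripts; test cases set up by number fall back to 'custom'."""
    suite = suite_name or 'custom'
    return os.path.join(work_dir, f'compass_job_script.{suite}.sh')


print(job_script_filename('/tmp/work', 'nightly'))
# /tmp/work/compass_job_script.nightly.sh
print(job_script_filename('/tmp/work'))
# /tmp/work/compass_job_script.custom.sh
```

The suite-specific name also makes it unambiguous at `sbatch` time which suite a given script will run.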
How about adding 'compass' in the name?
I did this but I'm not sure why it's important. I would think the compass work directory would implicitly only be for doing stuff with compass.
Those changes sound good. I agree you can leave out the 'compass' if you prefer - this is all clearly happening within compass!
I put in compass_job_script. The only slight downside to me is that the log file and the job script start with the same thing, so tab completion is more tedious. That's a silly thing to worry about, though.
@xylar , this is a fantastic quality of life addition, and I was actually thinking about this in the last few days! I made a few comments about some of the details of the implementation. I'm not committed to any of them, but wanted to have a little bit more discussion about expected behavior and conventions before this PR gets finalized.
This is helpful because the constraint isn't named correctly in the current release of mache.
@matthewhoffman, I believe I've addressed your requested changes. Please give it another look. I also realized that I had made an error in generating the job scripts for individual tests -- those job scripts were ending up in the base work directory instead of the tests themselves. I fixed this and have now tested that you can run a test case on its own as well.
Taking @sbrus89 off as a reviewer since he's been out sick.
@xylar, this worked great for me running the nightly suite on Compy.
This also worked for me on a single test case:
Thanks, @sbrus89! @matthewhoffman, could you take another look and update your review if you're happy?
@xylar , I took another look and the new behavior addresses all the issues I brought up. I tried a couple suites and one single test case. One side question - is it normal to have
Also, if you want to drop
I was wondering about the
Yes, that's the normal behavior. A test case or several of them set up with
Yes. In general, you should use different work directories for setting up different sets of test cases (so that the custom suite doesn't get overwritten). @sbrus89 and @matthewhoffman, are you both using the same work directory for setting up several test cases or suites, one after the other?
I think I'm going to leave
Thanks @mark-petersen, @matthewhoffman and @sbrus89 for the reviews! I hope this is a useful new feature.
On supported machines, write out a job script during setup. The job script uses mache to figure out what account, partition, QOS and constraint to use. It defaults to 1 hour wall-clock time (typically, more than enough) and determines the number of nodes based on the geometric mean (often a practical compromise) of the minimum and target number of cores required for a test case or suite. Users (and developers) can alter the defaults with a config file, as detailed in the updated User's Guide.