Add LAMMPS tests to EESSI test-suite #131

Merged: 39 commits merged on Aug 14, 2024

laraPPr
Collaborator

@laraPPr laraPPr commented Apr 2, 2024

No description provided.

@laraPPr laraPPr marked this pull request as draft April 2, 2024 14:56
@laraPPr
Collaborator Author

laraPPr commented Apr 4, 2024

These tests are CPU-only. To run LAMMPS on GPUs you need the executable lmp_machine instead of lmp; see https://docs.lammps.org/Speed_gpu.html

@laraPPr
Collaborator Author

laraPPr commented Apr 17, 2024

The tests are now compatible with the LAMMPS package in EESSI and with https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/l/LAMMPS/LAMMPS-2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.eb.

So I think that it is ready for review.

@laraPPr laraPPr marked this pull request as ready for review April 17, 2024 14:47
@boegel
Contributor

boegel commented Apr 17, 2024

@laraPPr Can we somehow avoid including the *.rhodo files in the repository?
One of them is rather big (though it's basically a text file, but one with 191k lines 😅)

@laraPPr
Collaborator Author

laraPPr commented Apr 17, 2024

We discussed this in the last test-suite sync and said that it was OK for this one. I've already minimized its impact as much as possible by using a ReFrame option so that the file is not copied to the stage directory. But we can discuss it again tomorrow.

@boegel
Contributor

boegel commented Apr 18, 2024

Can you add a small README file that mentions where we got the input files, when they were obtained, from which version, what the SHA256 checksum is, etc.?
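
A minimal sketch of how the SHA256 checksums for such a README could be produced, using only Python's standard `hashlib`. The helper names `sha256sum` and `readme_checksum_lines` are hypothetical and not part of the test suite; the output format simply mimics the `sha256sum` command-line tool.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large input files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def readme_checksum_lines(paths):
    """Format one 'digest  filename' line per input file (sha256sum style)."""
    return [f"{sha256sum(p)}  {Path(p).name}" for p in paths]
```

The resulting lines can be pasted into the README next to the download date and the LAMMPS version the inputs were taken from.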


@performance_function('img/s')
def perf(self):
    regex = r'^(?P<perf>[.0-9]+)% CPU use with [0-9]+ MPI tasks x [0-9]+ OpenMP threads'
Collaborator

@smoors smoors Aug 5, 2024

this doesn't look right, this is the %CPU usage.

performance should be one of the following:

Performance: 0.823 ns/day, 29.175 hours/ns, 4.761 timesteps/s, 152.338 katom-step/s

Collaborator Author

@laraPPr laraPPr Aug 7, 2024

This one, then? Performance: 205379.307 tau/day, 475.415 timesteps/s, 15.213 Matom-step/s

Collaborator

your line comes from the lj test, while mine comes from the rhodo test.
let's take timesteps/s, as this unit is available for both tests?

Collaborator Author

@laraPPr laraPPr Aug 7, 2024

I've now added tau/day for lj and ns/day for rhodo, but I can also take timesteps/s.

Collaborator

@smoors smoors Aug 7, 2024

i just checked, and both tau/day for lj and ns/day for rhodo scale in exactly the same way as timesteps/s, so it doesn't really matter.

i do have a slight preference for timesteps/s as it is easy to understand. otherwise, can you add a comment explaining what exactly tau/day means?

Collaborator Author

I also do not know what tau/day is, and Google does not have a ready explanation, so I changed it to timesteps/s for both lj and rhodo.
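
To illustrate why timesteps/s works as a common unit, here is a small standalone sketch using plain `re` rather than ReFrame. The two sample lines are the ones quoted in this thread; the pattern and the function name `timesteps_per_second` are assumptions for illustration, not the code that was merged.

```python
import re

# The two "Performance:" lines quoted in this thread (rhodo first, then lj).
RHODO_LINE = "Performance: 0.823 ns/day, 29.175 hours/ns, 4.761 timesteps/s, 152.338 katom-step/s"
LJ_LINE = "Performance: 205379.307 tau/day, 475.415 timesteps/s, 15.213 Matom-step/s"

# Skip whatever precedes the timesteps/s field instead of spelling out
# ns/day or tau/day, so one pattern covers both benchmarks.
PERF_RE = re.compile(
    r"^Performance: .*?(?P<perf>[0-9]+(?:\.[0-9]+)?) timesteps/s",
    re.MULTILINE,
)


def timesteps_per_second(output):
    """Return the timesteps/s value from LAMMPS output, or None if absent."""
    match = PERF_RE.search(output)
    return float(match.group("perf")) if match else None
```

With these inputs, `timesteps_per_second(RHODO_LINE)` yields 4.761 and `timesteps_per_second(LJ_LINE)` yields 475.415.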

eessi/testsuite/tests/apps/lammps/lammps.py (outdated)
@laraPPr
Collaborator Author

laraPPr commented Aug 6, 2024

Just noticed something going really wrong when testing LAMMPS, related to issue #132:

ERROR on proc 0: Cannot open input script in.rhodo: No such file or directory (src/lammps.cpp:542)
Last command: (unknown)

I'm also seeing the one below, but I'm not sure whether I'm triggering it myself: I don't see it when there are no duplicate modules, and I cannot find this error in any of the test reports from the CI that I am running (TypeError: can only join an iterable).

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

EDIT: The sbatch: error triggers a bug in ReFrame older than 4.3 (I am using 4.2). See reframe-hpc/reframe#2885

@smoors
Collaborator

smoors commented Aug 8, 2024

seems to work well, though haven't tested on GPUs, our GPU nodes are too busy at the moment.

[ RUN      ] EESSI_LAMMPS_rhodo %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /90d91dda @hydra:zen4+default
[ RUN      ] EESSI_LAMMPS_lj %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /28a69ba4 @hydra:zen4+default
[       OK ] (1/2) EESSI_LAMMPS_lj %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /28a69ba4 @hydra:zen4+default
P: perf: 562.325 timesteps/s (r:0, l:None, u:None)
==> setup: 0.116s compile: 0.010s run: 49.726s sanity: 0.022s performance: 0.002s total: 49.982s
[       OK ] (2/2) EESSI_LAMMPS_rhodo %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /90d91dda @hydra:zen4+default
P: perf: 33.704 timesteps/s (r:0, l:None, u:None)
==> setup: 0.126s compile: 0.008s run: 57.376s sanity: 0.016s performance: 0.004s total: 57.656s
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Aug  8 11:06:54 2024 

===================================================================================================================================================================================================================
PERFORMANCE REPORT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[EESSI_LAMMPS_rhodo %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /90d91dda @hydra:zen4:default]
  num_tasks_per_node: 8
  num_tasks: 8
  num_cpus_per_task: 1
  performance:
    - perf: 33.704 timesteps/s (r: 0 timesteps/s l: -inf% u: +inf%)
[EESSI_LAMMPS_lj %scale=1_8_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /28a69ba4 @hydra:zen4:default]
  num_tasks_per_node: 8
  num_tasks: 8
  num_cpus_per_task: 1
  performance:
    - perf: 562.325 timesteps/s (r: 0 timesteps/s l: -inf% u: +inf%)

@smoors
Collaborator

smoors commented Aug 8, 2024

the only thing i'm still missing is an energy sanity check. i would use TotEng (total energy) for this.
for lj the total energy seems to be always exactly the same after 100 steps:

   Step          Temp          E_pair         E_mol          TotEng         Press
         0   1.44          -6.7733681      0             -4.6134356     -5.0197073
       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

for rhodo, the total energy is also very stable, with only the last digit showing slight differences:

------------ Step            100 ----- CPU =    0.4625225 (sec) -------------
TotEng   =    -25290.7300 KinEng   =     21591.9085 Temp     =       301.0906
PotEng   =    -46882.6385 E_bond   =      2567.9807 E_angle  =     10781.9571
E_dihed  =      5198.7492 E_impro  =       216.7864 E_vdwl   =     -1902.6618
E_coul   =    206659.5228 E_long   =   -270404.9730 Press    =         6.7407
Volume   =    308134.2285
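
A rough sketch of what such an energy check could look like outside ReFrame, using the two output snippets above as sample data. The function names and the tolerances used in the usage note are assumptions chosen for illustration; they are not taken from the merged test.

```python
import re

# Sample output copied from the two snippets above (lj table, rhodo block).
LJ_OUTPUT = """\
   Step          Temp          E_pair         E_mol          TotEng         Press
         0   1.44          -6.7733681      0             -4.6134356     -5.0197073
       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105
"""

RHODO_OUTPUT = """\
------------ Step            100 ----- CPU =    0.4625225 (sec) -------------
TotEng   =    -25290.7300 KinEng   =     21591.9085 Temp     =       301.0906
"""


def lj_total_energy(output, step=100):
    """TotEng at `step` from the one-line-per-step thermo table."""
    column = None
    for line in output.splitlines():
        fields = line.split()
        if fields[:1] == ["Step"] and "TotEng" in fields:
            # Remember which column holds TotEng in the header row.
            column = fields.index("TotEng")
        elif column is not None and fields[:1] == [str(step)]:
            return float(fields[column])
    return None


def rhodo_total_energy(output, step=100):
    """TotEng at `step` from the multi-line thermo block."""
    pattern = re.compile(
        r"^-+ Step\s+%d -+ CPU =.*\nTotEng\s+=\s+(?P<energy>-?[0-9.]+)" % step,
        re.MULTILINE,
    )
    match = pattern.search(output)
    return float(match.group("energy")) if match else None


def energy_ok(value, reference, abs_tol):
    """True if an energy was found and lies within abs_tol of the reference."""
    return value is not None and abs(value - reference) <= abs_tol
```

For example, `energy_ok(lj_total_energy(LJ_OUTPUT), -4.6223613, 1e-6)` and `energy_ok(rhodo_total_energy(RHODO_OUTPUT), -25290.7300, 1e-3)` both hold for the sample data; the looser rhodo tolerance is an assumed allowance for the last-digit variation mentioned above.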

@casparvl
Collaborator

casparvl commented Aug 8, 2024

I see Sam gave great comments on the content. I'm not familiar with LAMMPS, but at least checked that the runs succeeded.

All succeeded both on Karolina:

[----------] all spawned checks have finished

[  PASSED  ] Ran 26/26 test case(s) from 26 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Aug  8 13:34:46 2024+0200
Log file(s) saved in '/home/it4i-casparl/EESSI/reframe_runs/logs/reframe_20240808_133124.log'

And on Snellius:

[----------] all spawned checks have finished

[  FAILED  ] Ran 52/52 test case(s) from 26 check(s) (1 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Aug  8 13:44:38 2024+0200

The failure was due to a node failure, a rerun of that particular test succeeded. So... everything looks good from my side :)

@casparvl
Collaborator

casparvl commented Aug 8, 2024

Still to do: add a check that a CUDA build is built with Kokkos; this test requires it.

num_default = 0  # If this test already has executable opts, they must have come from the command line
hooks.check_custom_executable_opts(self, num_default=num_default)
if not self.has_custom_executable_opts:
    # should also check if the lammps is installed with kokkos.
Collaborator Author

Reminder for me to also look at this one again
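
One possible shape for such a check, as a standalone sketch. It assumes that a Kokkos build can be recognised by the string "kokkos" in the module name (true for the modules named in this PR, but only a naming convention), and the function name `accelerator_opts` is hypothetical. The switches `-k on g <N>` and `-sf kk` are the standard LAMMPS command-line options for enabling the KOKKOS package on GPUs.

```python
def accelerator_opts(module_name, device_type, gpus_per_node=1):
    """Pick LAMMPS command-line switches for a run, or refuse a non-Kokkos GPU build.

    Assumption: Kokkos builds carry "kokkos" in their module name.
    """
    if device_type != "gpu":
        # CPU runs need no accelerator switches.
        return []
    if "kokkos" not in module_name.lower():
        raise ValueError(f"{module_name} does not look like a Kokkos build")
    # Enable Kokkos on the requested number of GPUs and use the kk suffix styles.
    return ["-k", "on", "g", str(gpus_per_node), "-sf", "kk"]
```

In a ReFrame test the same condition would more naturally lead to skipping the test than to raising an exception.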

@casparvl
Collaborator

casparvl commented Aug 8, 2024

Karolina, LAMMPS/2Aug2023_update2-foss-2023a-kokkos

1 rank, rhodo:

------------ Step            100 ----- CPU =     37.46569 (sec) -------------
TotEng   =    -25290.7299 KinEng   =     21591.9085 Temp     =       301.0906

128 ranks, rhodo:

------------ Step            100 ----- CPU =    0.5564088 (sec) -------------
TotEng   =    -25290.7302 KinEng   =     21591.9084 Temp     =       301.0906

16*128 ranks, rhodo:

------------ Step            100 ----- CPU =    0.4018642 (sec) -------------
TotEng   =    -25290.7300 KinEng   =     21591.9085 Temp     =       301.0906

1 rank, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

128 ranks, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

16*128 ranks, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

Snellius, LAMMPS/2Aug2023_update2-foss-2023a-kokkos

1 rank, rhodo:

------------ Step            100 ----- CPU =     30.80414 (sec) -------------
TotEng   =    -25290.7299 KinEng   =     21591.9085 Temp     =       301.0906

128 ranks, rhodo:

------------ Step            100 ----- CPU =    0.4229304 (sec) -------------
TotEng   =    -25290.7302 KinEng   =     21591.9084 Temp     =       301.0906

16*128 ranks, rhodo:

------------ Step            100 ----- CPU =     47.82593 (sec) -------------
TotEng   =    -25290.7300 KinEng   =     21591.9085 Temp     =       301.0906

1 rank, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

128 ranks, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

16*128 ranks, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

@casparvl
Collaborator

casparvl commented Aug 8, 2024

Just as another check, I ran with an older version of LAMMPS (LAMMPS/23Jun2022-foss-2022a-kokkos):

128 ranks, lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

128 ranks, rhodo:

------------ Step            100 ----- CPU =    0.4350029 (sec) -------------
TotEng   =    -25290.7300 KinEng   =     21591.9085 Temp     =       301.0906

Seems the total energy is consistent across versions. But, I am getting:

WARNING: skipping evaluation of performance variable 'perf': not enough matches of pattern '^Performance: [.0-9]+ tau/day, (?P<perf>[.0-9]+) timesteps/s,' in file 'rfm_job.out' so as to extract item 0

I'm not entirely sure why, because I do see:

Performance: 1200769.669 tau/day, 2779.559 timesteps/s

And that seems to match...?

@casparvl
Collaborator

casparvl commented Aug 8, 2024

With yet another module, LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1, on H100 GPUs:
4 ranks (gpus), lj:

       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105

4 ranks (gpus), rhodo:

------------ Step            100 ----- CPU =    0.5430899 (sec) -------------
TotEng   =    -25290.7302 KinEng   =     21591.9084 Temp     =       301.0906

So, also consistent results for TotEng on GPU.

@laraPPr
Collaborator Author

laraPPr commented Aug 8, 2024

It's the trailing comma in the pattern; I can remove that one.

@casparvl
Collaborator

casparvl commented Aug 8, 2024

Ah, the trailing comma, you're right!
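
The effect is easy to reproduce with plain `re`. The sketch below takes the pattern from the warning above and the two `Performance:` lines quoted in this PR; the helper name `extract` is only for illustration.

```python
import re

# Pattern from the ReFrame warning, with and without the trailing comma.
OLD_PATTERN = r"^Performance: [.0-9]+ tau/day, (?P<perf>[.0-9]+) timesteps/s,"
NEW_PATTERN = r"^Performance: [.0-9]+ tau/day, (?P<perf>[.0-9]+) timesteps/s"

# lj output of the older LAMMPS ends after timesteps/s; the newer one appends a field.
OLDER_LAMMPS = "Performance: 1200769.669 tau/day, 2779.559 timesteps/s"
NEWER_LAMMPS = "Performance: 205379.307 tau/day, 475.415 timesteps/s, 15.213 Matom-step/s"


def extract(pattern, line):
    """Return the captured timesteps/s value, or None if the pattern does not match."""
    match = re.search(pattern, line, re.MULTILINE)
    return float(match.group("perf")) if match else None
```

`extract(OLD_PATTERN, OLDER_LAMMPS)` returns `None` because nothing follows `timesteps/s` in the older output, while `NEW_PATTERN` matches both lines.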

@casparvl
Collaborator

casparvl commented Aug 8, 2024

I also tried to run on GPUs. Works fine, except multinode rhodo - but that seems to be a problem in our UCX stack https://bugs.launchpad.net/ubuntu/+source/ucx/+bug/2055222
I'm not sure why I'm not encountering that for the lj case, but the error is so clearly the one in that bug report, that I don't believe the issue is in this test itself.

@laraPPr
Collaborator Author

laraPPr commented Aug 9, 2024

Added the energy check and added support for LAMMPS built with the GPU package (no Kokkos).

Collaborator

@casparvl casparvl left a comment

Since this is pure MPI, it's probably good to call hooks.set_compact_process_binding.
Also, please call req_memory_per_node and specify how much memory the test needs. This makes sure the test gets skipped on systems with insufficient memory.

Collaborator

@casparvl casparvl left a comment

Ok, did two more successful runs: on Snellius & Karolina. All good. Going in, thanks @laraPPr !

@casparvl casparvl merged commit 74d0d82 into EESSI:main Aug 14, 2024
10 checks passed