Overwhelmed storage with too many read requests when running on > 14,000 MPI processes #781
Replies: 1 comment
-
I cannot comment on this admin's policy, but I have never seen the number of reads be an issue. SMILEI is regularly run with more than 14000 MPI processes on other systems, and there the I/O-related limit is usually the allocated space in the scratch partition. Overall, it is good practice to reduce the number of I/O operations as much as you can to improve performance. If you want to reduce the number of MPI processes, one possibility is to use OpenMP threads via OMP_NUM_THREADS (for instance, 256 x 14 MPI tasks on 256 x 56 cores with OMP_NUM_THREADS=4, i.e. 14 tasks and 4 threads per 56-core node). Note: this issue has been converted into a discussion as it is not a bug (Smilei was running fine; the admin killed the job).
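If it helps, here is a minimal sketch of what such a hybrid MPI+OpenMP job could look like, assuming a Slurm batch script with the ibrun launcher as on Frontera; the partition, module, and executable names are placeholders, not a verified configuration:

```bash
#!/bin/bash
# Hybrid MPI+OpenMP sketch: 256 nodes x 14 MPI tasks/node x 4 OpenMP threads
# = 3584 ranks using all 56 cores of each node.
#SBATCH -J smilei_2d
#SBATCH -N 256               # number of nodes
#SBATCH -n 3584              # total MPI tasks (14 per node instead of 56)
#SBATCH -p normal            # hypothetical partition name
#SBATCH -t 24:00:00

export OMP_NUM_THREADS=4     # 14 tasks x 4 threads = 56 cores per node
export OMP_PROC_BIND=true    # keep OpenMP threads pinned to their cores

ibrun ./smilei input.py      # executable and namelist names are placeholders
```

Fewer MPI ranks means fewer processes hitting the file system at once, while the OpenMP threads keep all cores busy.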
-
Hello, I am trying some large-scale runs with 14336 MPI processes (256 nodes x 56 cores/node) on the Frontera machine at TACC (https://tacc.utexas.edu/systems/frontera/). My jobs were cancelled by the admins in the middle of the run; they complained that the job was overwhelming the storage targets with too many read requests and asked us to reduce I/O. Is this normal when running SMILEI at this scale? When I halved the size of the problem and ran on 128 nodes (7168 MPI processes), I had no issues (no complaints and no cancellation of my jobs).
Here is my job script and input file:
input.txt
job.txt
These are 2D Cartesian simulations, and the diagnostics include a simple probe (at selected points), fields on the grid, and particle binning.
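For reference, here is a rough sketch of the kind of diagnostics setup described above (illustrative values and species names, not the actual attached input.txt); the `every` arguments control how often each diagnostic touches the file system, so increasing them is the most direct way to reduce I/O:

```python
# Hypothetical diagnostics block of a Smilei namelist (values are illustrative only).

DiagProbe(
    every = 1000,                    # write probe data every 1000 iterations
    origin = [100., 50.],            # a single probe point (2D coordinates)
)

DiagFields(
    every = 2000,                    # dump grid fields less frequently
    fields = ["Ex", "Ey", "Rho"],
)

DiagParticleBinning(
    deposited_quantity = "weight",
    every = 2000,
    species = ["electron"],          # must match a Species block in the namelist
    axes = [["x", 0., 1024., 512]],  # 1D binning along x with 512 bins
)
```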
Any solutions/comments are appreciated.