Memory balancing necessary? #16
OK, so after measuring the (approximate) memory needs of the individual tasks, this should be the assignment we have shortly before the program dies (in GB):
Each process group has less than 1024 GB available, so the problem arises when task 27, at 1062 GB, is added to process group '0x30da350'. Task 27 has [l_x, l_y, l_z, l_v, l_w] = [10, 5, 4, 4, 3]. This means that yes, we have an imbalance, but no, we could not have distributed the tasks in any way such that this large task fits onto one process group!
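A quick sanity check of that last claim: no assignment can succeed if a single task already exceeds the per-group capacity, no matter how cleverly we balance. A minimal sketch, where only task 27's figure and the 1024 GB capacity are taken from the measurements above (the other task sizes and the helper name are made up for illustration):

```python
def assignment_feasible(task_memory_gb, capacity_gb):
    """An assignment can exist only if every single task fits into one group
    on its own (necessary, not sufficient: total memory matters too)."""
    return all(m <= capacity_gb for m in task_memory_gb)

# task id -> estimated GB; task 27 from the issue, the others illustrative
tasks = {27: 1062, 26: 530, 25: 270}
print(assignment_feasible(tasks.values(), 1024))  # False: task 27 alone breaks it
```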
This here is the data I used (in case of follow-up problems):
I am again running a weak scaling on Hawk (2GB/core, 128 cores/node).
This time, I am trying to get sensible scaling data for GENE+DisCoTec. But the required resolutions and memory footprints make it a lot harder!
While the scheme finishes fine on one process group of 4096 workers,
it seems that we're running out of memory when running the same problem on eight process groups of 512 workers each:
It might be that one process group is assigned more memory-intensive GENE tasks than the others. This would mean that memory does not correlate strongly enough with the run time of the first step for our current round-robin assignment approach to work. (Remember: we have an estimate based on run time; I am currently using the gridpoint-based LinearLoadModel, which we use to assign one grid to each process group. The remaining grids are then assigned to process groups as they finish. For instance, if there is a component grid that takes VERY long to compute its first time step, no other grid would be assigned to that group.)
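A gridpoint-based linear load model can be sketched roughly like this. The point-count formula (product of 2^l_i per dimension, ignoring boundary points) is an assumption for illustration, and the function name is made up, not DisCoTec's actual LinearLoadModel API:

```python
from math import prod

def linear_load(level_vector):
    """Estimated load, assumed proportional to the number of grid points
    of the component grid: prod(2**l_i) over all dimensions."""
    return prod(2 ** l for l in level_vector)

# Task 27 from the measurements above:
print(linear_load([10, 5, 4, 4, 3]))  # 2**26 = 67108864 points
```

The issue is exactly that such a point count tracks run time reasonably well but not necessarily the per-task memory footprint.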
I am currently working to verify that this is the exact problem.
The other possible source of error (that I can imagine) would be that the overhead of the 3 additionally allocated sparse grids is the culprit. Given the sparse grid size of 481371297 grid points * 16 bytes (double complex) * 3 ≈ 21.5 gibibytes in total, this seems unlikely (this overhead is spread over all 4096 workers, so only a few MiB each!).
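Re-deriving that overhead figure, assuming 16 bytes per double-complex grid point as stated:

```python
points = 481_371_297   # sparse grid points (from the measurement above)
bytes_per_point = 16   # one double-complex value per point
extra_grids = 3        # additionally allocated sparse grids
workers = 4096

total_bytes = points * bytes_per_point * extra_grids
print(total_bytes / 2**30)            # ≈ 21.5 GiB in total
print(total_bytes / workers / 2**20)  # ≈ 5.4 MiB per worker
```

Either way, a few MiB per worker is nowhere near enough to explain the crash.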
This would make some kind of "memory balancing" necessary in the case of memory scarcity (in analogy to load balancing in the task assignment).
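As a first sketch, such memory balancing could reuse the classic greedy heuristic from load balancing, just applied to memory estimates instead of run times: sort tasks by estimated memory, descending, and always give the next task to the process group with the most free memory. This is an illustration under those assumptions, not DisCoTec code:

```python
import heapq

def memory_aware_assign(task_mem, n_groups):
    """Greedy 'largest first' heuristic on memory: repeatedly place the
    next-largest task into the currently least-loaded process group."""
    # min-heap over (memory already assigned, group id)
    groups = [(0.0, g) for g in range(n_groups)]
    heapq.heapify(groups)
    assignment = {}
    for tid, mem in sorted(task_mem.items(), key=lambda kv: -kv[1]):
        used, g = heapq.heappop(groups)   # least-loaded group
        assignment[tid] = g
        heapq.heappush(groups, (used + mem, g))
    return assignment

# Toy example: four tasks (id -> GB) onto two groups
print(memory_aware_assign({0: 8, 1: 7, 2: 3, 3: 2}, 2))
```

Of course this only helps when no single task exceeds a group's capacity; for a case like task 27 above, the level selection itself would have to change.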