cache epsilon computations for MPB to improve MPI scaling #1257

Merged · 7 commits merged into master on Jun 26, 2020

Conversation

@stevengj (Collaborator) commented Jun 19, 2020

Work towards #1255.

(Still needs debugging, @oskooi.)

@oskooi (Collaborator) commented Jun 23, 2020

There are currently two failing tests on Travis (special_kz.py and oblique_source.py) showing the same error message:

CHECK failure on line 104 of maxwell_eps.c: singular 3x3 matrix
CHECK failure on line 104 of maxwell_eps.c: singular 3x3 matrix

Increasing the resolution slightly (special_kz.py:eigsrc_kz from 30 to 40; oblique_source.py from 50 to 60) produces a different error which reveals that the process is aborting due to heap corruption:

Using MPI version 3.1, 2 processes
complex
-----------
Initializing structure...
Halving computational cell along direction y
Splitting into 2 chunks evenly
time for choose_chunkdivision = 0.00223154 s
Working in 2D dimensions.
Computational cell is 14 x 14 x 0 with resolution 40
     block, center = (0,0,0)
          size (1e+20,1,1e+20)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (12,12,12)
time for set_epsilon = 0.156297 s
-----------
Meep: using complex fields.
corrupted size vs. prev_size
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fa2726f7890]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fa272332e97]
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fa272334801]
[ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89897)[0x7fa27237d897]
[ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9090a)[0x7fa27238490a]
[ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x95acf)[0x7fa272389acf]
[ 6] /lib/x86_64-linux-gnu/libc.so.6(realloc+0x36b)[0x7fa27238cf9b]
[ 7] /home/oskooi/install/meep6/src/.libs/libmeep.so.19(+0xca628)[0x7fa27090d628]
[ 8] /usr/local/lib/libmpb.so.1(set_maxwell_dielectric+0x984)[0x7fa270419254]
[ 9] /home/oskooi/install/meep6/src/.libs/libmeep.so.19(_ZN4meep6fields13get_eigenmodeEdNS_9directionENS_6volumeES2_iRKNS_3vecEbiddPdPPv+0x1ec5)[0x7fa27090fdc5]
[10] /home/oskooi/install/meep6/src/.libs/libmeep.so.19(_ZN4meep6fields20add_eigenmode_sourceENS_9componentERKNS_8src_timeENS_9directionERKNS_6volumeES8_iRKNS_3vecEbiddSt7complexIdEPFSD_SB_E+0x10d)[0x7fa270910aed]
[11] /home/oskooi/install/meep6/python/meep/_meep.so(+0xecb0e)[0x7fa270c7fb0e]

Running python/tests/oblique_source.py under gdb and printing a backtrace shows:

Using MPI version 3.1, 1 processes
-----------
Initializing structure...
time for choose_chunkdivision = 0.00139393 s
Working in 2D dimensions.
Computational cell is 10 x 10 x 0 with resolution 60
     block, center = (0,0,0)
          size (1e+20,1,1e+20)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (2.25,2.25,2.25)
time for set_epsilon = 0.922465 s
-----------
corrupted size vs. prev_size

Thread 1 "python3.5" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7805801 in __GI_abort () at abort.c:79
#2  0x00007ffff784e897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff797bb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff785590a in malloc_printerr (str=str@entry=0x7ffff7979c9d "corrupted size vs. prev_size") at malloc.c:5350
#4  0x00007ffff785aacf in _int_realloc (av=av@entry=0x7ffff7bb0c40 <main_arena>, oldp=oldp@entry=0x156f960, oldsize=oldsize@entry=6160, nb=nb@entry=12304) at malloc.c:4564
#5  0x00007ffff785df9b in __GI___libc_realloc (oldmem=0x156f970, bytes=12288) at malloc.c:3230
#6  0x00007ffff5e4112f in meep::meep_mpb_eps (eps=0x7fffffffa460, eps_inv=0x7fffffffa4b0, r=0x7fffffffa540, eps_data_=0x7fffffffae00) at mpb.cpp:71
#7  0x00007ffff593c254 in set_maxwell_dielectric (md=0x1539620, mesh_size=<optimized out>, R=0x7fffffffaf70, G=<optimized out>, 
    epsilon=0x7ffff5e40ed0 <meep::meep_mpb_eps(symmetric_matrix*, symmetric_matrix*, mpb_real const*, void*)>, mepsilon=0x0, epsilon_data=0x7fffffffae00) at maxwell_eps.c:498
#8  0x00007ffff5e437c2 in meep::fields::get_eigenmode (this=0x1529150, frequency=1, d=meep::NO_DIRECTION, where=..., eig_vol=..., band_num=1, _kpoint=..., match_frequency=true, parity=2, resolution=120, 
    eigensolver_tol=9.9999999999999998e-13, kdom=0x0, user_mdata=0x0) at mpb.cpp:413
#9  0x00007ffff5e4575c in meep::fields::add_eigenmode_source (this=0x1529150, c0=meep::Dielectric, src=..., d=meep::NO_DIRECTION, where=..., eig_vol=..., band_num=1, kpoint=..., match_frequency=true, 
    parity=2, resolution=0, eigensolver_tol=9.9999999999999998e-13, amp=..., A=0x0) at mpb.cpp:743
#10 0x00007ffff627a180 in _wrap_fields_add_eigenmode_source__SWIG_1 (args=0x7fffc638d2a8) at meep-python.cxx:86166
#11 0x00007ffff627a6d4 in _wrap_fields_add_eigenmode_source (self=0x7ffff664a778, args=0x7fffc638d2a8) at meep-python.cxx:86252

The problem seems to be the realloc statement within the function meep_mpb_eps at src/mpb.cpp:71:

https://github.com/NanoComp/meep/pull/1257/files#diff-e4ab557c3ba9e1876312d1a70976c3c3R71
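For reference, the classic failure mode behind a glibc "corrupted size vs. prev_size" abort is a realloc whose byte count omits a sizeof factor, so subsequent writes overrun the block; the "add missing sizeof" commit in this PR is consistent with that pattern. A minimal sketch with hypothetical names (not meep's actual code):

#include <cstdlib>

// Hypothetical cache of 6-component symmetric matrices, grown on demand.
struct sym_matrix { double m00, m01, m02, m11, m12, m22; };

static sym_matrix *cache = NULL;
static size_t capacity = 0;

void ensure_capacity(size_t needed) {
  if (needed > capacity) {
    capacity = 2 * needed;
    // BUG: passes a count of *elements* where realloc expects *bytes*,
    // so writes to cache[i] overrun the block and corrupt the heap:
    //   cache = (sym_matrix *) realloc(cache, capacity);
    // FIX: include the element size in the byte count.
    cache = (sym_matrix *) realloc(cache, capacity * sizeof(sym_matrix));
  }
}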

@oskooi (Collaborator) commented Jun 24, 2020

After applying the bug fix to src/mpb.cpp:71 described above and verifying that all tests in the make check suite pass, a benchmarking test for this PR involving a large 3d simulation with a 2d source plane (ridge waveguide cross section) reveals practically no speedup relative to master. The test is performed on a single machine (i.e., shared memory, no network interconnect) with 14 MPI processes/chunks.

The test for master involves timing the call to set_maxwell_dielectric in src/mpb.cpp:385 via the wall_time() function:

set_maxwell_dielectric(mdata, mesh_size, R, G, meep_mpb_eps, NULL, &eps_data);
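(A minimal sketch of the timing wrapper, using meep's wall_time() and master_printf() helpers; the exact placement inside get_eigenmode is paraphrased:)

double t0 = wall_time();
set_maxwell_dielectric(mdata, mesh_size, R, G, meep_mpb_eps, NULL, &eps_data);
master_printf("set_maxwell_dielectric:, %g s\n", wall_time() - t0);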

master

set_maxwell_dielectric:, 1067.82 s

The test for this PR involves timing each of the two calls to set_maxwell_dielectric as well as the sum_to_all call in between them.

this PR

set_maxwell_dielectric1:, 0.661123 s
sum_to_all:, 1067.43 s
set_maxwell_dielectric2:, 0.000506878 s

These results demonstrate that the sum_to_all call in this PR is taking just as long as the single call to set_maxwell_dielectric in master (even though set_maxwell_dielectric has been sped up considerably).

@stevengj (Collaborator, Author) commented:

You might try just timing an all_wait(); call right before the sum_to_all(), to see if the wait time is just due to one process taking a long time to reach that point.

@oskooi (Collaborator) commented Jun 24, 2020

Putting an all_wait(); right before sum_to_all() and timing each function call separately (with the output displayed using master_printf) reveals that it is the all_wait() that is taking up most of the time:

set_maxwell_dielectric1:, 0.664365 s
all_wait:, 1059.81 s
sum_to_all:, 0.004601 s
set_maxwell_dielectric2:, 0.000252962 s
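For reference, a sketch of the instrumentation producing this output (the sum_to_all arguments and the second set_maxwell_dielectric call are placeholders, since the PR's exact code is not reproduced here):

double t0 = wall_time();
set_maxwell_dielectric(mdata, mesh_size, R, G, meep_mpb_eps, NULL, &eps_data);
master_printf("set_maxwell_dielectric1:, %g s\n", wall_time() - t0);

t0 = wall_time();
all_wait();  // barrier: measures how long the slowest process takes to arrive here
master_printf("all_wait:, %g s\n", wall_time() - t0);

t0 = wall_time();
sum_to_all(local_eps, global_eps, n);  // placeholder arguments
master_printf("sum_to_all:, %g s\n", wall_time() - t0);

t0 = wall_time();
set_maxwell_dielectric(/* ...second pass over the summed cache, args elided... */);
master_printf("set_maxwell_dielectric2:, %g s\n", wall_time() - t0);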

To investigate whether one or more of the chunks is causing the delay, each of the 14 chunks (ranks 0-13) outputs its wall time for all_wait() separately via printf:

all_wait:, 0 (rank), 1060.05 s
all_wait:, 1 (rank), 564.322 s
all_wait:, 2 (rank), 533.729 s
all_wait:, 3 (rank), 495.724 s
all_wait:, 4 (rank), 942.212 s
all_wait:, 5 (rank), 29.6588 s
all_wait:, 6 (rank), 735.957 s
all_wait:, 7 (rank), 552.688 s
all_wait:, 8 (rank), 493.958 s
all_wait:, 9 (rank), 1026.89 s
all_wait:, 10 (rank), 0.000249147 s
all_wait:, 11 (rank), 710.75 s
all_wait:, 12 (rank), 589.994 s
all_wait:, 13 (rank), 953.669 s

These results indicate that while one chunk (rank 0) shows the longest delay, several other chunks (ranks 4, 9, 13) have comparable times.
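(The per-rank probe is a small variation on the snippet above: plain printf instead of master_printf so that every process reports, with meep's my_rank() identifying the chunk:)

double t0 = wall_time();
all_wait();
printf("all_wait:, %d (rank), %g s\n", my_rank(), wall_time() - t0);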

@stevengj (Collaborator, Author) commented:

Try putting an all_wait() at the beginning of get_eigenmode as well, to check whether the synchronization delay originates in this function or somewhere else.
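(A sketch of the suggested probe; the placement at the top of fields::get_eigenmode in src/mpb.cpp is hypothetical:)

double t0 = wall_time();
all_wait();  // if this barrier absorbs the ~1000 s, the imbalance arises before get_eigenmode
printf("get_eigenmode entry wait:, %d (rank), %g s\n", my_rank(), wall_time() - t0);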

Meanwhile, I'm going to merge this anyway, since it should scale better to do things this way, and tests pass.

stevengj merged commit fb58a86 into master on Jun 26, 2020
bencbartlett pushed a commit to bencbartlett/meep referencing this pull request on Sep 9, 2021, with the following commit messages:

* cache epsilon computations for MPB to improve MPI scaling

* whoops

* tweak

* fixes

* add missing sizeof

* tell MPB not to do its own subpixel averaging

* assert.h