loop tiling for step_update_edhb #1733
Conversation
Maybe try a case that's more of a cubic unit cell, and try chopping along the longest axis.
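The chopping strategy suggested here can be illustrated with a small sketch: recursively bisect the index box along its longest axis until each tile holds at most some base number of points, which is roughly what the `loop_tile_base` parameter in the test script later in this thread controls. This is illustrative only; the actual tiling is done in the C++ code and may differ, and `split_into_tiles` and its arguments are hypothetical names.

```python
# Illustrative sketch only (not the actual meep C++ implementation): recursively
# bisect a 3-D box of grid indices along its longest axis until each tile holds
# at most `base` points, roughly the role played by loop_tile_base.
def split_into_tiles(box, base):
    """box is ((x0, x1), (y0, y1), (z0, z1)), a half-open index range per axis."""
    sizes = [hi - lo for (lo, hi) in box]
    if sizes[0] * sizes[1] * sizes[2] <= base or max(sizes) < 2:
        return [box]
    axis = sizes.index(max(sizes))   # chop along the longest axis
    lo, hi = box[axis]
    mid = (lo + hi) // 2
    left, right = list(box), list(box)
    left[axis], right[axis] = (lo, mid), (mid, hi)
    return split_into_tiles(tuple(left), base) + split_into_tiles(tuple(right), base)

# Example: a 330^3 grid (resolution 110 on a 3x3x3 cell) with base = 25000
tiles = split_into_tiles(((0, 330), (0, 330), (0, 330)), 25000)
print(len(tiles), "tiles")
```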
Codecov Report
@@            Coverage Diff            @@
##           master    #1733     +/-   ##
=========================================
  Coverage   74.37%   74.37%
=========================================
  Files          13       13
  Lines        4573     4574      +1
=========================================
+ Hits         3401     3402      +1
  Misses       1172     1172
I ran the benchmarking test for a cubic cell on AWS using a single c5.2xlarge instance (output of […]). The results are shown below and are similar to those obtained using my desktop i7-7700K machine. The time spent on […].
I think this is because your ε tensor is diagonal in this test problem, in which case there is no temporal locality to exploit by tiling: you update Ex from Dx, Ey from Dy, and so on, with each D component value being used exactly once. To get a benefit, I think you need anisotropic non-diagonal ε. This will happen if you have interfaces along non-Cartesian directions, or you can simply add some anisotropic material with an arbitrarily oriented ε tensor.
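One way to set up such a material at the Python level is with the `epsilon_offdiag` parameter of `mp.Medium`. A minimal sketch, reusing the tensor values from the benchmark script later in this thread (any symmetric positive-definite ε would do):

```python
import meep as mp

# Minimal sketch: a medium with a non-diagonal ("arbitrarily oriented") ε tensor.
# The values are copied from the benchmark script later in this thread. With
# nonzero off-diagonal entries, each E component is updated from all three D
# components, which is the temporal locality that tiling can exploit.
aniso_mat = mp.Medium(epsilon_diag=mp.Vector3(8.09946, 10.12833, 6.34247),
                      epsilon_offdiag=mp.Vector3(0.56199, 2.84232, 5.05261))
```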
src/update_eh.cpp
Outdated
dmp[dc_2][cmp] ? s->chi1inv[ec][d_2] : NULL, s_ec, s_1, s_2, s->chi2[ec],
s->chi3[ec], f_w[ec][cmp], dsigw, s->sig[dsigw], s->kap[dsigw]);
if (f[ec][cmp] != f[dc][cmp]) {
  for (const auto& sub_gv : gvs) {
- This loop should be outside the `FOR_FT_COMPONENTS` loop (just make sure to do the backup copies only once).
- If the material has a diagonal `s->chi1inv` (i.e. the off-diagonal pointers are `NULL` for this `ft`), then you don't want to tile (because there is no locality, as explained below); in that case you want to just set `gvs` to a single tile with the whole `gv`. (A schematic sketch of this control flow follows below.)
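Here is a schematic sketch of the control flow described in the two points above, in Python rather than the actual C++ of update_eh.cpp. The function name, the dict-of-arrays layout, and the slice-based tiles are all hypothetical, and the real update also handles chi2/chi3 nonlinearities, PML, and more than one field type:

```python
import numpy as np

def update_e_from_d_sketch(E, D, chi1inv, tiles):
    # If chi1inv (the ε⁻¹ tensor) is diagonal, there is no locality to exploit,
    # so fall back to a single "tile" spanning the whole grid volume.
    if np.allclose(chi1inv, np.diag(np.diag(chi1inv))):
        tiles = [tuple(slice(None) for _ in range(3))]

    # Make the backup copies of D only once, outside the tile loop.
    D_backup = {i: D[i].copy() for i in range(3)}

    for t in tiles:              # loop over tiles is outermost...
        for i in range(3):       # ...then the loop over field components
            E[i][t] = sum(chi1inv[i, j] * D_backup[j][t] for j in range(3))

# Toy usage: an 8x8x8 grid split into two tiles along x.
D = {i: np.random.rand(8, 8, 8) for i in range(3)}
E = {i: np.zeros((8, 8, 8)) for i in range(3)}
chi1inv = np.array([[0.50, 0.10, 0.00],
                    [0.10, 0.40, 0.00],
                    [0.00, 0.00, 0.60]])
tiles = [(slice(0, 4), slice(None), slice(None)),
         (slice(4, 8), slice(None), slice(None))]
update_e_from_d_sketch(E, D, chi1inv, tiles)
```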
With the changes introduced in 150e130 and a new benchmarking test involving a cubic cell with a dielectric sphere using subpixel smoothing (which produces anisotropic ε tensors at boundary voxels), the loop tiling for […]. The benchmarking tests were run on AWS EC2 using the same setup as described previously; a minimal sketch of this geometry is shown after the results.

1. `update_eh`
2. `step_db`
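For context, here is a minimal sketch of the kind of setup this test describes: a cubic cell containing an isotropic dielectric sphere with subpixel smoothing enabled. The resolution, cell size, and ε value below are placeholders, not the values used in the benchmark:

```python
import meep as mp

# Minimal sketch: cubic cell with an isotropic dielectric sphere and subpixel
# smoothing (eps_averaging=True), which produces anisotropic ε tensors only at
# the voxels straddling the sphere boundary. Resolution, cell size, and ε are
# placeholder values.
s = 3.0
sim = mp.Simulation(resolution=60,
                    cell_size=mp.Vector3(s, s, s),
                    geometry=[mp.Sphere(material=mp.Medium(epsilon=12),
                                        center=mp.Vector3(),
                                        radius=1.0)],
                    k_point=mp.Vector3(),
                    eps_averaging=True)
sim.init_sim()
```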
Additional results for a different test case involving anisotropic ε materials also show a similar speedup of ~12% for […]. Note from the results above and below that, for the case of no tiling, the time spent per timestep on […]. I think these results finally demonstrate that loop tiling can be used to speed up […].
test

import meep as mp

resolution = 110

s = 3.0
cell_size = mp.Vector3(s,s,s)

fcen = 1.0
sources = [mp.Source(mp.ContinuousSource(fcen),
                     center=mp.Vector3(-0.5*s),
                     size=mp.Vector3(0,s,s),
                     component=mp.Ez)]

aniso_mat1 = mp.Medium(epsilon_diag=mp.Vector3(8.09946,10.12833,6.34247),
                       epsilon_offdiag=mp.Vector3(0.56199,2.84232,5.05261))

aniso_mat2 = mp.Medium(epsilon_diag=mp.Vector3(14.74,23.98,22.64),
                       epsilon_offdiag=mp.Vector3(-11.54,0,-10.69))

rad = 1.0
geometry = [mp.Sphere(material=aniso_mat2,
                      center=mp.Vector3(),
                      radius=rad)]

nprocs = mp.count_processors()
nproc = mp.divide_parallel_processes(mp.count_processors())

ltbs = [300,700,1600,3000,5000,12000,25000,40000,80000,0]

for ltb in ltbs:
    sim = mp.Simulation(resolution=resolution,
                        cell_size=cell_size,
                        sources=sources,
                        k_point=mp.Vector3(),
                        geometry=geometry,
                        loop_tile_base=ltb,
                        default_material=aniso_mat1,
                        eps_averaging=True)

    sim.init_sim()

    for r in range(1,6):
        sim.fields.step()
        sim.fields.reset_timers()
        for _ in range(5):
            sim.fields.step()
        sim.output_times('timings_res{}_tb{}_np{}_proc{}_run{}.csv'.format(resolution,ltb,nprocs,nproc,r))
        sim.restart_fields()

    sim.reset_meep()
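To compare runs afterwards, something like the following could collect the timing files written by the script above. This is only a sketch: it assumes the files are plain CSV and makes no assumption about their column layout.

```python
import csv
import glob
import re

# Collect the per-run timing files written by the benchmark script above and
# print them grouped by loop_tile_base. The filename pattern mirrors the format
# string used in the script; the CSV contents are printed as-is, since their
# column layout is not assumed here.
pattern = re.compile(r'timings_res(\d+)_tb(\d+)_np(\d+)_proc(\d+)_run(\d+)\.csv')

for fname in sorted(glob.glob('timings_res*_tb*_np*_proc*_run*.csv')):
    m = pattern.fullmatch(fname)
    if m is None:
        continue
    res, ltb, nprocs, proc, run = map(int, m.groups())
    print('--- loop_tile_base={}, process {}, run {} ---'.format(ltb, proc, run))
    with open(fname, newline='') as f:
        for row in csv.reader(f):
            print(', '.join(row))
```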
I ran benchmarks on AWS EC2 with the latest commit (57b4359) and there is again a consistent speedup of ~12% for `update_e_from_d` across all four single-threaded processors. I think this feature is now ready to be merged.
* loop tiling for STEP_UPDATE_EDHB
* reorganization to prevent copying the PML auxiliary fields for every tile
* move loop over tiles outside of loop over components and skip tiling if material has diagonal s->chi1inv
* add two separate subdomains for step_db and update_eh
* always tile step_db but only tile update_eh for anisotropic materials
* fixes and documentation
Initial attempt to add loop tiling to `STEP_UPDATE_EDHB` via `fields_chunk::update_eh`, based on the same approach used for `STEP_CURL` via `fields_chunk::step_db` in #1655.

I ran the same benchmark as #1655 (i.e., an identical 3d simulation executed simultaneously on all four single-threaded cores of an i7-7700) and looked at just the time spent on `update_eh`. The time for `fields_chunk::update_eh(H_stuff)` was negligible and excluded from the results, so only `fields_chunk::update_eh(E_stuff)` matters. Subpixel smoothing was turned on, which produces anisotropic ε tensors at the discontinuous (lossless) dielectric interfaces. (The loop tiling for `fields_chunk::step_db` was disabled, not that it should affect the timing results for `update_eh`.)

Somewhat unexpectedly, the results show that increasing the number of tiles produces slower performance relative to the no-tiling case.