
loop tiling for step_update_edhb #1733

Merged (6 commits) Sep 16, 2021
Conversation

@oskooi (Collaborator) commented Aug 16, 2021

Initial attempt to add loop tiling to STEP_UPDATE_EDHB via fields_chunk::update_eh, based on the same approach used for STEP_CURL via fields_chunk::step_db in #1655. I ran the same benchmark as #1655 (i.e., identical 3d simulation executed simultaneously on all four single-threaded cores of i7-7700) and looked at just the time spent on update_eh. The time for fields_chunk::update_eh(H_stuff) was negligible and excluded from the results, so only fields_chunk::update_eh(E_stuff) matters. Subpixel smoothing was turned on, which produces anisotropic ε tensors at the discontinuous (lossless) dielectric interfaces. (Loop tiling for fields_chunk::step_db was disabled, though this should not affect the timing results for update_eh.)
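For context, loop tiling (blocking) splits one long sweep over the grid into smaller chunks so that the data touched by consecutive update passes stays in cache. A minimal 1-D sketch of the idea (hypothetical names and updates, not Meep's actual implementation):

```python
def step_tiled(f, g, tile_size):
    # Process the arrays tile-by-tile so each tile of f stays in cache
    # between the two update passes (temporal locality), instead of
    # making two full sweeps over arrays that may exceed cache size.
    n = len(f)
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        for i in range(start, end):   # first pass (stand-in for step_db)
            f[i] = f[i] + g[i]
        for i in range(start, end):   # second pass (stand-in for update_eh)
            g[i] = 0.5 * f[i]
    return f, g
```

Because the second pass here reads only the point updated by the first, the tiled order produces the same result as two full sweeps; the only change is memory-access order.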

Somewhat unexpectedly, the results show that increasing the number of tiles produces slower performance relative to the no-tiling case.

oled_subpix_updateE3_benchmark_procs4

@stevengj (Collaborator) commented:

Maybe try a case that's more of a cubic unit cell, and try chopping along the longest axis.
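The "chop along the longest axis" heuristic can be sketched as follows (illustrative only; Meep's actual tile-splitting logic lives in the C++ grid_volume code):

```python
def split_longest_axis(dims):
    # Pick the axis with the most grid points and split the box in half
    # along it, keeping the resulting sub-boxes as close to cubical as
    # possible (better surface-to-volume ratio for cache reuse).
    axis = max(range(len(dims)), key=lambda a: dims[a])
    half = dims[axis] // 2
    lo, hi = list(dims), list(dims)
    lo[axis] = half
    hi[axis] = dims[axis] - half
    return tuple(lo), tuple(hi)
```

Applying this recursively until each sub-box drops below a target size yields the tiles.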

@codecov-commenter commented Aug 29, 2021

Codecov Report

Merging #1733 (57b4359) into master (83d637c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1733   +/-   ##
=======================================
  Coverage   74.37%   74.37%           
=======================================
  Files          13       13           
  Lines        4573     4574    +1     
=======================================
+ Hits         3401     3402    +1     
  Misses       1172     1172           
Impacted Files Coverage Δ
python/simulation.py 76.57% <100.00%> (+0.01%) ⬆️

@oskooi (Collaborator, Author) commented Aug 29, 2021

I ran the benchmarking test for a cubic cell on AWS using a single c5.2xlarge instance (output of lscpu is below) with hyperthreading disabled (four single-threaded cores) using commit e14a6f9 of this branch. As before, I ran the same test simultaneously on all four threads in order to saturate the memory bandwidth. I also varied the resolution in order to investigate the effect of the total number of pixels (or memory) on the results.

The results are shown below and are similar to those obtained using my desktop i7-7700k machine. The time spent on fields_chunk::update_eh(E_stuff) increases (nearly monotonically) with the number of tiles. For comparison, the time spent on fields_chunk::step_db(D_stuff) in the same run decreases with the number of tiles until reaching a minimum, similar to what was demonstrated in #1655. Thus, it seems there is still no benefit to using loop tiling for fields_chunk::update_eh.

$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               8
On-line CPU(s) list:  0-3
Off-line CPU(s) list: 4-7
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                85
Model name:           Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:             7
CPU MHz:              3615.747
BogoMIPS:             6000.00
Hypervisor vendor:    KVM
Virtualization type:  full
L1d cache:            32K
L1i cache:            32K
L2 cache:             1024K
L3 cache:             36608K
NUMA node0 CPU(s):    0-3

1. update_eh
oled_benchmark_procs4_cube_res140_updateEH

processor 1: relative improvement of  0.00% using 1 tiles
processor 2: relative improvement of  0.00% using 1 tiles
processor 3: relative improvement of  0.17% using 3520 tiles
processor 4: relative improvement of  2.04% using 3520 tiles

2. step_db
oled_benchmark_procs4_cube_res140_stepDB

processor 1: relative improvement of 20.95% using 14080 tiles
processor 2: relative improvement of 19.97% using 14080 tiles
processor 3: relative improvement of 21.66% using 28160 tiles
processor 4: relative improvement of 22.37% using 14080 tiles

@stevengj (Collaborator) commented Sep 1, 2021

I think this is because your ε tensor is diagonal in this test problem, in which case there is no temporal locality to exploit by tiling — you update Ex from Dx, Ey from Dy, etcetera, with each D component value being used exactly once.

To get a benefit, I think you need anisotropic non-diagonal ε. This will happen if you have interfaces along non-cartesian directions, or you can simply add some anisotropic material with an arbitrarily oriented ε tensor.
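The locality argument can be made concrete by counting how many times each D component is read when forming E_i = Σ_j χ₁⁻¹[i][j]·D_j (a toy count, not Meep code): with a diagonal tensor each D component is read exactly once, so there is nothing to gain from revisiting recently loaded values, while a full tensor reads each component three times.

```python
def d_component_reads(chi1inv):
    # Count how many times each D component D_j is read when computing
    # E_i = sum_j chi1inv[i][j] * D_j; zero entries are never read.
    n = len(chi1inv)
    reads = [0] * n
    for row in chi1inv:
        for j, entry in enumerate(row):
            if entry != 0:
                reads[j] += 1
    return reads

diagonal = [[2.0, 0, 0], [0, 2.0, 0], [0, 0, 2.0]]   # reads: [1, 1, 1]
full = [[2.0, 0.1, 0.2], [0.1, 2.0, 0.3], [0.2, 0.3, 2.0]]  # reads: [3, 3, 3]
```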

dmp[dc_2][cmp] ? s->chi1inv[ec][d_2] : NULL, s_ec, s_1, s_2, s->chi2[ec],
s->chi3[ec], f_w[ec][cmp], dsigw, s->sig[dsigw], s->kap[dsigw]);
if (f[ec][cmp] != f[dc][cmp]) {
for (const auto& sub_gv : gvs) {
Collaborator review comment on the code above:

  1. This loop should be outside the FOR_FT_COMPONENTS loop (just make sure to do the backup copies only once).
  2. If the material has a diagonal s->chi1inv (i.e. the off-diagonal pointers are NULL for this ft), then you don't want to tile (because there is no locality as explained below) — in that case you want to just set gvs to a single tile with the whole gv.
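These two suggestions can be sketched in Python with hypothetical stand-ins for the C++ gvs/FOR_FT_COMPONENTS structures (a sketch of the iteration order only, not Meep's actual loop bodies):

```python
def plan_updates(gv, gvs, components, has_offdiagonal):
    # Suggestion 2: with a diagonal chi1inv there is no locality to
    # exploit, so use a single "tile" covering the whole grid volume gv.
    tiles = gvs if has_offdiagonal else [gv]
    # Suggestion 1: the tile loop is outermost, the component loop inside.
    return [(tile, c) for tile in tiles for c in components]
```

For example, with two tiles and off-diagonal χ₁⁻¹, all components are updated on tile 1 before moving to tile 2; with a diagonal tensor, every component is swept over the whole volume once.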

@oskooi (Collaborator, Author) commented Sep 2, 2021

With the changes introduced in 150e130 and a new benchmarking test (a cubic cell containing a dielectric sphere, with subpixel smoothing turned on to produce anisotropic ε tensors at boundary voxels), loop tiling for fields_chunk::update_eh(E_stuff) produces a speedup of ~12% in the optimal tiling configuration relative to no tiling. In the same test, the speedup for fields_chunk::step_db(D_stuff) was ~20% for the tiling configuration that maximized the fields_chunk::update_eh(E_stuff) speedup.

The benchmarking tests were run on AWS EC2 using the same set up as described previously.

1. update_eh

sphere_benchmark_procs4_res80_updateEH

processor 1: relative improvement of 10.54% using 240 tiles
processor 2: relative improvement of 10.77% using 3840 tiles
processor 3: relative improvement of 10.90% using 3840 tiles
processor 4: relative improvement of 12.08% using 3840 tiles

2. step_db

sphere_benchmark_procs4_res80_stepDB

processor 1: relative improvement of 17.21% using 3840 tiles
processor 2: relative improvement of 17.78% using 960 tiles
processor 3: relative improvement of 19.82% using 3840 tiles
processor 4: relative improvement of 19.15% using 3840 tiles

test

import meep as mp

resolution = 80

s = 3.0
cell_size = mp.Vector3(s,s,s)

fcen = 1.0
sources = [mp.Source(mp.ContinuousSource(fcen),
                     center=mp.Vector3(-0.5*s),
                     size=mp.Vector3(0,s,s),
                     component=mp.Ez)]

n_sphere = 2.0
rad = 1.0
geometry = [mp.Sphere(material=mp.Medium(index=n_sphere),
                      center=mp.Vector3(),
                      radius=rad)]

nprocs = mp.count_processors()
nproc = mp.divide_parallel_processes(mp.count_processors())

ltbs = [300,700,1600,3000,5000,25000,80000,0]

for ltb in ltbs:
    sim = mp.Simulation(resolution=resolution,
                        cell_size=cell_size,
                        sources=sources,
                        k_point=mp.Vector3(),
                        geometry=geometry,
                        loop_tile_base=ltb)

    sim.init_sim()

    for r in range(1,6):
        sim.fields.step()
        sim.fields.reset_timers()

        for _ in range(5):
            sim.fields.step()

        sim.output_times('timings_res{}_tb{}_np{}_proc{}_run{}.csv'.format(resolution,ltb,nprocs,nproc,r))

        sim.restart_fields()

    sim.reset_meep()

@oskooi (Collaborator, Author) commented Sep 3, 2021

Additional results for a different test case involving anisotropic ε materials also show a similar speedup of ~12% for fields_chunk::update_eh(E_stuff) relative to no tiling. It's interesting that the maximum speedup of ~12% for fields_chunk::update_eh(E_stuff) and ~20% for fields_chunk::step_db(D_stuff) for the optimal tiling is consistent across different tests and grid resolutions.

Note from the results above and below that, in the no-tiling case, the time spent per timestep on fields_chunk::update_eh(E_stuff) is more than 2X larger than on fields_chunk::step_db(D_stuff). This means that in absolute terms, the time savings due to loop tiling are actually larger for fields_chunk::update_eh(E_stuff) than for fields_chunk::step_db(D_stuff).
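To make the absolute-savings point concrete (illustrative numbers consistent with the ratios quoted in this thread, not measured data): if update_eh costs 2X as much per timestep as step_db, then a 12% relative speedup of update_eh saves more absolute time than a 20% relative speedup of step_db.

```python
# Illustrative per-timestep costs in arbitrary units, not measured data:
t_step_db = 1.0                  # cost of step_db(D_stuff)
t_update_eh = 2.0 * t_step_db    # update_eh(E_stuff) is >2X more expensive

abs_saving_update_eh = 0.12 * t_update_eh   # ~12% relative speedup -> 0.24
abs_saving_step_db = 0.20 * t_step_db       # ~20% relative speedup -> 0.20
# The smaller relative speedup yields the larger absolute saving.
```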

I think these results finally demonstrate that loop tiling can be used to speed up update_eh.

1. update_eh
aniso_sphere_benchmark_procs4_res110_updateEH

processor 1: relative improvement of 12.66% using 5280 tiles
processor 2: relative improvement of 11.63% using 5280 tiles
processor 3: relative improvement of 10.62% using 660 tiles
processor 4: relative improvement of 11.96% using 10560 tiles

2. step_db
aniso_sphere_benchmark_procs4_res110_stepDB

processor 1: relative improvement of 20.24% using 2640 tiles
processor 2: relative improvement of 21.39% using 21120 tiles
processor 3: relative improvement of 19.10% using 2640 tiles
processor 4: relative improvement of 21.97% using 10560 tiles

test

import meep as mp

resolution = 110

s = 3.0
cell_size = mp.Vector3(s,s,s)

fcen = 1.0
sources = [mp.Source(mp.ContinuousSource(fcen),
                     center=mp.Vector3(-0.5*s),
                     size=mp.Vector3(0,s,s),
                     component=mp.Ez)]

aniso_mat1 = mp.Medium(epsilon_diag=mp.Vector3(8.09946,10.12833,6.34247),
                       epsilon_offdiag=mp.Vector3(0.56199,2.84232,5.05261))

aniso_mat2 = mp.Medium(epsilon_diag=mp.Vector3(14.74,23.98,22.64),
                       epsilon_offdiag=mp.Vector3(-11.54,0,-10.69))

rad = 1.0
geometry = [mp.Sphere(material=aniso_mat2,
                      center=mp.Vector3(),
                      radius=rad)]

nprocs = mp.count_processors()
nproc = mp.divide_parallel_processes(mp.count_processors())

ltbs = [300,700,1600,3000,5000,12000,25000,40000,80000,0]

for ltb in ltbs:
    sim = mp.Simulation(resolution=resolution,
                        cell_size=cell_size,
                        sources=sources,
                        k_point=mp.Vector3(),
                        geometry=geometry,
                        loop_tile_base=ltb,
                        default_material=aniso_mat1,
                        eps_averaging=True)

    sim.init_sim()

    for r in range(1,6):
        sim.fields.step()
        sim.fields.reset_timers()

        for _ in range(5):
            sim.fields.step()

        sim.output_times('timings_res{}_tb{}_np{}_proc{}_run{}.csv'.format(resolution,ltb,nprocs,nproc,r))

        sim.restart_fields()

    sim.reset_meep()


Resolved (outdated) review threads on src/update_eh.cpp and src/step_db.cpp.
@oskooi oskooi changed the title WIP: loop tiling for step_update_edhb loop tiling for step_update_edhb Sep 15, 2021
@oskooi (Collaborator, Author) commented Sep 16, 2021

I ran benchmarks on AWS EC2 with the latest commit (57b4359) and there is again a consistent speedup across all four single-threaded processors of ~12% for update_e_from_d and ~20% for step_db, though the tile configurations are different. (Note that tiling is turned off for update_h_from_b because there are no anisotropic μ materials, which also means H = B and therefore no update is actually performed.)

I think this feature is now ready to be merged.

1. update_e_from_d

sphere_subpix_benchmark_procs4_res110_updateEH

processor 1: relative improvement of 12.41% using 1320 tiles
processor 2: relative improvement of 12.86% using 1320 tiles
processor 3: relative improvement of 12.09% using 1320 tiles
processor 4: relative improvement of 12.57% using 1320 tiles

2. step_db
sphere_subpix_benchmark_procs4_res110_stepDB

processor 1: relative improvement of 22.38% using 10560 tiles
processor 2: relative improvement of 20.34% using 2640 tiles
processor 3: relative improvement of 20.92% using 10560 tiles
processor 4: relative improvement of 22.97% using 10560 tiles

@stevengj stevengj merged commit 3f0ddec into NanoComp:master Sep 16, 2021
@oskooi oskooi deleted the loop_tile_updateEH branch September 16, 2021 17:21
mawc2019 pushed a commit to mawc2019/meep that referenced this pull request Nov 3, 2021
* loop tiling for STEP_UPDATE_EDHB

* reorganization to prevent copying the PML auxiliary fields for every tile

* move loop over tiles outside of loop over components and skip tiling if material has diagonal s->chi1inv

* add two separate subdomains for step_db and update_eh

* always tile step_db but only tile update_eh for anisotropic materials

* fixes and documentation