
loop tiling for step_update_edhb #1733

Merged (6 commits) Sep 16, 2021
Conversation

@oskooi (Collaborator) commented Aug 16, 2021

Initial attempt to add loop tiling to STEP_UPDATE_EDHB via fields_chunk::update_eh, based on the same approach used for STEP_CURL via fields_chunk::step_db in #1655. I ran the same benchmark as #1655 (i.e., identical 3d simulation executed simultaneously on all four single-threaded cores of i7-7700) and looked at just the time spent on update_eh. The time for fields_chunk::update_eh(H_stuff) was negligible and excluded from the results, so only fields_chunk::update_eh(E_stuff) matters. Subpixel smoothing was turned on, which produces anisotropic ε tensors at the discontinuous (lossless) dielectric interfaces. (Loop tiling for fields_chunk::step_db was disabled, though this should not affect the timing results for update_eh.)
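For context, loop tiling (blocking) splits one long sweep over the grid into smaller chunks so that the data touched by consecutive update passes stays in cache. A minimal 1-D sketch of the idea (hypothetical names and updates, not Meep's actual implementation):

```python
def step_tiled(f, g, tile_size):
    # Process the arrays tile-by-tile so each tile of f stays in cache
    # between the two update passes (temporal locality), instead of
    # making two full sweeps over arrays that may exceed cache size.
    n = len(f)
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        for i in range(start, end):   # first pass (stand-in for step_db)
            f[i] = f[i] + g[i]
        for i in range(start, end):   # second pass (stand-in for update_eh)
            g[i] = 0.5 * f[i]
    return f, g
```

Because the second pass here reads only the point updated by the first, the tiled order produces the same result as two full sweeps; the only change is memory-access order.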

Somewhat unexpectedly, the results show that increasing the number of tiles produces slower performance relative to the no-tiling case.

oled_subpix_updateE3_benchmark_procs4

@stevengj (Collaborator) commented:

Maybe try a case that's more of a cubic unit cell, and try chopping along the longest axis.
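The "chop along the longest axis" heuristic can be sketched as follows (illustrative only; Meep's actual tile-splitting logic lives in the C++ grid_volume code):

```python
def split_longest_axis(dims):
    # Pick the axis with the most grid points and split the box in half
    # along it, keeping the resulting sub-boxes as close to cubical as
    # possible (better surface-to-volume ratio for cache reuse).
    axis = max(range(len(dims)), key=lambda a: dims[a])
    half = dims[axis] // 2
    lo, hi = list(dims), list(dims)
    lo[axis] = half
    hi[axis] = dims[axis] - half
    return tuple(lo), tuple(hi)
```

Applying this recursively until each sub-box drops below a target size yields the tiles.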

@codecov-commenter commented Aug 29, 2021

Codecov Report

Merging #1733 (57b4359) into master (83d637c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1733   +/-   ##
=======================================
  Coverage   74.37%   74.37%           
=======================================
  Files          13       13           
  Lines        4573     4574    +1     
=======================================
+ Hits         3401     3402    +1     
  Misses       1172     1172           
Impacted Files Coverage Δ
python/simulation.py 76.57% <100.00%> (+0.01%) ⬆️

@oskooi (Collaborator, Author) commented Aug 29, 2021

I ran the benchmarking test for a cubic cell on AWS using a single c5.2xlarge instance (output of lscpu is below) with hyperthreading disabled (four single-threaded cores) using commit e14a6f9 of this branch. As before, I ran the same test simultaneously on all four threads in order to saturate the memory bandwidth. I also varied the resolution in order to investigate the effect of the total number of pixels (or memory) on the results.

The results are shown below and are similar to those obtained using my desktop i7-7700k machine. The time spent on fields_chunk::update_eh(E_stuff) increases (nearly monotonically) with the number of tiles. For comparison, the time spent on fields_chunk::step_db(D_stuff) in the same run decreases with the number of tiles until reaching a minimum, similar to what was demonstrated in #1655. Thus, it seems there is still no benefit to using loop tiling for fields_chunk::update_eh.

$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               8
On-line CPU(s) list:  0-3
Off-line CPU(s) list: 4-7
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                85
Model name:           Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:             7
CPU MHz:              3615.747
BogoMIPS:             6000.00
Hypervisor vendor:    KVM
Virtualization type:  full
L1d cache:            32K
L1i cache:            32K
L2 cache:             1024K
L3 cache:             36608K
NUMA node0 CPU(s):    0-3

1. update_eh
oled_benchmark_procs4_cube_res140_updateEH

processor 1: relative improvement of  0.00% using 1 tiles
processor 2: relative improvement of  0.00% using 1 tiles
processor 3: relative improvement of  0.17% using 3520 tiles
processor 4: relative improvement of  2.04% using 3520 tiles

2. step_db
oled_benchmark_procs4_cube_res140_stepDB

processor 1: relative improvement of 20.95% using 14080 tiles
processor 2: relative improvement of 19.97% using 14080 tiles
processor 3: relative improvement of 21.66% using 28160 tiles
processor 4: relative improvement of 22.37% using 14080 tiles

@stevengj (Collaborator) commented Sep 1, 2021

I think this is because your ε tensor is diagonal in this test problem, in which case there is no temporal locality to exploit by tiling — you update Ex from Dx, Ey from Dy, etcetera, with each D component value being used exactly once.

To get a benefit, I think you need anisotropic non-diagonal ε. This will happen if you have interfaces along non-cartesian directions, or you can simply add some anisotropic material with an arbitrarily oriented ε tensor.
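The locality argument can be made concrete by counting how many times each D component is read when forming E_i = Σ_j χ₁⁻¹[i][j]·D_j (a toy count, not Meep code): with a diagonal tensor each D component is read exactly once, so there is nothing to gain from revisiting recently loaded values, while a full tensor reads each component three times.

```python
def d_component_reads(chi1inv):
    # Count how many times each D component D_j is read when computing
    # E_i = sum_j chi1inv[i][j] * D_j; zero entries are never read.
    n = len(chi1inv)
    reads = [0] * n
    for row in chi1inv:
        for j, entry in enumerate(row):
            if entry != 0:
                reads[j] += 1
    return reads

diagonal = [[2.0, 0, 0], [0, 2.0, 0], [0, 0, 2.0]]   # reads: [1, 1, 1]
full = [[2.0, 0.1, 0.2], [0.1, 2.0, 0.3], [0.2, 0.3, 2.0]]  # reads: [3, 3, 3]
```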

dmp[dc_2][cmp] ? s->chi1inv[ec][d_2] : NULL, s_ec, s_1, s_2, s->chi2[ec],
s->chi3[ec], f_w[ec][cmp], dsigw, s->sig[dsigw], s->kap[dsigw]);
if (f[ec][cmp] != f[dc][cmp]) {
for (const auto& sub_gv : gvs) {
Collaborator review comment on the code above:

  1. This loop should be outside the FOR_FT_COMPONENTS loop (just make sure to do the backup copies only once).
  2. If the material has a diagonal s->chi1inv (i.e. the off-diagonal pointers are NULL for this ft), then you don't want to tile (because there is no locality as explained below) — in that case you want to just set gvs to a single tile with the whole gv.
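These two suggestions can be sketched in Python with hypothetical stand-ins for the C++ gvs/FOR_FT_COMPONENTS structures (a sketch of the iteration order only, not Meep's actual loop bodies):

```python
def plan_updates(gv, gvs, components, has_offdiagonal):
    # Suggestion 2: with a diagonal chi1inv there is no locality to
    # exploit, so use a single "tile" covering the whole grid volume gv.
    tiles = gvs if has_offdiagonal else [gv]
    # Suggestion 1: the tile loop is outermost, the component loop inside.
    return [(tile, c) for tile in tiles for c in components]
```

For example, with two tiles and off-diagonal χ₁⁻¹, all components are updated on tile 1 before moving to tile 2; with a diagonal tensor, every component is swept over the whole volume once.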

@oskooi (Collaborator, Author) commented Sep 2, 2021

With the changes introduced in 150e130 and a new benchmarking test (a cubic cell containing a dielectric sphere, with subpixel smoothing turned on to produce anisotropic ε tensors at boundary voxels), loop tiling for fields_chunk::update_eh(E_stuff) produces a speedup of ~12% in the optimal tiling configuration relative to no tiling. In the same test, the speedup for fields_chunk::step_db(D_stuff) was ~20% for the tiling configuration that maximized the fields_chunk::update_eh(E_stuff) speedup.

The benchmarking tests were run on AWS EC2 using the same set up as described previously.

1. update_eh

sphere_benchmark_procs4_res80_updateEH

processor 1: relative improvement of 10.54% using 240 tiles
processor 2: relative improvement of 10.77% using 3840 tiles
processor 3: relative improvement of 10.90% using 3840 tiles
processor 4: relative improvement of 12.08% using 3840 tiles

2. step_db

sphere_benchmark_procs4_res80_stepDB

processor 1: relative improvement of 17.21% using 3840 tiles
processor 2: relative improvement of 17.78% using 960 tiles
processor 3: relative improvement of 19.82% using 3840 tiles
processor 4: relative improvement of 19.15% using 3840 tiles

test

import meep as mp

resolution = 80

s = 3.0
cell_size = mp.Vector3(s,s,s)

fcen = 1.0
sources = [mp.Source(mp.ContinuousSource(fcen),
                     center=mp.Vector3(-0.5*s),
                     size=mp.Vector3(0,s,s),
                     component=mp.Ez)]

n_sphere = 2.0
rad = 1.0
geometry = [mp.Sphere(material=mp.Medium(index=n_sphere),
                      center=mp.Vector3(),
                      radius=rad)]

nprocs = mp.count_processors()
nproc = mp.divide_parallel_processes(mp.count_processors())

ltbs = [300,700,1600,3000,5000,25000,80000,0]

for ltb in ltbs:
    sim = mp.Simulation(resolution=resolution,
                        cell_size=cell_size,
                        sources=sources,
                        k_point=mp.Vector3(),
                        geometry=geometry,
                        loop_tile_base=ltb)

    sim.init_sim()

    for r in range(1,6):
        sim.fields.step()
        sim.fields.reset_timers()

        for _ in range(5):
            sim.fields.step()

        sim.output_times('timings_res{}_tb{}_np{}_proc{}_run{}.csv'.format(resolution,ltb,nprocs,nproc,r))

        sim.restart_fields()

    sim.reset_meep()

@oskooi (Collaborator, Author) commented Sep 3, 2021

Additional results for a different test case involving anisotropic ε materials also show a similar speedup of ~12% for fields_chunk::update_eh(E_stuff) relative to no tiling. It's interesting that the maximum speedup of ~12% for fields_chunk::update_eh(E_stuff) and ~20% for fields_chunk::step_db(D_stuff) for the optimal tiling is consistent across different tests and grid resolutions.

Note from the results above and below that, in the no-tiling case, the time spent per timestep on fields_chunk::update_eh(E_stuff) is more than 2X larger than on fields_chunk::step_db(D_stuff). This means that in absolute terms, the time savings due to loop tiling are actually larger for fields_chunk::update_eh(E_stuff) than for fields_chunk::step_db(D_stuff).
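To make the absolute-savings point concrete (illustrative numbers consistent with the ratios quoted in this thread, not measured data): if update_eh costs 2X as much per timestep as step_db, then a 12% relative speedup of update_eh saves more absolute time than a 20% relative speedup of step_db.

```python
# Illustrative per-timestep costs in arbitrary units, not measured data:
t_step_db = 1.0                  # cost of step_db(D_stuff)
t_update_eh = 2.0 * t_step_db    # update_eh(E_stuff) is >2X more expensive

abs_saving_update_eh = 0.12 * t_update_eh   # ~12% relative speedup -> 0.24
abs_saving_step_db = 0.20 * t_step_db       # ~20% relative speedup -> 0.20
# The smaller relative speedup yields the larger absolute saving.
```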

I think these results finally demonstrate that loop tiling can be used to speed up update_eh.

1. update_eh
aniso_sphere_benchmark_procs4_res110_updateEH

processor 1: relative improvement of 12.66% using 5280 tiles
processor 2: relative improvement of 11.63% using 5280 tiles
processor 3: relative improvement of 10.62% using 660 tiles
processor 4: relative improvement of 11.96% using 10560 tiles

2. step_db
aniso_sphere_benchmark_procs4_res110_stepDB

processor 1: relative improvement of 20.24% using 2640 tiles
processor 2: relative improvement of 21.39% using 21120 tiles
processor 3: relative improvement of 19.10% using 2640 tiles
processor 4: relative improvement of 21.97% using 10560 tiles

test

import meep as mp

resolution = 110

s = 3.0
cell_size = mp.Vector3(s,s,s)

fcen = 1.0
sources = [mp.Source(mp.ContinuousSource(fcen),
                     center=mp.Vector3(-0.5*s),
                     size=mp.Vector3(0,s,s),
                     component=mp.Ez)]

aniso_mat1 = mp.Medium(epsilon_diag=mp.Vector3(8.09946,10.12833,6.34247),
                       epsilon_offdiag=mp.Vector3(0.56199,2.84232,5.05261))

aniso_mat2 = mp.Medium(epsilon_diag=mp.Vector3(14.74,23.98,22.64),
                       epsilon_offdiag=mp.Vector3(-11.54,0,-10.69))

rad = 1.0
geometry = [mp.Sphere(material=aniso_mat2,
                      center=mp.Vector3(),
                      radius=rad)]

nprocs = mp.count_processors()
nproc = mp.divide_parallel_processes(mp.count_processors())

ltbs = [300,700,1600,3000,5000,12000,25000,40000,80000,0]

for ltb in ltbs:
    sim = mp.Simulation(resolution=resolution,
                        cell_size=cell_size,
                        sources=sources,
                        k_point=mp.Vector3(),
                        geometry=geometry,
                        loop_tile_base=ltb,
                        default_material=aniso_mat1,
                        eps_averaging=True)

    sim.init_sim()

    for r in range(1,6):
        sim.fields.step()
        sim.fields.reset_timers()

        for _ in range(5):
            sim.fields.step()

        sim.output_times('timings_res{}_tb{}_np{}_proc{}_run{}.csv'.format(resolution,ltb,nprocs,nproc,r))

        sim.restart_fields()

    sim.reset_meep()


Resolved (outdated) review threads on src/update_eh.cpp and src/step_db.cpp.
@oskooi oskooi changed the title WIP: loop tiling for step_update_edhb loop tiling for step_update_edhb Sep 15, 2021
@oskooi (Collaborator, Author) commented Sep 16, 2021

I ran benchmarks on AWS EC2 with the latest commit (57b4359) and there is again a consistent speedup across all four single-threaded processors of ~12% for update_e_from_d and ~20% for step_db, though the tile configurations are different. (Note that tiling is turned off for update_h_from_b because there are no anisotropic μ materials, which also means H = B and therefore no update is actually performed.)

I think this feature is now ready to be merged.

1. update_e_from_d

sphere_subpix_benchmark_procs4_res110_updateEH

processor 1: relative improvement of 12.41% using 1320 tiles
processor 2: relative improvement of 12.86% using 1320 tiles
processor 3: relative improvement of 12.09% using 1320 tiles
processor 4: relative improvement of 12.57% using 1320 tiles

2. step_db
sphere_subpix_benchmark_procs4_res110_stepDB

processor 1: relative improvement of 22.38% using 10560 tiles
processor 2: relative improvement of 20.34% using 2640 tiles
processor 3: relative improvement of 20.92% using 10560 tiles
processor 4: relative improvement of 22.97% using 10560 tiles

@stevengj stevengj merged commit 3f0ddec into NanoComp:master Sep 16, 2021
@oskooi oskooi deleted the loop_tile_updateEH branch September 16, 2021 17:21
mawc2019 pushed a commit to mawc2019/meep that referenced this pull request Nov 3, 2021
* loop tiling for STEP_UPDATE_EDHB

* reorganization to prevent copying the PML auxiliary fields for every tile

* move loop over tiles outside of loop over components and skip tiling if material has diagonal s->chi1inv

* add two separate subdomains for step_db and update_eh

* always tile step_db but only tile update_eh for anisotropic materials

* fixes and documentation