WIP: Speed up choose_chunkdivision with geom_boxt_tree #1030

ChristopherHogan · 2019-09-29T17:02:16Z

No description provided.

stevengj · 2019-10-02T01:34:00Z

src/meepgeom.cpp

+        if (overlap > 0.0) {
+          for (int j = 0; j < geom.num_items; ++j) {
+            if (t->objects[i].o == geom.items + j) {
+              overlaps[j] += overlap;


If this loop becomes a limiting factor, we could do the lookup in another way, e.g. a hash table or an array of indices sorted by pointer address.

stevengj · 2019-10-02T01:34:58Z

src/meepgeom.cpp

+
+    for (int i = 0; i < geom.num_items; ++i) {
+      geometric_object *go = &geom.items[i];
+      double overlap = overlaps[i];


Would be good to check if this roughly matches the overlaps computed for each object with the old code.

stevengj · 2019-10-02T01:38:07Z

src/meepgeom.cpp

+  int temp_periodicity = ensure_periodicity;
+  dimensions = 3;
+  ensure_periodicity = 0;
+  geom_tree = create_geom_box_tree0(*geom_, cell_box);


As a simpler alternative to creating this tree, you could also just cache geom_get_bounding_box for each object, and check whether the bounding box overlaps before calling box_overlap_with_object. This only helps the constant factor, though, unlike a tree which in principle can give better scaling.

stevengj · 2019-10-02T01:41:45Z

src/meepgeom.cpp

+    geom_box_intersection(&intersection, &t->b, &box);
+
+    for (int i = 0; i < t->nobjects; ++i) {
+      if (geom_boxes_intersect(&intersection, &t->objects[i].box)) {


One potential optimization: if this intersection is the whole of t->objects[i].box, i.e. the object lies completely within the intersection, then the overlap is just the volume of the object. We don't currently have an API to get the object volume, however.

stevengj · 2019-10-02T01:43:49Z

(I suspect that the base case in libctl/utils/geom.c:divide_geom_box_tree should be larger leaves for this application.)

ChristopherHogan · 2019-10-02T01:46:25Z

Depends on NanoComp/libctl#43

oskooi · 2019-10-09T01:42:43Z

The following test can be used for benchmarking. This PR does not seem to be providing much of a performance improvement (and in fact does slightly worse) compared to master: 193.894 s vs. 206.908 s.

master

Using MPI version 3.0, 18 processes
-----------
Initializing structure...
Splitting into 18 chunks by cost
time for choose_chunkdivision = 193.894 s
Working in 3D dimensions.
Computational cell is 7.4 x 7.4 x 3.90909 with resolution 55
     block, center = (0,0,0.55)
          size (1e+20,1e+20,2.8)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.1)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.6)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (3.0625,3.0625,3.0625)
     block, center = (0,0,-1.95)
          size (1e+20,1e+20,0.2)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.39414,-3.39414,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.4137,-3.1137,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.39905,-2.79905,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.42459,-2.52459,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.41616,-2.21616,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.44683,-1.94683,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     ...(+ 570 objects not shown)...
time for set_epsilon = 1.71491 s
time for set_conductivity = 0.232645 s
time for set_conductivity = 0.232838 s
time for set_conductivity = 0.236223 s
time for set_conductivity = 0.236205 s
time for set_conductivity = 0.236093 s
time for set_conductivity = 0.235809 s
lorentzian susceptibility: frequency=2.80116, gamma=2.72777
lorentzian susceptibility: frequency=1.45825, gamma=1.08966
lorentzian susceptibility: frequency=1.24532, gamma=0.251645
lorentzian susceptibility: frequency=0.130662, gamma=0.268583
drude susceptibility: frequency=1e-10, gamma=0.0379081
lorentzian susceptibility: frequency=5.48457, gamma=0.513775
lorentzian susceptibility: frequency=0.101049, gamma=0
lorentzian susceptibility: frequency=8.60279, gamma=0
lorentzian susceptibility: frequency=14.619, gamma=0
-----------
estimated costs per process: 5003.61, 5514.81, 5181.63, 5372.34, 5117.11, 5611.91, 5246.98, 5355.01, 5166.24, 5456.63, 5285.86, 5401.04, 5056.53, 5551.81, 5182.95, 5455.5, 5170.85, 5650.81
estimated cost mean = 5321.2, stddev = 193.249

Elapsed run time = 216.5703 s

this PR

Using MPI version 3.0, 18 processes
-----------
Initializing structure...
Splitting into 18 chunks by cost
time for choose_chunkdivision = 206.908 s
Working in 3D dimensions.
Computational cell is 7.4 x 7.4 x 3.90909 with resolution 55
     block, center = (0,0,0.55)
          size (1e+20,1e+20,2.8)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.1)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.6)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (3.0625,3.0625,3.0625)
     block, center = (0,0,-1.95)
          size (1e+20,1e+20,0.2)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.39994,-3.39994,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.40759,-3.10759,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.44913,-2.84913,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.42989,-2.52989,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.40125,-2.20125,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.40707,-1.90707,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     ...(+ 570 objects not shown)...
time for set_epsilon = 2.42739 s
time for set_conductivity = 0.316124 s
time for set_conductivity = 0.315186 s
time for set_conductivity = 0.318315 s
time for set_conductivity = 0.318088 s
time for set_conductivity = 0.318354 s
time for set_conductivity = 0.318157 s
lorentzian susceptibility: frequency=2.80116, gamma=2.72777
lorentzian susceptibility: frequency=1.45825, gamma=1.08966
lorentzian susceptibility: frequency=1.24532, gamma=0.251645
lorentzian susceptibility: frequency=0.130662, gamma=0.268583
drude susceptibility: frequency=1e-10, gamma=0.0379081
lorentzian susceptibility: frequency=5.48457, gamma=0.513775
lorentzian susceptibility: frequency=0.101049, gamma=0
lorentzian susceptibility: frequency=8.60279, gamma=0
lorentzian susceptibility: frequency=14.619, gamma=0
-----------
estimated costs per process: 58253, 70735.6, 37089.1, 42303.1, 37755, 43749.6, 35717, 42687, 22575.7, 23095.7, 25225.6, 24252.5, 34733.5, 39773.8, 23768.8, 21599.5, 20178.6, 20517.4
estimated cost mean = 34667.2, stddev = 13895.2

Elapsed run time = 230.0236 s

from time import time
import meep as mp
from meep.materials import Al, ITO, fused_quartz
import math
import numpy as np
from random import random

lambda_min = 0.4                           # minimum source wavelength                                                                                                                            
lambda_max = 0.8                           # maximum source wavelength                                                                                                                            
fmin = 1/lambda_max                        # minimum source frequency                                                                                                                             
fmax = 1/lambda_min                        # maximum source frequency                                                                                                                             
fcen = 0.5*(fmin+fmax)                     # source frequency center                                                                                                                              
df = fmax-fmin                             # source frequency width                                                                                                                               

resolution = 55
nfreq = 800
nperiods = 10 # 24                                                                                                                                                                                
tABS = 2.2
tPML = 2.1
tGLS = 0.6
tITO = 0.5
tORG = 0.4
tAl = 0.2
ps = 0.3
rs = 0.1
dfrac = 0.2

L = nperiods*ps        # length of OLED                                                                                                                                                           

# length of computational cell along Z                                                                                                                                                            
sz = tPML+tGLS+tITO+tORG+rs+tAl
# length of non-absorbing region of computational cell in X and Y                                                                                                                                 
sxy = L+2*tABS
cell_size = mp.Vector3(sxy,sxy,sz)

boundary_layers = [mp.Absorber(tABS,direction=mp.X),
                   mp.Absorber(tABS,direction=mp.Y),
                   mp.PML(tPML,direction=mp.Z,side=mp.High)]

ORG = mp.Medium(index=1.75)

geometry = [mp.Block(material=fused_quartz,
                     size=mp.Vector3(mp.inf,mp.inf,tABS+tGLS),
                     center=mp.Vector3(z=0.5*sz-0.5*(tABS+tGLS))),
            mp.Block(material=ITO,
                     size=mp.Vector3(mp.inf,mp.inf,tITO),
                     center=mp.Vector3(z=0.5*sz-tABS-tGLS-0.5*tITO)),
            mp.Block(material=ORG,
                     size=mp.Vector3(mp.inf,mp.inf,tORG+rs),
                     center=mp.Vector3(z=0.5*sz-tABS-tGLS-tITO-0.5*(tORG+rs))),
            mp.Block(material=Al,
                     size=mp.Vector3(mp.inf,mp.inf,tAl),
                     center=mp.Vector3(z=0.5*sz-tABS-tGLS-tITO-tORG-rs-0.5*tAl))]

si = -0.5*sxy+0.5*ps+0.5*(sxy-ps*math.floor(sxy/ps))
di = ps
ni = math.floor(sxy/ps)

def sph(cx,cy):
    if mp.am_master():
        dpos = dfrac*ps*random()
    else:
        dpos = None
    dpos = mp.comm.bcast(dpos, root=0)
    return mp.Sphere(material=Al,
                     radius=rs,
                     center=mp.Vector3(cx+dpos,cy+dpos,0.5*sz-tABS-tGLS-tITO-tORG-rs))

for cx in np.arange(si,si+ni*di,di):
    for cy in np.arange(si,si+ni*di,di):
        geometry.append(sph(cx,cy))

num_src = 20                 # number of point sources                                                                                                                                            
sources = [];
for n in range(num_src):
    sources.append(mp.Source(mp.GaussianSource(fcen, fwidth=df),
                             component=mp.Ez,
                             center=mp.Vector3(z=0.5*sz-tABS-tGLS-tITO-0.4*tORG-0.2*tORG*n/num_src)))

sim = mp.Simulation(resolution=resolution,
                    cell_size=cell_size,
                    boundary_layers=boundary_layers,
                    geometry=geometry,
                    sources=sources,
                    eps_averaging=False,
                    split_chunks_evenly=False)

# surround source with a six-sided box of flux planes                                                                                                                                             
srcbox_width = 0.05
srcbox_top = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tPML-tGLS), size=mp.Vector3(srcbox_width,srcbox_width,0), direction=mp.Z, weight=+1))
srcbox_bot = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tPML-tGLS-tITO-0.8*tORG), size=mp.Vector3(srcbox_width,srcbox_width,0), direction=mp.Z, weight=-1))
srcbox_xp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0.5*srcbox_width,0,0.5*sz-tPML-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(0,srcbox_width,tITO+0.8*tORG), direction=mp.X\
, weight=+1))
srcbox_xm = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(-0.5*srcbox_width,0,0.5*sz-tPML-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(0,srcbox_width,tITO+0.8*tORG), direction=mp.\
X, weight=-1))
srcbox_yp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0.5*srcbox_width,0.5*sz-tPML-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(srcbox_width,0,tITO+0.8*tORG), direction=mp.Y\
, weight=+1))
srcbox_ym = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,-0.5*srcbox_width,0.5*sz-tPML-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(srcbox_width,0,tITO+0.8*tORG), direction=mp.\
Y, weight=-1))

# padding for flux box to fully capture waveguide mode                                                                                                                                            
fluxbox_dpad = 0.05

glass_flux = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tPML-(tGLS-fluxbox_dpad)), size = mp.Vector3(L,L,0), direction=mp.Z, weight=+1))
wvgbox_xp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(0,L,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.X, center=mp.Vector3(0.5*L,0,0.5*sz-tPML-tGLS-0.5*(tITO+tORG)), we\
ight=+1))
wvgbox_xm = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(0,L,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.X, center=mp.Vector3(-0.5*L,0,0.5*sz-tPML-tGLS-0.5*(tITO+tORG)), w\
eight=-1))
wvgbox_yp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(L,0,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.Y, center=mp.Vector3(0,0.5*L,0.5*sz-tPML-tGLS-0.5*(tITO+tORG)), we\
ight=+1))
wvgbox_ym = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(L,0,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.Y, center=mp.Vector3(0,-0.5*L,0.5*sz-tPML-tGLS-0.5*(tITO+tORG)), w\
eight=-1))

mp.verbosity(2)
sim.init_sim()

oskooi · 2019-10-09T02:10:41Z

The latest commit provides just a slight improvement but the time for choose_chunkdivision is still comparable to that of master:

Splitting into 18 chunks by cost
time for choose_chunkdivision = 195.99 s

Also, the latest commit seems to be ignoring the verbose parameter and printing to standard output all 500+ geometric objects.

ChristopherHogan · 2019-10-10T01:59:54Z

This is ready, although, as @oskooi says, it doesn't seem to make things any faster. refl_angular.py will fail in this PR because of NanoComp/libctl#43.

oskooi · 2019-10-23T22:56:13Z

Unrelated to this PR, @stevengj's object_vol branch of NanoComp/libctl provides a speedup of more than 6X compared to master for the test shown above:

Using MPI version 3.0, 18 processes
-----------
Initializing structure...
Splitting into 18 chunks by cost
time for choose_chunkdivision = 31.2376 s
Working in 3D dimensions.
Computational cell is 7.4 x 7.4 x 3.90909 with resolution 55
     block, center = (0,0,0.55)
          size (1e+20,1e+20,2.8)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.1)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     block, center = (0,0,-1.6)
          size (1e+20,1e+20,0.5)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (3.0625,3.0625,3.0625)
     block, center = (0,0,-1.95)
          size (1e+20,1e+20,0.2)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.44194,-3.44194,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.40394,-3.10394,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.40442,-2.80442,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.41241,-2.51241,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.42602,-2.22602,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     sphere, center = (-3.41229,-1.91229,-1.85)
          radius 0.1
          dielectric constant epsilon diagonal = (1,1,1)
     ...(+ 570 objects not shown)...
time for set_epsilon = 3.52622 s
time for set_conductivity = 0.453251 s
time for set_conductivity = 0.452212 s
time for set_conductivity = 0.457252 s
time for set_conductivity = 0.45628 s
time for set_conductivity = 0.459845 s
time for set_conductivity = 0.457122 s
lorentzian susceptibility: frequency=2.80116, gamma=2.72777
lorentzian susceptibility: frequency=1.45825, gamma=1.08966
lorentzian susceptibility: frequency=1.24532, gamma=0.251645
lorentzian susceptibility: frequency=0.130662, gamma=0.268583
drude susceptibility: frequency=1e-10, gamma=0.0379081
lorentzian susceptibility: frequency=5.48457, gamma=0.513775
lorentzian susceptibility: frequency=0.101049, gamma=0
lorentzian susceptibility: frequency=8.60279, gamma=0
lorentzian susceptibility: frequency=14.619, gamma=0
-----------
estimated costs per process: 4769.7, 4780.76, 6030.21, 6150.92, 6026.44, 6101.62, 5544.01, 5788.56, 4770.31, 4788.68, 5408.71, 5504.66, 5538.32, 5781.88, 4724.99, 4829.21, 5416.83, 5467.45
estimated cost mean = 5412.4, stddev = 516.089

Elapsed run time = 62.3301 s

The time for choose_chunkdivision is 31.2376 s compared to 193.894 s for master — a speedup of 6.2X. Note that the estimated cost mean and stddev are similar: 5412.4 ± 516.089 vs. 5321.2 ± 193.249.

ChristopherHogan added 3 commits September 15, 2019 14:54

WIP: geom_box_tree for faster fragment_stats

06bb2fd

WIP: speed up choose_chunkdivision with geom_box_tree

10420b5

WIP: Speed up fragment_stats with geom_box_tree

db9fe35

stevengj reviewed Oct 2, 2019

View reviewed changes

Working much better now

404d610

Remove debug code and fix anisotropic bug

6778a69

ChristopherHogan mentioned this pull request Nov 13, 2019

use geom_box_tree for fragment stats calculation #996

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Speed up choose_chunkdivision with geom_boxt_tree #1030

WIP: Speed up choose_chunkdivision with geom_boxt_tree #1030

ChristopherHogan commented Sep 29, 2019

stevengj Oct 2, 2019

stevengj Oct 2, 2019

stevengj Oct 2, 2019

stevengj Oct 2, 2019

stevengj commented Oct 2, 2019

ChristopherHogan commented Oct 2, 2019

oskooi commented Oct 9, 2019

oskooi commented Oct 9, 2019

ChristopherHogan commented Oct 10, 2019

oskooi commented Oct 23, 2019

WIP: Speed up choose_chunkdivision with geom_boxt_tree #1030

Are you sure you want to change the base?

WIP: Speed up choose_chunkdivision with geom_boxt_tree #1030

Conversation

ChristopherHogan commented Sep 29, 2019

stevengj Oct 2, 2019

Choose a reason for hiding this comment

stevengj Oct 2, 2019

Choose a reason for hiding this comment

stevengj Oct 2, 2019

Choose a reason for hiding this comment

stevengj Oct 2, 2019

Choose a reason for hiding this comment

stevengj commented Oct 2, 2019

ChristopherHogan commented Oct 2, 2019

oskooi commented Oct 9, 2019

oskooi commented Oct 9, 2019

ChristopherHogan commented Oct 10, 2019

oskooi commented Oct 23, 2019