
Commit

additional benchmarking results for runtime scaling on MPI clusters (NanoComp#1272)

* additional benchmarking results for runtime scaling on MPI clusters

* plot 1/time vs P and expand discussion of linear scaling
oskooi authored Jul 3, 2020
1 parent f867dfd commit 06f567d
Showing 9 changed files with 26 additions and 9 deletions.
5 changes: 2 additions & 3 deletions doc/docs/FAQ.md
@@ -318,8 +318,7 @@ Only the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of eigenva

### How do I model graphene or other 2d materials with single-atom thickness?

Typically, graphene and similar "2d" materials are mathematically represented as a [delta function](https://en.wikipedia.org/wiki/Dirac_delta_function) conductivity in Maxwell's equations because their thickness is negligible compared to the wavelength. In
a discretized computer model like Meep, this is approximated by a volume conductivity that is one pixel (`1/resolution`) thick *and* has an amplitude scaled by `resolution`. Such a one-pixel-thick [conductor](Materials.md#conductivity-and-complex) can be represented by e.g. a [`Block`](Python_User_Interface.md#block) with `size=meep.Vector3(x,y,1/resolution)` in a 3d cell, with the value of the conductivity explicitly multiplied by `resolution`.
Typically, graphene and similar "2d" materials are mathematically represented as a [delta function](https://en.wikipedia.org/wiki/Dirac_delta_function) conductivity in Maxwell's equations because their thickness is negligible compared to the wavelength. In a discretized computer model like Meep, this is approximated by a volume conductivity that is one pixel (`1/resolution`) thick *and* has an amplitude scaled by `resolution`. Such a one-pixel-thick [conductor](Materials.md#conductivity-and-complex) can be represented by e.g. a [`Block`](Python_User_Interface.md#block) with `size=meep.Vector3(x,y,1/resolution)` in a 3d cell, with the value of the conductivity explicitly multiplied by `resolution`. For accurate results, the Yee grid resolution should be comparable to a single-atom thickness. Unfortunately, this means that the memory requirements of even a 2d simulation will be quite large. Finite-element methods which are based on variable grid resolution may be more suitable for these types of problems.
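
As a rough illustration of this recipe, here is a minimal sketch; the conductivity `sig_d`, the lateral dimensions, and the resolution below are placeholders rather than values from this FAQ:

```python
import meep as mp

resolution = 200     # pixels/um; the sheet below is exactly one pixel thick
sig_d = 1.0          # placeholder bulk conductivity (Meep units)

# one-pixel-thick conducting sheet in the xy plane of a 3d cell;
# the conductivity is explicitly multiplied by the resolution
sheet = mp.Block(center=mp.Vector3(),
                 size=mp.Vector3(1, 1, 1/resolution),
                 material=mp.Medium(D_conductivity=sig_d*resolution))

sim = mp.Simulation(cell_size=mp.Vector3(2, 2, 2),
                    resolution=resolution,
                    geometry=[sheet])
```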

### How do I model a continuously varying permittivity?

@@ -357,7 +356,7 @@ Even if you disable the subpixel averaging, however, when you output the dielect

### Why does subpixel averaging take so long?

There are at least two possible reasons due to using: (1) a [material function](Python_User_Interface.md#material-function) to define a [`Medium`](Python_User_Interface.md#medium) object or (2) the [C++](C++_Tutorial) interface. Unlike either the [Python](Python_User_Interface/) or [Scheme](Scheme_User_Interface/) interfaces which are based on analytically computing the averaged permittivity for boundary voxels involving at most one [`GeometricObject`](Python_User_Interface.md#geometricobject) (e.g., `Sphere`, `Prism`, `Block`, etc.), the C++ interface computes these averages from the material_function using [numerical quadrature](https://en.wikipedia.org/wiki/Numerical_integration) if the parameter `use_anisotropic_averaging=true` is passed to the constructor of `set_materials`. This procedure involves calling the material_function many times for every voxel in the [structure object](C++_Developer_Information.md#data-structures-and-chunks) which can be slow due to the [SWIG](http://www.swig.org/) callbacks, particularly because the voxel density is repeatedly doubled until a given threshold tolerance (`subpixel_tol`) or maximum iteration number (`subpixel_maxeval`) is reached. Because of this discrepancy in the subpixel averaging, the results for the C++ and Python/Scheme interfaces may be slightly different at the same resolution. You can potentially speed up subpixel averaging by increasing `subpixel_tol` or decreasing `subpixel_maxeval`. **Note:** the slow callbacks may still be noticeable during the grid initialization *even when subpixel averaging is turned off*. Just remember that if you turn off subpixel averaging, it usually means that you may need to increase the grid resolution to obtain the same accuracy. You will have to determine how much accuracy you want to trade for time. Alternatively, in the C++ interface you can use the [`meepgeom.hpp`](https://github.com/NanoComp/meep/blob/master/src/meepgeom.hpp) routines to define your geometry in terms of blocks, cylinders, etcetera similar to Python and Scheme, with semi-analytical subpixel averaging.
There are at least two possible reasons due to using: (1) a [material function](Python_User_Interface.md#material-function) to define a [`Medium`](Python_User_Interface.md#medium) object or (2) the [C++](C++_Tutorial) interface. Unlike either the [Python](Python_User_Interface/) or [Scheme](Scheme_User_Interface/) interfaces which are based on analytically computing the averaged permittivity for boundary voxels involving at most one [`GeometricObject`](Python_User_Interface.md#geometricobject) (e.g., `Sphere`, `Prism`, `Block`, etc.), the C++ interface computes these averages from the material function using [numerical quadrature](https://en.wikipedia.org/wiki/Numerical_integration) if the parameter `use_anisotropic_averaging=true` is passed to the constructor of `set_materials`. This procedure involves calling the material function many times for every voxel in the [structure object](C++_Developer_Information.md#data-structures-and-chunks) which can be slow due to the [SWIG](http://www.swig.org/) callbacks, particularly because the voxel density is repeatedly doubled until a given threshold tolerance (`subpixel_tol`) or maximum iteration number (`subpixel_maxeval`) is reached. Because of this discrepancy in the subpixel averaging, the results for the C++ and Python/Scheme interfaces may be slightly different at the same resolution. You can potentially speed up subpixel averaging by increasing `subpixel_tol` or decreasing `subpixel_maxeval`. **Note:** the slow callbacks may still be noticeable during the grid initialization *even when subpixel averaging is turned off*. Just remember that if you turn off subpixel averaging, it usually means that you may need to increase the grid resolution to obtain the same accuracy. You will have to determine how much accuracy you want to trade for time. Alternatively, in the C++ interface you can use the [`meepgeom.hpp`](https://github.com/NanoComp/meep/blob/master/src/meepgeom.hpp) routines to define your geometry in terms of blocks, cylinders, etcetera similar to Python and Scheme, with semi-analytical subpixel averaging.
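
As a hedged sketch of how these knobs can be adjusted from Python when using a material function (the function body and parameter values below are placeholders):

```python
import meep as mp

def my_material(p):
    # hypothetical user-defined material function (placeholder logic)
    return mp.Medium(epsilon=12 if p.norm() < 0.5 else 1)

sim = mp.Simulation(cell_size=mp.Vector3(2, 2, 2),
                    resolution=40,
                    default_material=my_material,
                    eps_averaging=True,      # subpixel averaging enabled
                    subpixel_tol=1e-3,       # looser tolerance -> fewer quadrature refinements
                    subpixel_maxeval=1000)   # cap on evaluations per boundary voxel
```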

### Can subpixel averaging be applied to dispersive materials?

4 changes: 2 additions & 2 deletions doc/docs/Field_Functions.md
@@ -139,6 +139,8 @@ meep.Simulation.run(meep.synchronized_magnetic(meep.at_every(1,my_weird_output))
(run-until 200 (synchronized-magnetic (at-every 1 my-weird-output)))
```

For more information, see [Python User Interface/Writing Your Own Step Functions](Python_User_Interface.md#writing-your-own-step-functions) or [Scheme User Interface/Writing Your Own Step Functions](Scheme_User_Interface.md#writing-your-own-step-functions).

Coordinates of the Yee Grid
---------------------------

@@ -163,5 +165,3 @@ meep.Simulation.integrate_field_function([mp.Ex],f)
```

This function prints the $(x,y,z)$ Yee grid coordinates of all $E_x$ fields and returns a value of 0 which is never used. In contrast, the output functions `output_field_function` (Python) or `output-field-function` (Scheme) (as well as `output-real-field-function`) interpolate *all* fields onto the center of each grid point.
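
For completeness, a minimal sketch of what such a field function might look like in Python (the body below is illustrative rather than copied from this page):

```python
import meep as mp

def f(r, ex):
    # r is the Yee-grid position of the Ex component; ex is its (complex) value
    print("Ex at ({}, {}, {})".format(r.x, r.y, r.z))
    return 0   # the returned value is integrated but not used here

# usage, given an existing simulation `sim`:
#   sim.integrate_field_function([mp.Ex], f)
```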

For more information, see [Python User Interface/Writing Your Own Step Functions](Python_User_Interface.md#writing-your-own-step-functions) or [Scheme User Interface/Writing Your Own Step Functions](Scheme_User_Interface.md#writing-your-own-step-functions).
24 changes: 21 additions & 3 deletions doc/docs/Parallel_Meep.md
@@ -97,9 +97,9 @@ For large multicore jobs with I/O, it may be necessary to have `(meep-all-wait)`
Runtime Scaling on MPI Clusters
-------------------------------

The following are benchmarking results of the total runtime vs. number of processors/chunks for a 3d [OLED](http://www.simpetus.com/projects.html#meep_oled) simulation involving [Lorentzian susceptibility](Python_User_Interface.md#lorentziansusceptibility), [Absorber](Python_User_Interface.md#absorber), 1d [PML](Python_User_Interface.md#pml), and [DFT flux monitors](Python_User_Interface.md#flux-spectra) for MPI clusters of [n1-standard-16](https://cloud.google.com/compute/docs/machine-types#n1_machine_type) instances on the [Google Cloud Platform](https://cloud.google.com/). Hyperthreading is disabled and one slot on each instance/node is reserved for kernel tasks leaving 7 slots/node. [elasticluster](https://elasticluster.readthedocs.io/en/latest/) is used for the cluster management and [grid engine](https://en.wikipedia.org/wiki/Oracle_Grid_Engine) for the job scheduler. Even chunk splitting is used (`split_chunks_evenly=True`). The size of the clusters ranges from 2 to 14 nodes (14 to 98 processors).
The following are benchmarking results of the total runtime vs. number of processors for a 3d [OLED](http://www.simpetus.com/projects.html#meep_oled) simulation involving [Lorentzian susceptibility](Python_User_Interface.md#lorentziansusceptibility), [Absorber](Python_User_Interface.md#absorber), 1d [PML](Python_User_Interface.md#pml), and [DFT flux monitors](Python_User_Interface.md#flux-spectra) for [MPICH](https://www.mpich.org/) clusters of [n1-standard-16](https://cloud.google.com/compute/docs/machine-types#n1_machine_type) instances (8 single-threaded cores) on the [Google Cloud Platform](https://cloud.google.com/) (GCP). One slot on each node is reserved for kernel tasks leaving 7 slots per node. The software stack includes Ubuntu 16.04, the Meep [nightly build Conda package](Installation.md#nightly-builds), [elasticluster](https://elasticluster.readthedocs.io/en/latest/) for the cluster management, and [grid engine](https://en.wikipedia.org/wiki/Oracle_Grid_Engine) for the job scheduler. In order to reduce [cache contention](https://en.wikipedia.org/wiki/Resource_contention), process affinity is used via the `mpirun` option `-bind-to core`. Meep's simulation domain is split into equal-sized [chunks](Chunks_and_Symmetry.md#chunks-and-symmetry) (`split_chunks_evenly=True`). The size of the clusters ranges from 2 to 14 nodes (14 to 98 processors).

As shown in the first figure below, the runtime reaches a minimum at 77 processors. The second figure shows the scaling of the ratio of the time spent on communication (MPI/synchronization) to the computation (time stepping and DFTs). This ratio is a measure of the parallelization efficiency. The crossover point when the parallelization efficiency becomes larger than one corresponds well to the minimum runtime of the first figure.
As shown in the first figure below, the runtime reaches a minimum at 77 processors. The second figure shows the scaling of the ratio of the mean time spent on communication (MPI/synchronization) to computation (time stepping and DFTs). (These timing metrics were obtained using the routine [`Simulation.mean_time_spent_on`](Python_User_Interface.md#simulation-time).) This ratio is a measure of the parallelization efficiency. The crossover point at which this ratio becomes larger than one (the regime in which the simulation is constrained by the network bandwidth rather than the CPU clock speed) corresponds well to the minimum runtime in the first figure.
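
As a rough sketch of how such a ratio might be computed after a run; the integer time-sink codes passed to `mean_time_spent_on` are assumptions and should be checked against the [Python interface](Python_User_Interface.md#simulation-time):

```python
import meep as mp

def comm_comp_ratio(sim):
    """Ratio of mean MPI/synchronization time to mean computation
    (time stepping + DFT) time for a completed run. The integer
    time-sink codes below are assumptions, not authoritative."""
    stepping = sim.mean_time_spent_on(1)                          # time stepping (assumed code)
    mpi = sim.mean_time_spent_on(3) + sim.mean_time_spent_on(4)   # MPI all-to-all + one-to-one (assumed codes)
    dft = sim.mean_time_spent_on(6)                               # Fourier transforming (assumed code)
    return mpi / (stepping + dft)

# usage (after sim.run(...) has completed):
#   if mp.am_master():
#       print("communication/computation ratio:", comm_comp_ratio(sim))
```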

<center>
![](images/parallel_benchmark_runtime_vs_nprocs.png)
@@ -109,4 +109,22 @@ As shown in the first figure below, the runtime reaches a minimum at 77 processo
![](images/parallel_benchmark_commcomp_vs_nprocs.png)
</center>

The results are not continuous because as the number of processors changes slightly (e.g., from 42 to 49), the chunk divisions can change by a lot (i.e., it can switch from splitting some chunk along the *x* axis to along the *y* axis).
These results are not continuous because as the number of processors changes slightly (e.g., from 42 to 49), the chunk divisions can change by a lot (i.e., a chunk can switch from being split along the *x* axis to the *y* axis), which significantly affects the runtime performance.
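
One way to see how the chunk divisions change with the number of processes is to inspect the layout directly. A minimal sketch (the cell size and resolution are placeholders):

```python
import meep as mp

sim = mp.Simulation(cell_size=mp.Vector3(5, 5, 5),
                    resolution=20,
                    split_chunks_evenly=True)  # equal-sized chunks, as in this benchmark
sim.init_sim()          # create the structure (and hence the chunk divisions)

# plot the chunk layout for the current number of MPI processes
sim.visualize_chunks()
```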

We can also analyze the per-processor times spent on time-stepping, MPI/synchronization, and DFT as shown in the next figure for the case of a cluster with 35 processors (5 nodes). Because the simulation is not properly load balanced due to the equal-sized chunks, there is a large variation in the timings across processors, particularly for the DFT, where several processors are idle (i.e., their chunks do not contain any DFT pixels).
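
A sketch of how the per-process spread might be gathered, assuming `Simulation.time_spent_on` returns one timing per MPI process (check the Python interface docs) and using the same assumed time-sink code for the DFT as above:

```python
import numpy as np
import meep as mp

def dft_time_spread(sim):
    """Min/max DFT time across MPI processes for a completed run.
    Assumes time-sink code 6 corresponds to Fourier transforming."""
    per_proc = np.asarray(sim.time_spent_on(6))
    return per_proc.min(), per_proc.max()

# usage (after sim.run(...)):
#   if mp.am_master():
#       print("DFT time per process: min=%g s, max=%g s" % dft_time_spread(sim))
```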

<center>
![](images/parallel_benchmark_barplot.png)
</center>

Based on these results, we plot the average of the *inverse* of the timings (proportional to the number of cycles per second, a quantity which can demonstrate linear scaling) for the time-stepping and DFT over the full range of cluster sizes. The time-stepping results demonstrate roughly linear scaling, although the variation increases with the number of processors, mainly as a result of the increasingly pronounced networking effects of the virtual cluster (the N1 instances on GCP do *not* support colocation via a [compact placement policy](https://cloud.google.com/compute/docs/instances/define-instance-placement)). The DFT results seem to be converging to a constant. This is not surprising because, for any given number of processors/chunks, only a subset actually contain DFT pixels. The processor that takes the longest time to update its DFT pixels therefore sets an upper bound on how fast the DFT calculation can proceed for all processors. It is this upper bound which is revealed by the scaling plot.

See also [FAQ/Should I expect linear speedup from the parallel Meep?](FAQ.md#should-i-expect-linear-speedup-from-the-parallel-meep)

<center>
![](images/parallel_benchmark_timestep.png)
</center>

<center>
![](images/parallel_benchmark_DFT.png)
</center>
2 changes: 1 addition & 1 deletion doc/docs/Scheme_User_Interface.md
@@ -874,7 +874,7 @@ Given a list of field components `cs`, compute the Fourier transform of these fi

**`(flux-in-box dir box)`**
Given a `direction` constant, and a `meep::volume*`, returns the flux (the integral of $\Re [\mathbf{E}^* \times \mathbf{H}]$) in that volume. Most commonly, you specify a volume that is a plane or a line, and a direction perpendicular to it, e.g. `(flux-in-box `X (volume (center 0) (size 0 1 1)))`.
Given a `direction` constant, and a `meep::volume*`, returns the flux (the integral of $\Re [\mathbf{E}^* \times \mathbf{H}]$) in that volume. Most commonly, you specify a volume that is a plane or a line, and a direction perpendicular to it, e.g. `(flux-in-box X (volume (center 0) (size 0 1 1)))`.

**`(electric-energy-in-box box)`**
Binary file added doc/docs/images/parallel_benchmark_DFT.png
Binary file added doc/docs/images/parallel_benchmark_barplot.png
Binary file modified doc/docs/images/parallel_benchmark_commcomp_vs_nprocs.png
Binary file modified doc/docs/images/parallel_benchmark_runtime_vs_nprocs.png
Binary file added doc/docs/images/parallel_benchmark_timestep.png
