
Sharded fields dump #1738

Merged
merged 20 commits into master on Sep 23, 2021
Conversation

@kkg4theweb (Contributor) commented Aug 23, 2021

Add support for dumping (fields::dump) and loading (fields::load) the fields. Along with structure::{dump,load}, this allows saving the simulation state at any arbitrary time-step and then restoring from the saved state to continue the simulation from that point.

This PR also adds a basic C++ test that dumps/loads the structure and fields. It also moves the existing structure dump/load tests into their own Python test (test_dump_load.py) and adds tests for dumping/loading the fields as well.
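For illustration, here is a minimal sketch of the intended checkpoint/restore flow at the C++ level. The dump/load signatures and filenames here are assumptions for illustration (structure::{dump,load} came from #1715; fields::{dump,load} are added here), not the final API:

#include <meep.hpp>
using namespace meep;

double eps(const vec &) { return 1.0; } // uniform vacuum for simplicity

int main(int argc, char **argv) {
  initialize mpi(argc, argv);             // MPI setup/teardown
  grid_volume gv = vol2d(5.0, 5.0, 50.0); // 5 x 5 cell, resolution 50
  structure s(gv, eps, pml(1.0));
  fields f(&s);
  f.add_point_source(Ez, 1.0, 1.0, 0.0, 4.0, vec(2.5, 2.5));
  while (f.time() < 25.0) f.step();

  s.dump("structure.h5"); // assumed signature (structure dump from #1715)
  f.dump("fields.h5");    // assumed signature (fields dump from this PR)

  // ... later: rebuild the same structure/fields, restore the saved
  // state, and continue timestepping from t = 25 instead of t = 0.
  structure s2(gv, eps, pml(1.0));
  s2.load("structure.h5");
  fields f2(&s2);
  f2.add_point_source(Ez, 1.0, 1.0, 0.0, 4.0, vec(2.5, 2.5));
  f2.load("fields.h5");
  while (f2.time() < 50.0) f2.step();
  return 0;
}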

@oskooi (Collaborator) left a comment

Note that the CI is showing that the C++ unit test added in this PR (dump_load.cpp) is failing, which should probably be fixed first:

============================================
   meep 1.21.0-beta: tests/test-suite.log
============================================

# TOTAL: 20
# PASS:  19
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: dump_load
===============

Using MPI version 3.1, 2 processes
Testing 3D dump/load: temp_dir = /tmp/meepQOXddW...
Testing pml quality...
Dumping structure: /tmp/meepQOXddW/test_pml-structure-original
Dumping fields: /tmp/meepQOXddW/test_pml-fields-original
Got newE/oldE of 1.66447e-05
Got newE/oldE of -3.83624e-08
Got newE/oldE of 2.1382e-09
Dumping structure: /tmp/meepQOXddW/test_pml-structure-after-sim
Dumping fields: /tmp/meepQOXddW/test_pml-fields-after_sim
Loading structure: /tmp/meepQOXddW/test_pml-structure-after-sim
Dumping structure: /tmp/meepQOXddW/test_pml-structure-dump-loaded
Loading fields: /tmp/meepQOXddW/test_pml-fields-after_sim
meep: error on line 244 of ../../../src/h5file.cpp: missing dataset in HDF5 file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
FAIL dump_load (exit status: 1)

f1.step();
if (!compare_point(f, f1, vec(0.5, 0.01, 0.5))) return 0;
if (!compare_point(f, f1, vec(0.46, 0.33, 0.2))) return 0;
if (!compare_point(f, f1, vec(1.0, 0.25, 0.301))) return 0;
Collaborator:

This series of unit tests (which seems to be copied from tests/three_d.cpp) compares the fields from two different structures (with and without chunk splitting) but does not involve anything related to either the fields or structure dump introduced in this PR. Is it necessary?

}

std::string structure_filename_after_sim =
    structure_dump(&s, filename_prefix, "after-sim");
Collaborator:

For most practical applications the structure is time-independent, and thus the structure object can be dumped to file just once at any point during the simulation.

for (int i = 0; i < num_chunks; i++) {
  my_num_chunks += (single_parallel_file || chunks[i]->is_mine());
}
size_t num_f_size = my_num_chunks * NUM_FIELD_COMPONENTS * 2;
Collaborator:

Meep uses real fields by default and only uses complex fields under certain conditions (e.g., a non-zero k_point, force_complex_fields=True in the Simulation constructor, etc.). The hard-coded factor of 2 in this line should therefore be changed to either 1 or 2 depending on whether the complex field component is a non-NULL pointer.
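To make the suggestion concrete, here is a sketch of the kind of per-component counting being proposed; member names follow the snippet above, and the handling of non-owned chunks in single-parallel-file mode (whose pointers are not meaningful locally) is glossed over:

size_t num_f_size = 0;
for (int i = 0; i < num_chunks; i++) {
  if (!chunks[i]->is_mine()) continue;    // see the shared-file caveat above
  for (int c = 0; c < NUM_FIELD_COMPONENTS; ++c)
    for (int cmp = 0; cmp < 2; ++cmp)     // real (0) and imaginary (1) parts
      if (chunks[i]->f[c][cmp])           // imaginary part is NULL when is_real
        ++num_f_size;
}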

Collaborator:

If !is_real then the imaginary pointers are NULL and so it should do the right thing, I think?

Contributor (author):

A few lines below you can see that the dump (or load) is done only for the non-null complex field components.

@oskooi (Collaborator) commented Aug 27, 2021

The error reported above for dump_load.cpp also occurs for the Python unit test test_dump_load.py, but only when using parallel Meep with 2 or more MPI processes. Executing these two tests with serial Meep does run them, but they then fail due to a fields mismatch assertion. There seems to be a bug when two MPI processes try to write to the same file.

@oskooi (Collaborator) commented Aug 27, 2021

I think there is a bug in test_dump_load.py where the reference run and the load-fields run are offset by a single timestep due to this line:

        # Run until t=1 so that fields are initialized.
        sim.run(mp.at_every(5, get_field_point), until=1)

After removing this line as well as geometry, boundary_layers, and symmetries from the Simulation constructor, the test which compares the fields from the two runs at identical times now passes:

    # Tests dumping/loading of fields & structure.                                                                
    def _load_dump_fields(self, single_parallel_file=True):
        resolution = 50
        cell = mp.Vector3(5, 5)
        sources = mp.Source(src=mp.GaussianSource(1, fwidth=1.0),
                            center=mp.Vector3(),
                            component=mp.Ez)
        sim1 = mp.Simulation(resolution=resolution,
                             cell_size=cell,
                             sources=[sources])

        sample_point = mp.Vector3(0.12, -0.29)
        ref_field_points = {}

        def get_ref_field_point(sim):
            p = sim.get_field_point(mp.Ez, sample_point)
            ref_field_points[sim.meep_time()] = p.real

        # First run until t=25 and save structure/fields                                                          
        sim1.run(mp.at_every(5, get_ref_field_point), until=25)

        dump_dirname = os.path.join(self.temp_dir, 'test_load_dump_fields')
        sim1.dump(dump_dirname,
                  dump_structure=False,
                  dump_fields=True,
                  single_parallel_file=single_parallel_file)

        # Then continue running another 25 until t=50                                                             
        sim1.run(mp.at_every(5, get_ref_field_point), until=25)
        print('ref_field_points = ' + str(ref_field_points))

        # Now create a new simulation and try restoring state.                                                    
        sim = mp.Simulation(resolution=resolution,
                            cell_size=cell,
                            sources=[sources])

        sim.init_sim()
        field_points = {}

        def get_field_point(sim):
            p = sim.get_field_point(mp.Ez, sample_point)
            field_points[sim.meep_time()] = p.real

        # Now load the fields.                                                                                    
        sim.load(dump_dirname,
                 load_structure=False,
                 load_fields=True,
                 single_parallel_file=single_parallel_file)
        sim.run(mp.at_every(5, get_field_point), until=25)
        print('field_points = ' + str(field_points))

        for t, v in field_points.items():
            self.assertAlmostEqual(ref_field_points[t], v)

output

Using MPI version 3.1, 1 processes
Saving temp files to dir: /tmp/meepNNflLc
ssRunning test_load_dump_fields
-----------
Initializing structure...
time for choose_chunkdivision = 1.3597e-05 s
Working in 2D dimensions.
Computational cell is 5 x 5 x 0 with resolution 50
time for set_epsilon = 0.0829253 s
-----------
run 0 finished at t = 25.0 (2500 timesteps)
creating fields output file "/tmp/meepNNflLc/test_load_dump_fields/fields.h5" (1)...
Dumped fields to file: /tmp/meepNNflLc/test_load_dump_fields/fields.h5 (True)
run 1 finished at t = 50.0 (5000 timesteps)
ref_field_points = {50.0: 0.17848575860261917, 35.0: -0.09694628417491913, 20.0: 0.12576791644096375, 5.0: -0.32930123805999756, 40.0: -0.12000752240419388, 25.0: -0.24073712527751923, 10.0: 0.1038854569196701, 45.0: 0.05137018859386444, 30.0: 0.24092617630958557, 15.0: 0.07364945858716965}
-----------
Initializing structure...
time for choose_chunkdivision = 5.71e-06 s
Working in 2D dimensions.
Computational cell is 5 x 5 x 0 with resolution 50
time for set_epsilon = 0.0641066 s
-----------
reading fields from file "/tmp/meepNNflLc/test_load_dump_fields/fields.h5" (1)...
Loaded fields from file: /tmp/meepNNflLc/test_load_dump_fields/fields.h5 (True)
on time step 2500 (time=25), 2.94857 s/step
run 0 finished at t = 50.0 (5000 timesteps)
field_points = {50.0: 0.17848575860261917, 35.0: -0.09694628417491913, 40.0: -0.12000752240419388, 25.0: -0.24073712527751923, 45.0: 0.05137018859386444, 30.0: 0.24092617630958557}
.sss
----------------------------------------------------------------------
Ran 6 tests in 0.671s

OK (skipped=5)

@oskooi (Collaborator) commented Aug 28, 2021

In test_dump_load.py, after replacing the line:

sim.run(mp.at_every(5, get_field_point), until=1)

with

sim.init_sim()

The subtest test_load_dump_fields fails because of the Al (aluminum) material used in one of the two Block objects of the geometry:

        geometry = [mp.Block(material=Al, center=mp.Vector3(), size=one_by_one),
                    mp.Block(material=mp.Medium(epsilon=13), center=mp.Vector3(1), size=one_by_one)]

The reason for this test failure is simple: there is a polarization_state object of the fields class which is not being dumped and loaded along with the other field objects, as it needs to be:

meep/src/meep.hpp

Line 1466 in cdae1a8

polarization_state *pol[NUM_FIELD_TYPES]; // array of npol[i] polarization_state structures

As a verification that this is indeed the cause of the failing test, replace material=Al with e.g. material=mp.Medium(index=3.2) in the geometry definition above and test_dump_load.py will then pass.

The slight challenge, though, for dumping/loading the polarization_state is that it is a typedef:

meep/src/meep.hpp

Lines 1424 to 1429 in cdae1a8

// data for each susceptibility
typedef struct polarization_state_s {
  void *data;                        // internal polarization data for the susceptibility
  const susceptibility *s;
  struct polarization_state_s *next; // linked list
} polarization_state;

in which the susceptibility object is a generic class:

meep/src/meep.hpp

Lines 73 to 75 in cdae1a8

/* generic base class, only used by subclassing: represents susceptibility
polarizability vector P = chi(omega) W (where W = E or H). */
class susceptibility {

Dispersive materials such as Al are lorentzian_susceptibility objects, which derive from susceptibility:

meep/src/meep.hpp

Lines 236 to 240 in cdae1a8

/* a Lorentzian susceptibility
\chi(\omega) = sigma * omega_0^2 / (\omega_0^2 - \omega^2 - i\gamma \omega)
If no_omega_0_denominator is true, then we omit the omega_0^2 factor in the
denominator to obtain a Drude model. */
class lorentzian_susceptibility : public susceptibility {

There are other derived classes of susceptibility besides lorentzian_susceptibility, such as multilevel_susceptibility, gyrotropic_susceptibility, etc., but for now we can just support checkpointing lorentzian_susceptibility objects since they are more common than the other varieties.
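As an interim measure (and, per the commit log below, what this PR ended up doing), fields::dump can simply refuse to checkpoint when any chunk carries polarization state. A sketch, using the pol member quoted above (the exact loop placement inside fields::dump is an assumption):

for (int i = 0; i < num_chunks; i++) {
  if (!chunks[i]->is_mine()) continue;
  FOR_FIELD_TYPES(ft) {
    if (chunks[i]->pol[ft] != NULL)
      meep::abort("fields::dump does not support dispersive materials "
                  "(non-null polarization state)");
  }
}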

@kkg4theweb kkg4theweb changed the title Sharded fields dump (WIP) Sharded fields dump Sep 9, 2021
@codecov-commenter commented Sep 9, 2021

Codecov Report

Merging #1738 (b8bcf9d) into master (3da5613) will increase coverage by 0.92%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master    #1738      +/-   ##
==========================================
+ Coverage   73.46%   74.39%   +0.92%     
==========================================
  Files          13       13              
  Lines        4557     4581      +24     
==========================================
+ Hits         3348     3408      +60     
+ Misses       1209     1173      -36     
Impacted Files Coverage Δ
python/simulation.py 76.78% <87.50%> (+0.32%) ⬆️
python/adjoint/utils.py 93.44% <0.00%> (ø)
python/geom.py 94.15% <0.00%> (+0.17%) ⬆️
python/adjoint/optimization_problem.py 70.08% <0.00%> (+0.38%) ⬆️
python/source.py 95.41% <0.00%> (+3.05%) ⬆️
python/adjoint/objective.py 90.90% <0.00%> (+15.75%) ⬆️

@stevengj (Collaborator) commented Sep 9, 2021

The fact that you have to use init_sim, and that restoring the checkpoint does not work if you run for a nonzero amount of time first, seems to indicate that there is some state we are forgetting about. It would be good to chase this down.

p = sim.get_field_point(mp.Ez, sample_point)
field_points[sim.meep_time()] = p.real

sim.init_sim()
Collaborator:

Removing this line containing sim.init_sim() causes the subtest test_load_dump_fields to fail with a segmentation fault after the previously saved fields are loaded from the HDF5 file and sim.run is invoked.

Here is the output from gdb:

Running test_load_dump_fields
-----------
Initializing structure...
Halving computational cell along direction y
time for choose_chunkdivision = 0.000307183 s
Working in 2D dimensions.
Computational cell is 5 x 5 x 0 with resolution 50
     block, center = (0,0,0)
          size (1,1,1e+20)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (10.24,10.24,10.24)
     block, center = (1,0,0)
          size (1,1,1e+20)
          axes (1,0,0), (0,1,0), (0,0,1)
          dielectric constant epsilon diagonal = (13,13,13)
time for set_epsilon = 0.0874734 s
-----------
run 0 finished at t = 15.0 (1500 timesteps)
creating epsilon from file "/tmp/meepO1ahOS/test_load_dump_fields/structure.h5" (1)...
Dumped structure to file: /tmp/meepO1ahOS/test_load_dump_fields/structure.h5 (True)
creating fields output file "/tmp/meepO1ahOS/test_load_dump_fields/fields.h5" (1)...
Fields State:
  a = 50, dt = 0.01
  m = 0, beta = 0
  t = 1500, phasein_time = 0, is_real = 1

  num_chunks = 6 (shared=1)
Dumped fields to file: /tmp/meepO1ahOS/test_load_dump_fields/fields.h5 (True)
run 1 finished at t = 20.0 (2000 timesteps)
-----------
Initializing structure...
Halving computational cell along direction y
time for choose_chunkdivision = 0.000286404 s
Working in 2D dimensions.
Computational cell is 5 x 5 x 0 with resolution 50
time for set_epsilon = 0.0759436 s
-----------
reading epsilon from file "/tmp/meepO1ahOS/test_load_dump_fields/structure.h5" (1)...
Loaded structure from file: /tmp/meepO1ahOS/test_load_dump_fields/structure.h5 (True)
reading fields from file "/tmp/meepO1ahOS/test_load_dump_fields/fields.h5" (1)...
Fields State:
  a = 50, dt = 0.01
  m = 0, beta = 0
  t = 1500, phasein_time = 0, is_real = 0

  num_chunks = 6 (shared=1)
Loaded fields from file: /tmp/meepO1ahOS/test_load_dump_fields/fields.h5 (True)
on time step 1500 (time=15), 7.26514 s/step

Thread 1 "python3.5" received signal SIGSEGV, Segmentation fault.
0x00007fffc8e6a8e0 in meep::step_curl_stride1 (f=0x55555681f940, c=meep::Bx, g1=0x0, g2=0x0, s1=0, s2=0, gv=..., 
    is=..., ie=..., dtdx=-0.5, dsig=meep::Y, sig=0x5555567505b0, kap=0x5555567503e0, siginv=0x555556750210, 
    fu=0x0, dsigu=meep::NO_DIRECTION, sigu=0x0, kapu=0x0, siginvu=0x0, dt=0.01, cnd=0x0, cndinv=0x0, fcnd=0x0)
    at step_generic_stride1.cpp:192
192	            f[i] = ((kap[k] - sig[k]) * f[i] - dtdx * (g1[i + s1] - g1[i])) * siginv[k];

The two terms in this update equation which are causing the segfault are g1[i + s1] and g1[i]: note that g1=0x0 (a NULL pointer) in the trace. In _load_dump_fields, the g1 array corresponds to the Ez field.

As additional info, the backtrace is:

#0  0x00007fffc8e6a8e0 in meep::step_curl_stride1 (f=0x55555681f940, c=meep::Bx, g1=0x0, g2=0x0, s1=0, s2=0, 
    gv=..., is=..., ie=..., dtdx=-0.5, dsig=meep::Y, sig=0x5555567505b0, kap=0x5555567503e0, 
    siginv=0x555556750210, fu=0x0, dsigu=meep::NO_DIRECTION, sigu=0x0, kapu=0x0, siginvu=0x0, dt=0.01, cnd=0x0, 
    cndinv=0x0, fcnd=0x0) at step_generic_stride1.cpp:192
#1  0x00007fffc8e03647 in meep::fields_chunk::step_db (this=0x5555567c4e10, ft=meep::B_stuff) at step_db.cpp:119
#2  0x00007fffc8e02961 in meep::fields::step_db (this=0x555556757c10, ft=meep::B_stuff) at step_db.cpp:36
#3  0x00007fffc8e0022e in meep::fields::step (this=0x555556757c10) at step.cpp:67
#4  0x00007fffc92ac56c in _wrap_fields_step (args=0x7fffc98d3f98) at meep-python.cxx:82395

@oskooi (Collaborator) commented Sep 22, 2021

I think I know why init_sim() is necessary. Understanding this involves going through the sequence in which Meep allocates the fields arrays. The important point is that the field allocation (i.e., for Ez, which is the cause of the segmentation fault reported above) during init_sim() actually happens when calling add_sources:

self.add_sources()

add_sources in turn calls the member function require_source_components() of the fields class:

meep/python/simulation.py

Lines 2445 to 2450 in f5b60e5

def add_sources(self):
    if self.fields is None:
        self.init_sim() # in case only some processes have IndexedSources
    for s in self.sources:
        self.add_source(s)
    self.fields.require_source_components() # needed by IndexedSource objects

require_source_components then calls the lower-level function _require_component for each of the fields required by the particular source component. In the unit test _load_dump_fields, the source component is Ez (which is actually an electric current Jz) and therefore the components in the 2d cell are Ez, Bx, and By. For μ=1 (non-magnetic materials), Meep uses Hx=Bx and Hy=By to avoid unnecessary duplication. Note that in 2d there are two orthogonal sets of polarizations (Ex,Ey,Hz vs. Hx,Hy,Ez) and typically only one set is used in the simulation.

meep/src/fields.cpp

Lines 495 to 517 in f5b60e5

// allocate fields for components required by any source on any process
// ... this is needed after calling the low-level fields::add_srcdata
void fields::require_source_components() {
  int needed[NUM_FIELD_COMPONENTS];
  memset(needed, 0, sizeof(needed));
  for (int i = 0; i < num_chunks; i++) {
    FOR_FIELD_TYPES(ft) {
      for (const auto& src : chunks[i]->get_sources(ft)) {
        needed[src.c] = 1;
      }
    }
  }
  int allneeded[NUM_FIELD_COMPONENTS];
  am_now_working_on(MpiAllTime);
  or_to_all(needed, allneeded, NUM_FIELD_COMPONENTS);
  finished_working();
  bool aniso2d = is_aniso2d();
  for (int c = 0; c < NUM_FIELD_COMPONENTS; ++c)
    if (allneeded[c])
      _require_component(component(c), aniso2d);
  sync_chunk_connections();
}

Finally, _require_component calls alloc_f to allocate the fields:

meep/src/fields.cpp

Lines 541 to 561 in f5b60e5

void fields::_require_component(component c, bool aniso2d) {
  if (!gv.has_field(c))
    meep::abort("cannot require a %s component in a %s grid", component_name(c), dimension_name(gv.dim));
  components_allocated = true;
  // allocate fields if they haven't been allocated yet for this component
  int need_to_reconnect = 0;
  FOR_COMPONENTS(c_alloc) {
    if (gv.has_field(c_alloc) && (is_like(gv.dim, c, c_alloc) || aniso2d))
      for (int i = 0; i < num_chunks; ++i)
        if (chunks[i]->alloc_f(c_alloc)) need_to_reconnect++;
  }
  if (need_to_reconnect) {
    figure_out_step_plan();
    // we will eventually call sync_chunk_connections(), in either require_component(c)
    // or require_components(), to synchronize this across processes:
    chunk_connections_valid = false;
  }
}

meep/src/fields.cpp

Lines 465 to 493 in f5b60e5

// this function should ordinarily not be called directly;
// instead it should be called via require_component,
// since only require_component knows what other field components
// need to be allocated in addition to c
bool fields_chunk::alloc_f(component c) {
  bool changed = false;
  if (is_mine()) DOCMP {
    if (!f[c][cmp]) {
      changed = true;
      if (is_magnetic(c)) {
        /* initially, we just set H == B ... later on, we lazily allocate
           H fields if needed (if mu != 1 or in PML) in update_eh */
        component bc = direction_component(Bx, component_direction(c));
        if (!f[bc][cmp]) {
          f[bc][cmp] = new realnum[gv.ntot()];
          for (size_t i = 0; i < gv.ntot(); i++)
            f[bc][cmp][i] = 0.0;
        }
        f[c][cmp] = f[bc][cmp];
      }
      else {
        f[c][cmp] = new realnum[gv.ntot()];
        for (size_t i = 0; i < gv.ntot(); i++)
          f[c][cmp][i] = 0.0;
      }
    }
  }
  return changed;
}

What needs to happen, then, when dumping and loading the fields is that only those components generated by the particular source need to be saved and allocated. This means that we should save the source components as part of the checkpoint and use them to allocate the correct set of fields when loading, similar to how the allocation is already set up during init_sim().
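A hypothetical sketch of that approach (names and HDF5 layout are illustrative, not the PR's actual code): at dump time record which components have allocated arrays, and at load time require the same set before reading the raw data back:

// at dump time: record which components are allocated on any process
int needed[NUM_FIELD_COMPONENTS], allocated[NUM_FIELD_COMPONENTS];
for (int c = 0; c < NUM_FIELD_COMPONENTS; ++c) {
  needed[c] = 0;
  for (int i = 0; i < num_chunks; i++)
    if (chunks[i]->is_mine() && chunks[i]->f[c][0]) needed[c] = 1;
}
or_to_all(needed, allocated, NUM_FIELD_COMPONENTS); // as in require_source_components
// ... write 'allocated' into the checkpoint alongside the field data ...

// at load time: after reading 'allocated' back, re-create the same
// arrays (via the require_component machinery above) before reading
// the raw field data into them:
for (int c = 0; c < NUM_FIELD_COMPONENTS; ++c)
  if (allocated[c]) require_component(component(c));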

@stevengj stevengj marked this pull request as ready for review September 23, 2021 19:30
@oskooi oskooi merged commit fdb0d99 into NanoComp:master Sep 23, 2021
mawc2019 pushed a commit to mawc2019/meep that referenced this pull request Nov 3, 2021
* Add support for fully local HDF5 files and shared dumping of meep::structure

* Update python func docs

* Update python API documentation

* Dump/Load of 'fields'

The PR is incomplete (does not include the dft fields) and also has a
lot of debugging statements.

* Save dft chunks

* Remove debug log stmts

* Add saving of time value and also reorg tests.

* Fix dft-chunk saving for the single-parallel-file mode.

* Abort when trying to dump fields with non-null polarization state

* Clean up test_dump_load.py and add test_dump_fails_for_non_null_polarization_state test

* Add new tests/dump_load to set of files to ignore

* Clean up the test to remove unnecessary stuff from the copied over test

* Also dump/load 'f_w_prev'

* Fix typo causing build breakage

* load fields at the very end of init_sim after add_sources

* remove init_sim before loading fields since that now works.
@oskooi (Collaborator) commented Dec 25, 2021

It seems there is a memory leak in both structure::load and fields::load (introduced in #1715 and this PR, respectively) which is triggered by the C++ unit test dump_load.cpp, as reported by the logs of the address sanitizer.
