Subfiling VFD #1883

Merged
merged 150 commits into from
Jul 22, 2022


jhendersonHDF (Collaborator) commented Jul 12, 2022

This PR contains the current development branch for the Subfiling VFD feature, outlined in the attached RFC. The feature itself is entirely contained in the H5FDsubfiling directory under src/. The PR also includes several supporting changes, mostly related to VFD testing.

Subfiling uses Mercury's threading utilities to manage a pool of worker threads, so it brings some code from that project along with it.

RFC_VFD_subfiling_200424.pdf

mainzer and others added 30 commits March 31, 2022 13:21
associated test code.  Note that this includes the optimization
that allows shortened sizes and types arrays, giving more space-efficient
representations of vectors in which all entries are
of the same size and/or type.  See the Selection I/O RFC for
further details.
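The shortened sizes/types optimization can be illustrated with a small C sketch. It assumes the convention we understand the vector I/O routines to use: a 0 in the sizes array means "the previous size applies to this entry and every later one", so a vector of equal-sized entries can pass a two-element sizes array. `expand_sizes()` is a hypothetical helper, not part of the HDF5 API; a types array could be handled the same way with a sentinel type value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an HDF5 function): expands a possibly shortened
 * sizes array into one explicit size per vector entry. Assumed convention:
 * a 0 entry means "reuse the previous size for this and every later entry",
 * so the caller may pass an array that ends right after that 0. */
static void expand_sizes(const size_t *sizes, size_t count, size_t *out)
{
    int    extending = 0; /* set once the 0 sentinel has been seen */
    size_t size      = 0;

    assert(count == 0 || sizes[0] != 0); /* the first size must be explicit */
    for (size_t i = 0; i < count; i++) {
        if (!extending) {
            if (sizes[i] == 0)
                extending = 1; /* keep the size from entry i - 1 */
            else
                size = sizes[i];
        }
        out[i] = size; /* sizes[] is never read past the sentinel */
    }
}
```

For example, five 512-byte entries can be described by the two-element array `{512, 0}` instead of five copies of 512.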

Tested serial and parallel, debug and production on Charis;
serial and parallel, debug only on Jelly.
quick serial build and test on jelly
translate to scalar calls.  Fix const buf in H5FD_write_vector().
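Translating vector calls into scalar calls is the fallback for drivers that do not implement vector I/O natively. A minimal sketch of that translation, assuming a hypothetical scalar callback type `scalar_write_fn` and an in-memory stand-in driver `mem_write` (neither is HDF5's actual VFD interface):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar write callback standing in for a driver's write method. */
typedef int (*scalar_write_fn)(void *ctx, uint64_t addr, size_t size, const void *buf);

/* Example scalar "driver": ctx points to an in-memory byte array acting as the file. */
static int mem_write(void *ctx, uint64_t addr, size_t size, const void *buf)
{
    memcpy((unsigned char *)ctx + addr, buf, size);
    return 0;
}

/* Fallback for a driver without a vector callback: issue one scalar write
 * per vector entry, stopping at the first failure. */
static int write_vector_as_scalar(void *ctx, scalar_write_fn write_fn, size_t count,
                                  const uint64_t *addrs, const size_t *sizes,
                                  const void *const *bufs)
{
    assert(write_fn != NULL);
    for (size_t i = 0; i < count; i++)
        if (write_fn(ctx, addrs[i], sizes[i], bufs[i]) < 0)
            return -1;
    return 0;
}
```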
datatype conversion, no I/O filters, no page buffer, not using collective
I/O.  Requires global variable H5_use_selection_io_g be set to TRUE.
Implemented selection-to-vector I/O translation at the file driver
layer.
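Selection-to-vector translation boils down to turning the selected file regions into (address, size) pairs. The sketch below is a deliberately simplified, hypothetical version: it takes the sorted file offsets of the selected elements and coalesces adjacent elements into contiguous vector entries. The real translation also has to pair each file region with the matching memory buffer region, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts the file offsets of selected elements (sorted,
 * strictly increasing, each elem_size bytes long) into vector I/O entries,
 * merging elements that are contiguous on disk. Returns the entry count;
 * addrs/sizes must have room for n_elems entries in the worst case. */
static size_t selection_to_vector(const uint64_t *elem_offsets, size_t n_elems,
                                  size_t elem_size, uint64_t *addrs, size_t *sizes)
{
    size_t n_entries = 0;

    for (size_t i = 0; i < n_elems; i++) {
        assert(i == 0 || elem_offsets[i] > elem_offsets[i - 1]);
        if (n_entries > 0 && addrs[n_entries - 1] + sizes[n_entries - 1] == elem_offsets[i]) {
            sizes[n_entries - 1] += elem_size; /* extends the current run */
        }
        else {
            addrs[n_entries] = elem_offsets[i]; /* starts a new run */
            sizes[n_entries] = elem_size;
            n_entries++;
        }
    }
    return n_entries;
}
```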
I/O translation.  Add const qualifiers to some internal selection I/O
routines to maintain const-correctness while avoiding memcpys.
test code (see testpar/t_vfd.c).

Note that this implementation does NOT support vector entries of
size greater than 2 GB.  This must be repaired before release,
but it should be good enough for correctness testing.

As MPIO requires vector I/O requests to be sorted in increasing
address order, also added a vector sort utility in H5FDint.c.  This
function is tested in passing by the MPIO vector I/O extension.
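A sketch of such a sort, using a hypothetical `vec_entry_t` that bundles each address with its size and buffer so that sorting by address keeps the three in step. This illustrates the idea only; it is not the code in H5FDint.c. Note that `qsort()` is not stable, so entries with equal addresses may be reordered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical vector entry: address, size and buffer travel together. */
typedef struct {
    uint64_t    addr;
    size_t      size;
    const void *buf;
} vec_entry_t;

/* Returns nonzero if the entries are already in non-decreasing address order. */
static int vector_is_sorted(const vec_entry_t *entries, size_t count)
{
    for (size_t i = 1; i < count; i++)
        if (entries[i].addr < entries[i - 1].addr)
            return 0;
    return 1;
}

static int cmp_by_addr(const void *a, const void *b)
{
    uint64_t x = ((const vec_entry_t *)a)->addr;
    uint64_t y = ((const vec_entry_t *)b)->addr;
    return (x > y) - (x < y); /* avoids the overflow that x - y could cause */
}

/* Sorts the vector into increasing address order, skipping the sort when
 * the requests already arrive sorted. */
static void sort_vector(vec_entry_t *entries, size_t count)
{
    if (!vector_is_sorted(entries, count))
        qsort(entries, count, sizeof *entries, cmp_by_addr);
    assert(vector_is_sorted(entries, count));
}
```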

In passing, repaired a bug in size / type vector extension management
in H5FD_read/write_vector()

Tested parallel debug and production on charis and Jelly.
HDF5_USE_SELECTION_IO env var to control selection I/O (default off).
Merged branch 'develop' into selection_io
Updated the branch with develop changes.
needed, to cut down on memory usage during I/O.
* Initial checkin of merged sub-filing VFD.

Passes regression tests (debug/shared/parallel) on Jelly.
However, bugs and many compiler warnings remain -- not suitable
for merge to develop.

* Minor mods to src/H5FDsubfile_mpi.c to address errors reported by autogen.sh

* Code formatting run -- no test

* Merged my subfiling code fixes into the new selection_io_branch

* Forgot to add the FindMERCURY.cmake file. This will probably disappear soon

* Attempt to make subfile opens more reliable so that they don't return errors. For some unknown reason, the regular POSIX open occasionally fails to create a subfile. Also added better error handling for file close.

* added NULL option for H5FD_subfiling_config_t in H5Pset_fapl_subfiling (#1034)

* NULL option automatically stacks IOC VFD for subfiling and returns a valid fapl.
* added doxygen subfiling APIs

* Various fixes which allow the IOR benchmark to run correctly

* Lots of updates including the packaging up of the mercury_util source files to enable easier builds for our Benchmarking

* Interim checkin of selection_io_with_subfiling_vfd branch

    Modified testpar/t_vfd.c to test the subfiling VFD with the default configuration.
    This code must be updated to run with a variety of configurations -- most particularly
    multiple I/O concentrators, and a stripe depth small enough to exercise the other I/O
    concentrators.

    testpar/t_vfd.c exposed a large number of race conditions -- symptoms included:

      1) Crashes (usually seg faults)

      2) Heap corruption

      3) Stack corruption

      4) Double frees of heap space

      5) Hangs

      6) Out of order execution of I/O requests / violations of POSIX semantics

      7) Swapped write requests

        Items 1 - 4 turned out to be primarily caused by file close issues --
    specifically, the main I/O concentrator thread and its pool of worker threads
    were not being shut down properly on file close.  Fixing this, together with
    some other minor fixes, seems to have resolved these issues.

        Items 5 & 6 appear to have been caused by I/O requests being issued to the
    thread pool in an order that did not maintain POSIX semantics.  A rewrite of
    the I/O request dispatch code appears to have solved these issues.

        Item 7 seems to have been caused by multiple write requests from a given
    rank being read by the wrong worker thread.  Code to issue "unique" tags for
    each write request via the ACK message appears to have cleaned this up.

        Note that the code is still in poor condition.  A partial list of known
    defects includes:

     a) Race condition on file close that allows superblock writes to arrive
        at the I/O concentrator after it has been shut down.  This defect is
        most evident when testpar/t_subfiling_vfd is run with 8 ranks.

     b) No error reporting from I/O concentrators -- must design and implement
        this.  For now, mostly just asserts, which suggests that it should be
        run in debug mode.

     c) Much commented out and/or un-used code.

     d) Code organization.

     e) Build system with bits of Mercury is awkward -- think of shifting
        to pthreads with our own thread pool code.

     f) Need to add native support for vector and selection I/O to the subfiling
        VFD.

     g) Need to review, and possibly rework, configuration code.

     h) Need to store subfile configuration data in a superblock extension message,
        and add code to use this data on file open.

     i) Test code is inadequate -- expect more issues as it is extended.

        In particular, there is no unit test code for the I/O request dispatch code.
        While I think it is correct at present, we need test code to verify this.

        Similarly, we need to test with multiple I/O concentrators and much smaller
        stripe depth.

    My actual code changes were limited to:

          src/H5FDioc.c
          src/H5FDioc_threads.c
          src/H5FDsubfile_int.c
          src/H5FDsubfile_mpi.c
          src/H5FDsubfiling.c
          src/H5FDsubfiling.h
          src/H5FDsubfiling_priv.h
          testpar/t_subfiling_vfd.c
          testpar/t_vfd.c

    I'm not sure what is going on with the deletions in src/mercury/src/util.

    Tested parallel/debug on Charis and Jelly

* subfiling with selection IO (#1219)

Merged branch 'selection_io' into subfiling branch.

* Subfile name fixes (#1250)

* fixed subfiling naming convention, and added leading zero to rank names.

* Merge branch 'selection_io' into selection_io_with_subfiling_vfd (#1265)

* Added script to join subfiles into a single HDF5 file (#1350)

* Modified H5FD__subfiling_query() to report that the sub-filing VFD supports MPI
This exposed issues with truncate and get EOF in the sub-filing VFD.
I believe I have addressed these issues (get EOF is not as fully tested as it should be); however,
the change also exposed race conditions resulting in hangs.  As of this writing, I have not been able
to chase these down.

Note that the tests that expose these race conditions are in testpar/t_subfiling_vfd.c, and
are currently skipped.  Unskip these tests to reproduce the race conditions.

tested (to the extent possible) debug/parallel on charis and jelly.

* Committing clang-format changes

* fixed H5MM_free

Co-authored-by: mainzer <mainzer#hdfgroup.org>
Co-authored-by: jrmainzer <72230804+jrmainzer@users.noreply.github.com>
Co-authored-by: Richard Warren <Richard.Warren@hdfgroup.org>
Co-authored-by: Richard.Warren <richard.warren@jelly.ad.hdfgroup.org>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
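Items 5 and 6 in the "Interim checkin" commit message above trace hangs and POSIX-semantics violations to the order in which requests were handed to the thread pool. One common rule for this kind of dispatcher (an assumption here, not necessarily what the rewritten dispatch code does) is that a new request may run concurrently only if it does not overlap, byte for byte, any earlier pending request where at least one of the two is a write. A hypothetical sketch of that check; `ioc_req_t` and the helper names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request descriptor: a byte range in one subfile plus direction. */
typedef struct {
    uint64_t offset;
    uint64_t len;
    bool     is_write;
} ioc_req_t;

/* True if the half-open ranges [offset, offset + len) share at least one byte. */
static bool ranges_overlap(const ioc_req_t *a, const ioc_req_t *b)
{
    return a->offset < b->offset + b->len && b->offset < a->offset + a->len;
}

/* Two requests must run in arrival order if they touch overlapping bytes and
 * at least one of them writes; concurrent reads of the same bytes are harmless. */
static bool must_serialize(const ioc_req_t *a, const ioc_req_t *b)
{
    return (a->is_write || b->is_write) && ranges_overlap(a, b);
}

/* A newly received request may go to the worker pool only if it conflicts
 * with none of the earlier requests that are still queued or in flight. */
static bool can_dispatch(const ioc_req_t *req, const ioc_req_t *pending, size_t n_pending)
{
    assert(req->len > 0);
    for (size_t i = 0; i < n_pending; i++)
        if (must_serialize(req, &pending[i]))
            return false;
    return true;
}
```

A request that fails the check would wait until the conflicting earlier requests complete, which keeps overlapping reads and writes in the order they arrived.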
Merge branch 'develop' into feature/subfiling
Merge branch 'develop' into feature/subfiling
Merge branch 'develop' into feature/subfiling
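Item 7 in the "Interim checkin" commit message was fixed by handing out "unique" tags for write requests through the ACK message. A minimal sketch of how such tags could be generated, assuming a hypothetical reserved tag range and a per-source-rank counter kept by the I/O concentrator. The client would send its write payload with the tag from its ACK, and the worker would receive on exactly that (source, tag) pair, so two back-to-back writes from one rank cannot be matched to the wrong worker. The range stays below 32767, the smallest value of `MPI_TAG_UB` the MPI standard allows.

```c
#include <assert.h>

/* Hypothetical tag layout: [BASE, BASE + SPAN) is reserved for write payloads. */
#define WRITE_DATA_TAG_BASE 1000
#define WRITE_DATA_TAG_SPAN 30000

/* Returns the tag to place in the ACK for the next write request from one
 * source rank; *counter is that rank's state, kept by the I/O concentrator.
 * Tags repeat only after SPAN requests, so outstanding writes from one rank
 * get distinct tags as long as fewer than SPAN of them are in flight. */
static int next_write_tag(unsigned *counter)
{
    assert(*counter < WRITE_DATA_TAG_SPAN);
    int tag  = WRITE_DATA_TAG_BASE + (int)*counter;
    *counter = (*counter + 1) % WRITE_DATA_TAG_SPAN;
    return tag;
}
```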
@@ -237,6 +237,8 @@ set (HDF5_JAVA_LOGGING_NOP_JAR ${HDF5_SOURCE_DIR}/java/lib/ext/slf4j-nop-1.7
set (HDF5_JAVA_LOGGING_SIMPLE_JAR ${HDF5_SOURCE_DIR}/java/lib/ext/slf4j-simple-1.7.33.jar)
set (HDF5_DOXYGEN_DIR ${HDF5_SOURCE_DIR}/doxygen)

set (HDF5_SRC_INCLUDE_DIRS ${HDF5_SRC_DIR})
jhendersonHDF (Collaborator Author) commented Jul 13, 2022

Since the subfiling VFD is composed of several files, I tried to contain the feature in its own subdirectory under src/. CMake then needed to be adjusted so that wherever ${HDF5_SRC_DIR} is added to a target's include directories, the subfiling subdirectory is added as well. This makes the public H5FDsubfiling.h and H5FDioc.h headers available to anything that includes hdf5.h.

Going forward, new subdirectories under src/ can be added to this HDF5_SRC_INCLUDE_DIRS variable and should work seamlessly across the library's components (C++, Java, etc.).

}

int
main(int argc, char **argv)
Collaborator Author

Unfortunately, the current regression testing for subfiling mostly consists of a simple file create and close test in this file, as well as some selection I/O testing in testpar/t_vfd.c and the serial HDF5 tests that can be run with make check-vfd or ctest when the CMake VFD testing option is enabled. The plan is to flesh out more specific regression testing for subfiling, but development work on the feature has been the highest priority so far and the feature has mostly been tested with external applications like IOR and HACC.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

/*
* Header file for shared code between the HDF5 Subfiling VFD and IOC VFD
Collaborator Author

Originally, the subfiling and I/O concentrator VFDs were tightly coupled, with most of their routines marked extern and shared directly between source files. To make the feature more modular and allow subfiling to use different I/O concentrator backends, the VFDs were separated from each other and common subfiling functionality was moved into this "module", so that anyone can write their own I/O concentrator backend and just include this header.
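The modular split described above amounts to the shared subfiling code calling an I/O concentrator only through a table of operations. A hypothetical sketch of that idea; `ioc_backend_ops_t`, the toy in-memory backend, and `subfiling_dispatch_write()` are illustrative names, not the interface declared in the PR's header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operations table an I/O concentrator backend could provide. */
typedef struct {
    const char *name;
    int (*write_vector)(void *state, int subfile, size_t count, const uint64_t *offsets,
                        const size_t *sizes, const void *const *bufs);
} ioc_backend_ops_t;

/* Toy backend: each subfile is a fixed-size in-memory buffer. */
#define TOY_SUBFILES      2
#define TOY_SUBFILE_BYTES 64
typedef struct {
    unsigned char data[TOY_SUBFILES][TOY_SUBFILE_BYTES];
} toy_state_t;

static int toy_write_vector(void *state, int subfile, size_t count, const uint64_t *offsets,
                            const size_t *sizes, const void *const *bufs)
{
    toy_state_t *s = state;

    if (subfile < 0 || subfile >= TOY_SUBFILES)
        return -1;
    for (size_t i = 0; i < count; i++) {
        if (offsets[i] + sizes[i] > TOY_SUBFILE_BYTES)
            return -1; /* would run past the end of the subfile buffer */
        memcpy(s->data[subfile] + offsets[i], bufs[i], sizes[i]);
    }
    return 0;
}

static const ioc_backend_ops_t toy_backend = {"toy", toy_write_vector};

/* Shared subfiling code only calls through the table, so switching backends
 * means passing a different ops pointer. */
static int subfiling_dispatch_write(const ioc_backend_ops_t *ops, void *state, int subfile,
                                    size_t count, const uint64_t *offsets, const size_t *sizes,
                                    const void *const *bufs)
{
    assert(ops != NULL && ops->write_vector != NULL);
    return ops->write_vector(state, subfile, count, offsets, sizes, bufs);
}
```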

@@ -0,0 +1,1778 @@
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Collaborator Author

This VFD is the reference I/O concentrator implementation. It works by creating a "main" thread on each MPI rank that is designated as an I/O concentrator rank, as well as a thread pool with a configurable number of worker threads on each of those ranks. "Main" threads will await incoming MPI messages containing I/O requests and will farm those out to the worker threads. The I/O concentrator ranks currently use the sec2 VFD for the underlying subfile, but it should be fairly easy to use other VFDs for file I/O.
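The main-thread/worker-pool structure can be sketched with a bounded request queue. This sketch swaps in plain POSIX threads for the Mercury thread-pool utilities the PR actually uses, and the "I/O" is only a flag; `io_req_t`, the queue, and `ioc_serve()` are hypothetical. It also shows an orderly shutdown (wake, drain, join), the step whose absence on file close caused the crashes described in the commit history.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP   64
#define MAX_WORKERS 16

/* Hypothetical request; a real one would also carry a buffer, source rank, tag. */
typedef struct {
    long offset;
    long size;
    int  done; /* set by the worker that services the request */
} io_req_t;

/* Bounded FIFO shared by the "main" thread (producer) and the workers. */
typedef struct {
    io_req_t       *items[QUEUE_CAP];
    size_t          head, tail, count;
    int             shutdown;
    pthread_mutex_t mtx;
    pthread_cond_t  not_empty, not_full;
} req_queue_t;

static void *worker_main(void *arg)
{
    req_queue_t *q = arg;

    for (;;) {
        pthread_mutex_lock(&q->mtx);
        while (q->count == 0 && !q->shutdown)
            pthread_cond_wait(&q->not_empty, &q->mtx);
        if (q->count == 0) { /* queue drained and shutdown requested */
            pthread_mutex_unlock(&q->mtx);
            return NULL;
        }
        io_req_t *req = q->items[q->head];
        q->head       = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->mtx);

        req->done = 1; /* stand-in for the actual subfile read or write */
    }
}

/* Plays the role of the IOC "main" thread: queues each incoming request, then
 * requests shutdown and joins every worker so none outlives the queue. */
static int ioc_serve(io_req_t *reqs, size_t n_reqs, int n_workers)
{
    req_queue_t q = {0};
    pthread_t   workers[MAX_WORKERS];
    int         started = 0;

    assert(n_workers > 0 && n_workers <= MAX_WORKERS);
    pthread_mutex_init(&q.mtx, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);

    for (; started < n_workers; started++)
        if (pthread_create(&workers[started], NULL, worker_main, &q) != 0)
            break;

    if (started > 0) { /* with no workers, nothing would drain the queue */
        for (size_t i = 0; i < n_reqs; i++) {
            pthread_mutex_lock(&q.mtx);
            while (q.count == QUEUE_CAP)
                pthread_cond_wait(&q.not_full, &q.mtx);
            q.items[q.tail] = &reqs[i];
            q.tail          = (q.tail + 1) % QUEUE_CAP;
            q.count++;
            pthread_cond_signal(&q.not_empty);
            pthread_mutex_unlock(&q.mtx);
        }
    }

    /* Orderly shutdown: wake idle workers, let them drain the queue, then join. */
    pthread_mutex_lock(&q.mtx);
    q.shutdown = 1;
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.mtx);
    for (int i = 0; i < started; i++)
        pthread_join(workers[i], NULL);

    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.mtx);
    return started > 0 ? 0 : -1;
}
```

This pool hands requests out in arrival order but may run overlapping requests concurrently; preserving POSIX semantics needs an additional ordering check before dispatch.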

@@ -0,0 +1,3264 @@
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Collaborator Author

This is the Subfiling VFD, whose main purpose is to break I/O requests down into vector I/O requests according to how they fall across the different subfiles that the logical HDF5 file is striped across. It then sends the vector I/O requests down to the underlying I/O concentrator, which handles the details of writing each portion to its subfile.
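The request breakdown can be sketched as follows, assuming simple round-robin striping with a fixed stripe size: logical stripe k lives in subfile k % n_subfiles, at row k / n_subfiles within that subfile. The layout the VFD actually uses is described in the RFC and may differ; `map_request()` and `stripe_piece_t` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a logical request, targeted at a single subfile. */
typedef struct {
    int      subfile;    /* which subfile receives this piece */
    uint64_t sf_offset;  /* offset within that subfile */
    uint64_t len;        /* bytes in this piece */
    uint64_t buf_offset; /* where the piece starts in the caller's buffer */
} stripe_piece_t;

/* Splits the logical range [addr, addr + len) into per-subfile pieces under
 * the round-robin striping assumption. Returns the number of pieces written
 * to out (at most max_pieces; len / stripe_size + 2 is always enough). */
static size_t map_request(uint64_t addr, uint64_t len, uint64_t stripe_size, int n_subfiles,
                          stripe_piece_t *out, size_t max_pieces)
{
    size_t   n    = 0;
    uint64_t done = 0;

    assert(stripe_size > 0 && n_subfiles > 0);
    while (done < len && n < max_pieces) {
        uint64_t cur       = addr + done;
        uint64_t stripe    = cur / stripe_size;
        uint64_t in_stripe = cur % stripe_size;
        uint64_t chunk     = stripe_size - in_stripe; /* bytes left in this stripe */
        if (chunk > len - done)
            chunk = len - done;

        out[n].subfile    = (int)(stripe % (uint64_t)n_subfiles);
        out[n].sf_offset  = (stripe / (uint64_t)n_subfiles) * stripe_size + in_stripe;
        out[n].len        = chunk;
        out[n].buf_offset = done;
        n++;
        done += chunk;
    }
    return n;
}
```

Pieces that target the same subfile can then be collected into a single vector I/O request for the I/O concentrator.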

@@ -0,0 +1,89 @@
#!/bin/bash
Contributor

Isn't this the post configuration version of the .in file?

Collaborator Author

Ah, yes. Its check-in might have predated the creation of the .in file. Will remove.

@derobins derobins merged commit 27bb358 into develop Jul 22, 2022

5 participants