
Crash when writing parallel compressed chunks #265

Closed
nritsche opened this issue Jan 15, 2021 · 2 comments · Fixed by #1173, #1174 or #1175
Labels
Component - C Library Core C library issues (usually in the src directory) Component - Parallel Parallel HDF5 (NOT thread-safety) Priority - 1. High 🔼 These are important issues that should be resolved in the next release Type - Bug Please report security issues to help@hdfgroup.org instead of creating an issue on GitHub

nritsche commented Jan 15, 2021

Follow up of https://forum.hdfgroup.org/t/crash-when-writing-parallel-compressed-chunks/6186

I found that the following test still fails on the current develop branch:

#include <stdlib.h>
#include <stdio.h>

#include "hdf5.h"


#define _MPI
#define _DSET2
//#define _COMPRESS

#define NPROC 4
#define CHUNK0 4

// Equivalent to original gist
// Works on 1.10.5 with patch, crashes on 1.10.5 vanilla and hangs on 1.10.6
//#define CHUNK1 32768
//#define NCHUNK1 32

// Works on 1.10.5 with and without patch and 1.10.6
//#define CHUNK1 256
//#define NCHUNK1 8192

// Works on 1.10.5 with and without patch and 1.10.6
//#define CHUNK1 512
//#define NCHUNK1 8192

// Crashes on 1.10.5 with and without patch, 1.10.6 and 1.12.0
#define CHUNK1 256
#define NCHUNK1 16384

int main(int argc, char **argv) {

    int mpi_size = 1, mpi_rank = 0;
    hid_t fapl_id, file_id, dset_space, dcpl_id, ds, ds2, propid, mem_dspace, sel_dspace;

    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
#ifdef _MPI
    MPI_Comm comm  = MPI_COMM_WORLD;
    MPI_Info info  = MPI_INFO_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &mpi_size);
    MPI_Comm_rank(comm, &mpi_rank);

    H5Pset_fapl_mpio(fapl_id, comm, info);
#endif

    printf("MPI rank [%i/%i]\n", mpi_rank, mpi_size);

    printf("rank=%i creating file\n", mpi_rank);
    file_id = H5Fcreate("test1.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
    H5Pclose(fapl_id);

    // Define total dataset size
    hsize_t dset_dims[2] = {NPROC * CHUNK0, CHUNK1 * NCHUNK1};
    dset_space = H5Screate_simple (2, dset_dims, NULL);
    dcpl_id = H5Pcreate(H5P_DATASET_CREATE);

    // Set chunking and compression params
    hsize_t chunk_dims[2] = {CHUNK0, CHUNK1};
    H5Pset_chunk(dcpl_id, 2, chunk_dims);

#ifdef _COMPRESS
    H5Pset_deflate(dcpl_id, 9);
#endif


    // Define selection
    hsize_t sel_dims[2] = {CHUNK0, CHUNK1 * NCHUNK1};
    sel_dspace = H5Screate_simple(2, dset_dims, NULL);
    hsize_t offset[2] = {mpi_rank * sel_dims[0], 0};
    printf("rank=%i creating selection [%llu:%llu, %llu:%llu]\n",
           mpi_rank, offset[0], offset[0] + sel_dims[0], offset[1], offset[1] + sel_dims[1]);
    H5Sselect_hyperslab(sel_dspace, H5S_SELECT_SET,
                        offset, NULL, sel_dims, NULL);

    // Set the dspace for the input data
    mem_dspace = H5Screate_simple (2, sel_dims, NULL);

    propid = H5Pcreate(H5P_DATASET_XFER);
#ifdef _MPI
    H5Pset_dxpl_mpio(propid, H5FD_MPIO_COLLECTIVE);
#endif

    // Create the dataset
    printf("rank=%i creating dataset1\n", mpi_rank);
    ds = H5Dcreate (file_id, "dset1", H5T_NATIVE_FLOAT, dset_space, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);

    // Create array of data to write
    // Initialise with random data.
    int totalsize = sel_dims[0] * sel_dims[1];
    float *data = (float *)malloc(sizeof(float) * totalsize);
    for(int i = 0; i < totalsize; i++) {
        data[i] = (float)drand48();
    }

    printf("rank=%i writing dataset1\n", mpi_rank);
    H5Dwrite(ds, H5T_NATIVE_FLOAT, mem_dspace, sel_dspace, propid, data);
    printf("rank=%i finished writing dataset1\n", mpi_rank);

#ifdef _DSET2
    // Create the dataset
    printf("rank=%i creating dataset2\n", mpi_rank);
    ds2 = H5Dcreate (file_id, "dset2", H5T_NATIVE_FLOAT, dset_space, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    
    // Generate new data and write
    printf("rank=%i writing dataset2\n", mpi_rank);
    for(int i = 0; i < totalsize; i++) {
        data[i] = (float)drand48();
    }
    H5Dwrite(ds2, H5T_NATIVE_FLOAT, mem_dspace, sel_dspace, propid, data);
    
    H5Dclose(ds2);
#endif

    // Close down everything
    printf("rank=%i closing everything\n", mpi_rank);
    H5Dclose(ds);
    H5Sclose(dset_space);
    H5Sclose(sel_dspace);
    H5Sclose(mem_dspace);
    H5Pclose(dcpl_id);
    H5Pclose(propid);
    H5Fclose(file_id);

    free(data);

#ifdef _MPI
    MPI_Finalize();
#endif
    return 0;
}
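For reference, the reproducer above can be compiled with the parallel HDF5 compiler wrapper; the source and output file names below are assumptions, not part of the original report:

```shell
# Build the reproducer with the parallel HDF5 wrapper compiler
# (file/target names are placeholders; adjust to your setup)
h5pcc -o chunk_compress chunk_compress.c

# Run on 4 ranks, matching NPROC in the source
mpiexec -n 4 ./chunk_compress
```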
$ /usr/bin/mpiexec -n 4 --mca io romio321 build/bin/chunk_compress
MPI rank [0/4]
rank=0 creating file
MPI rank [1/4]
rank=1 creating file
MPI rank [2/4]
rank=2 creating file
MPI rank [3/4]
rank=3 creating file
rank=2 creating selection [8:12, 0:4194304]
rank=3 creating selection [12:16, 0:4194304]
rank=3 creating dataset1
rank=0 creating selection [0:4, 0:4194304]
rank=0 creating dataset1
rank=1 creating selection [4:8, 0:4194304]
rank=1 creating dataset1
rank=2 creating dataset1
rank=1 writing dataset1
rank=3 writing dataset1
rank=0 writing dataset1
rank=2 writing dataset1
rank=2 finished writing dataset1
rank=2 creating dataset2
rank=1 finished writing dataset1
rank=1 creating dataset2
rank=0 finished writing dataset1
rank=0 creating dataset2
rank=3 finished writing dataset1
rank=3 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 2:
  #000: ../src/H5D.c line 189 in H5Dcreate2(): unable to synchronously create dataset
    major: Dataset
    minor: Unable to create file
  #001: ../src/H5D.c line 137 in H5D__create_api_common(): unable to create dataset
    major: Dataset
    minor: Unable to create file
  #002: ../src/H5VLcallback.c line 1809 in H5VL_dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: ../src/H5VLcallback.c line 1774 in H5VL__dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: ../src/H5VLnative_dataset.c line 73 in H5VL__native_dataset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #005: ../src/H5Dint.c line 396 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #006: ../src/H5L.c line 2359 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #007: ../src/H5L.c line 2601 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #008: ../src/H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #009: ../src/H5Gtraverse.c line 569 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #010: ../src/H5Gobj.c line 1097 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #011: ../src/H5Gobj.c line 306 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #012: ../src/H5Omessage.c line 845 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #013: ../src/H5Oint.c line 1048 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #014: ../src/H5AC.c line 1431 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #015: ../src/H5C.c line 2341 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #016: ../src/H5C.c line 2341 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=2 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 3:
  #000: ../src/H5D.c line 189 in H5Dcreate2(): unable to synchronously create dataset
    major: Dataset
    minor: Unable to create file
  #001: ../src/H5D.c line 137 in H5D__create_api_common(): unable to create dataset
    major: Dataset
    minor: Unable to create file
  #002: ../src/H5VLcallback.c line 1809 in H5VL_dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: ../src/H5VLcallback.c line 1774 in H5VL__dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: ../src/H5VLnative_dataset.c line 73 in H5VL__native_dataset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #005: ../src/H5Dint.c line 396 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #006: ../src/H5L.c line 2359 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #007: ../src/H5L.c line 2601 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #008: ../src/H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #009: ../src/H5Gtraverse.c line 569 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #010: ../src/H5Gobj.c line 1097 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #011: ../src/H5Gobj.c line 306 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #012: ../src/H5Omessage.c line 845 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #013: ../src/H5Oint.c line 1048 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #014: ../src/H5AC.c line 1431 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #015: ../src/H5C.c line 2341 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #016: ../src/H5C.c line 2341 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 3:
  #000: ../src/H5D.c line 1156 in H5Dwrite(): can't synchronously write data
    major: Dataset
    minor: Write failed
  #001: ../src/H5D.c line 1096 in H5D__write_api_common(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 3:
  #000: ../src/H5D.c line 472 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 2:
  #000: ../src/H5D.c line 1156 in H5Dwrite(): can't synchronously write data
    major: Dataset
    minor: Write failed
  #001: ../src/H5D.c line 1096 in H5D__write_api_common(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 2:
  #000: ../src/H5D.c line 472 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything

Note that --mca io ompio gives identical results.

Ubuntu 20.04.1
openmpi 4.0.3

@jhendersonHDF jhendersonHDF self-assigned this Feb 23, 2021
@derobins derobins removed the bug label Mar 3, 2023
@derobins derobins added Priority - 1. High 🔼 These are important issues that should be resolved in the next release Component - C Library Core C library issues (usually in the src directory) Component - Parallel Parallel HDF5 (NOT thread-safety) Type - Bug Please report security issues to help@hdfgroup.org instead of creating an issue on GitHub labels May 4, 2023
@derobins derobins added this to the 1.14.3 milestone Oct 9, 2023
derobins (Member) commented

@nritsche - We're about to release 1.14.3. Is this still a problem in the hdf5_1_14 branch?

jhendersonHDF (Collaborator) commented

Closing this issue, as it is quite old and the original problem that caused this example program to fail has been fixed. However, I believe it exposes a separate issue where the default method of distributed metadata writes can hang when collective metadata writes are not enabled. I will open a separate issue for that after investigating.
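For anyone following along: collective metadata operations are enabled on the file access property list via the standard parallel HDF5 calls `H5Pset_coll_metadata_write` and `H5Pset_all_coll_metadata_ops`. A minimal configuration sketch against the reproducer above (not part of the original report):

```c
/* Sketch: enable collective metadata I/O on the file access
 * property list before H5Fcreate, alongside the existing
 * H5Pset_fapl_mpio call in the reproducer above. */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_coll_metadata_write(fapl_id, 1);   /* collective metadata writes */
H5Pset_all_coll_metadata_ops(fapl_id, 1); /* collective metadata reads  */
```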
