Fix Subfile Read Issue #62

Open
wants to merge 22 commits into
base: master

Conversation

Collaborator

@yzanhua yzanhua commented Mar 6, 2023

Several issues currently exist in the subfile read feature.

Running the following E3SM-IO benchmark with any commit newer than e3aef7c results in several errors:

# e3sm-io commands, this enables subfile feature
./e3sm_io -g 1 -k -a hdf5_log -x log -o ./test_output/hdf5_log_log_map_i_case_16p.h5 path-to-datasets/map_i_case_16p.h5
  1. Using HDF5 1.14.0 with build-mode=production:
    1. H5VLfile_close triggers an error in which an H5CX_xx call complains about invalid IDs. This issue is related to the H5VL_logi_reset_lib_stat calls. Adding only one pair of H5VL_logi_reset_lib_stat calls around H5VL_log_filei_flush is not enough. Two pairs are needed inside H5VL_log_filei_flush, one for H5VL_log_nb_flush_write_reqs and one for H5VL_log_nb_flush_read_reqs.
    2. After fixing the above issue, another error occurs inside H5VL_log_filei_open_subfile. It complains about an invalid VOL id and is probably also related to the lib_stat calls.
    3. The subfile name used in H5VL_log_filei_open_subfile was not correct. This issue is fixed.
    4. Log VOL release 1.4.0 does not show any of these issues, because each process performed the read/write flush only if the request size was > 0. The above e3sm-io command results in a read request size of 0, so no read logic is executed and no error is triggered. Since commit e3aef7c, the "request size > 0" check is no longer performed.
  2. Using HDF5 1.14.0 with build-mode=debug:
    1. An assertion failure occurs inside H5VL_log_filei_create_subfile. The HDF5 issue "VOL: Cannot Create and Write an Attribute at File Create Time" (hdf5#2220) appears to be similar. We already followed the advice given there and moved everything to the post-open callback, but we still see the issue.
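
Item 1.4 explains why release 1.4.0 never hit these errors. The sketch below illustrates that masking effect in plain C. The names `flush_read_reqs`, `flush_guarded` and `flush_unguarded` are hypothetical stand-ins for illustration only, not Log VOL functions, and the counter merely records whether the read-flush path was entered.

```c
#include <stddef.h>

/* Hypothetical stand-in for the read-flush path; counts how often it runs. */
static int read_flush_calls = 0;

static void flush_read_reqs(size_t nreq) {
    (void)nreq;           /* a real implementation would issue the reads here */
    read_flush_calls++;
}

/* Behaviour described for release 1.4.0: skip the flush when nothing is queued. */
static void flush_guarded(size_t nreq) {
    if (nreq > 0) flush_read_reqs(nreq);
}

/* Behaviour described after e3aef7c: the flush path always runs. */
static void flush_unguarded(size_t nreq) {
    flush_read_reqs(nreq);
}
```

With zero queued read requests, `flush_guarded(0)` never enters the read path, so a bug inside it stays hidden, while `flush_unguarded(0)` does enter it.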

Currently the subfile read feature is disabled and using HDF5 1.14.0 production mode should not give errors.

@yzanhua added the bug and WIP labels Mar 6, 2023
@wkliao
Collaborator

wkliao commented Mar 6, 2023

We need subfiling test programs to test this PR.

@yzanhua removed the WIP label May 15, 2023
yzanhua and others added 17 commits May 30, 2023 11:40
When opening a subfile, fp->nldset and fp->nmdset were not read from
the subfile. Instead, they were read from the master file. This commit
fixes the issue so that both are read from the subfile.

It is possible that, for example, a file is created with 8 subfiles,
but only 4 processes are used when opening and reading the file.

In the original implementation before this fix, the info of 8 subfiles
is not saved. Only the first fp->ngroup subfiles will be opened for
read, where fp->ngroup is a number bounded by the number of processes
(i.e. <= 4 in this case).

In this fix, we use fp->nsubfiles to store the number of subfiles for an
opened file. All fp->nsubfiles subfiles will be opened for read.

MPI_File_set_view is a collective call. Before this fix, not all
processes called this function during dataset read, which could cause
hangs. A subfile read test case may trigger this issue more easily.
This commit fixes the issue.
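
The hang described above can be illustrated with a toy model in plain C that uses no MPI at all. It assumes, purely for illustration, that before the fix a rank skipped the collective call when it had nothing to read; the commit message does not say which ranks skipped it. All function names here are hypothetical.

```c
#include <stddef.h>

/* Toy model: a collective call completes only when every rank enters it. */
static int collective_completes(const int entered[], size_t nproc) {
    for (size_t r = 0; r < nproc; r++)
        if (!entered[r]) return 0;   /* one missing rank stalls all others */
    return 1;
}

/* Assumed pre-fix behaviour: a rank enters only if it has requests. */
static void enter_if_needed(const size_t nreq[], int entered[], size_t nproc) {
    for (size_t r = 0; r < nproc; r++) entered[r] = nreq[r] > 0;
}

/* Post-fix behaviour: every rank enters, even with zero requests. */
static void enter_always(int entered[], size_t nproc) {
    for (size_t r = 0; r < nproc; r++) entered[r] = 1;
}
```

In this model, four ranks with request counts {2, 0, 1, 0} fail to complete the collective under `enter_if_needed` and complete it under `enter_always`.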

Test the following:
1. nsubfile > nproc
2. nsubfile == nproc
3. nsubfile < nproc

For each of the above, test:
1. The read pattern is the same as the write pattern (row-wise).
2. The read pattern is row-wise, but each process reads a different row
than the one it wrote (it reads from a subfile it is not responsible for).
3. The read pattern is column-wise (reads from several subfiles).
4. Read the entire dataset.

This gives a total of 12 scenarios.

Each scenario is also run with 1 to 12 processes. This makes sure
Log VOL also works with an odd number of processes.
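
The four read patterns can be sketched as start/count selections. This sketch assumes a square nproc x nproc dataset in which rank r wrote row r; the actual test program's dataset shape and the names `read_pattern`, `selection` and `read_selection` are assumptions made for illustration.

```c
#include <stddef.h>

typedef enum { READ_SAME_ROW, READ_OTHER_ROW, READ_COLUMN, READ_ALL } read_pattern;

/* Start offset and extent of a 2-D selection, as {row, column}. */
typedef struct { size_t start[2]; size_t count[2]; } selection;

static selection read_selection(read_pattern p, size_t rank, size_t nproc) {
    selection s = {{0, 0}, {nproc, nproc}};   /* default: the whole dataset */
    switch (p) {
    case READ_SAME_ROW:    /* the row this rank wrote */
        s.start[0] = rank; s.count[0] = 1; break;
    case READ_OTHER_ROW:   /* the next rank's row, wrapping around */
        s.start[0] = (rank + 1) % nproc; s.count[0] = 1; break;
    case READ_COLUMN:      /* one column, crossing every row */
        s.start[1] = rank; s.count[1] = 1; break;
    case READ_ALL:
        break;
    }
    return s;
}
```

Values of this shape are what one would pass as the start and count arguments of H5Sselect_hyperslab. Note that with a single process, `READ_OTHER_ROW` degenerates to the rank's own row.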
@yzanhua force-pushed the subfile-read branch 3 times, most recently from f2020f9 to c15c360 June 5, 2023 11:52