Support HDF5 in tensorflow-io #174
Comments
/cc @veritas9872 |
Hey @yongtang, @veritas9872 is this issue still open? Can I work on this? |
@captain-pool Definitely! Let me know if you need any help. 👍 |
Thanks @yongtang. Can you point me to some resources on writing Bazel BUILD files? I don't know much about writing BUILD files. |
@captain-pool Bazel has a pretty steep learning curve... The Dataset implementation pattern is also not very straightforward. I have been trying to simplify the pattern for adding new ops to tensorflow-io. At the moment I think TextDataset is the easiest pattern to follow. The BUILD file is also easier to understand. In TextDataset, you only need to look at one function [...]. That is pretty much all you need to implement if you want to add a new Dataset op. The BUILD file for TextDataset is also simple enough, I think. |
Thanks for your response @yongtang, I've been looking through this. I'm downloading the source code of HDF5 and building it. |
Hey @yongtang, I did some digging and found that these Bazel BUILD files for HDF5 are being used by [...]. I think reusing code from these BUILD files and the WORKSPACE file would be good enough for this job. |
@captain-pool Yes, let's just reuse the existing one if there is one. I think |
@captain-pool @yongtang I found some documentation from PyTables, which also uses the HDF5 library and has its own set of optimizations. |
@veritas9872 With PR #236 merged in, the HDF5 support is almost done (partially supported with a couple of data types). I will take a look and see if #217 or a new PR could get the HDF5 support done. |
@yongtang Hello. I was curious to know whether the current implementation of HDF5 for TF is compatible with common features of HDF5, such as compression filters, checksums, and chunking. I am also curious about whether multi-processing or multi-threading will be implemented. For example, I have found that chunking data for each slice makes reading data 5x faster, as unnecessary data is not read in. I am not familiar with how #236 or #217 have been implemented, and I was wondering whether the implementation irons out the complexities and optimizes HDF5 for people unfamiliar with the file system. |
@veritas9872 I added another PR #266 which fixes a few issues. The HDF5 support in TF-IO is implemented through the tf.data pipeline, so the biggest advantage is that you can easily provide data to tf.keras for training and inference purposes (with a few [...]). The HDF5 implementation is based on HDF5's C++ library, so I would assume checksums and chunking are already in place. Having said that, the HDF5 format itself has a big scope, and some of its features and data types are not really compatible with TensorFlow's tensor types. If you have a few sample files you work on, it might be easier for us to check those files and make sure they are compatible with the implementation in TensorFlow. |
Here's an h5 file (90 MB) containing a sequence of MRI images (spatial and frequency domain) with a couple different data types: https://drive.google.com/file/d/1OBsTnmS2KX3GcJumRD3w0Yj_DHh_CbnG/view?usp=sharing |
Thanks @alexwal. I took a look at the sample file and think The It is not difficult to convert |
@yongtang By coincidence, I was working on the same dataset (the fastMRI dataset) when I requested this feature. That is also why I asked about compression filters and chunking (this accelerates data reading a lot). I think that h5py is a good reference for the Python API since it is the go-to library for HDF5 files. It supports the complex number type and it is the library used to create that particular dataset. In fact, most HDF5 files will have been created with h5py (or maybe PyTables). I don't know anyone who uses the raw C API. So I think it would be a good idea to support the features in h5py. MATLAB also has an HDF5 API (see here for details). However, it is much more limited than h5py, and most people just use the automatic storage as .mat files. So if the features in h5py are supported, all features in MATLAB will also be supported. |
@veritas9872 @alexwal sorry for the late reply, I have been trying to get TF 2.0's Dataset in place since last week. I will take a look at the HDF5 support and get back to you soon. |
I ran into a problem when reading HDF5 files compressed with gzip: HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0: |
@CaptainDuke is this related to TF-IO? |
@alexwal |
@CaptainDuke May I ask if there is also a problem with shuffle filtering? I use gzip level 1 with shuffle filter all the time because it has the best compression for floating point numbers and complex numbers. I am curious to know whether shuffle filters function in TF-IO as well. |
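For readers unfamiliar with the shuffle filter mentioned in the comment above, here is a minimal sketch of the idea using only the Python standard library. It is not HDF5's or TF-IO's implementation: the helper names `shuffle_bytes` and `unshuffle_bytes` are hypothetical, and the sample data is made up for illustration. The filter regroups the bytes of fixed-size elements so that all first bytes come first, then all second bytes, and so on, which often helps a compressor such as gzip on floating point data.

```python
import struct
import zlib


def shuffle_bytes(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(raw) % itemsize != 0:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def unshuffle_bytes(shuffled: bytes, itemsize: int) -> bytes:
    """Invert shuffle_bytes, restoring the original element layout."""
    if len(shuffled) % itemsize != 0:
        raise ValueError("buffer length must be a multiple of itemsize")
    count = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        # Plane k holds the k-th byte of every element.
        out[k::itemsize] = shuffled[k * count:(k + 1) * count]
    return bytes(out)


# Made-up sample: 1000 slowly varying float32 values, little-endian.
values = [1.0 + 0.001 * n for n in range(1000)]
raw = struct.pack("<1000f", *values)

shuffled = shuffle_bytes(raw, 4)
restored = unshuffle_bytes(shuffled, 4)

# Compare gzip-style (zlib, level 1) sizes with and without shuffling.
print("plain:", len(zlib.compress(raw, 1)))
print("shuffled:", len(zlib.compress(shuffled, 1)))
print("round trip ok:", restored == raw)
```

The round trip is lossless by construction. Whether shuffling actually shrinks the compressed size depends on the data, so the sketch prints both sizes rather than assuming a result.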
Hi, I also wanted to work on the fastMRI dataset, and so far I was using Sequences. I now want to switch to tf datasets, and was wondering if someone had an example of a tf dataset working on HDF5 files. Indeed, when using [...] I get the following error: |
I hit a similar issue to the one @zaccharieramzi reported.
|
@zaccharieramzi @aspratyush Do you have a sample file I could take a look at? |
@aspratyush Added PR #681 which covers all common types (including |
With a file from the fastMRI database (I cannot attach it as it's too big even zipped), when I run the following code (with the latest [...]):
import tensorflow_io as tfio
f = 'file1000002.h5'
tfio.IODataset.from_hdf5(f, 'reconstruction_esc')
I can send you the file by email if you want. (Sorry for the late reply, I have been focusing on other projects.) |
@zaccharieramzi please send it to me by email or share a download link. You can find my email in the git logs. |
@zaccharieramzi I could not reproduce the segfault issue; I suspect it is related to a version mismatch between tf and tfio. Are you using tensorflow-io-nightly with TF 2.0? On another note, the complex data type was not supported in tensorflow-io. Supporting the complex type is a little tricky as there is no native complex type in HDF5, only the commonly used H5T_COMPOUND type (with 'r' and [...]). I have added support for complex64 and complex128 in PR #704 with 'r' and [...]. |
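To illustrate the point about complex numbers being stored as a compound type, here is a minimal standard-library sketch of how a buffer of two-field records can be decoded into complex values. It does not show TF-IO's C++ code from PR #704. The helper names are hypothetical, the records are assumed to be pairs of little-endian float32 values, and the imaginary field name `'i'` is an assumption, since the comment above is truncated after `'r'`.

```python
import struct
from typing import Iterable, List

# Assumed record layout: one float32 per field, little-endian.
# 'r' comes from the thread; 'i' is an assumption for the imaginary part.
FIELDS = ("r", "i")
RECORD = struct.Struct("<ff")


def encode_compound_complex(values: Iterable[complex]) -> bytes:
    """Pack complex values as consecutive (r, i) float32 records."""
    return b"".join(RECORD.pack(v.real, v.imag) for v in values)


def decode_compound_complex(buf: bytes) -> List[complex]:
    """Unpack consecutive (r, i) float32 records into complex values."""
    if len(buf) % RECORD.size != 0:
        raise ValueError("buffer is not a whole number of records")
    return [complex(r, i) for r, i in RECORD.iter_unpack(buf)]


# Values chosen to be exactly representable in float32.
samples = [complex(1.5, -2.0), complex(0.0, 0.25)]
buf = encode_compound_complex(samples)
decoded = decode_compound_complex(buf)
print(len(buf), decoded)
```

A complex64 tensor has the same per-element footprint as this record (two float32 values), which is why mapping such a compound type onto a complex dtype is possible once the field names are agreed on.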
@zaccharieramzi PR #704 has been merged and a new nightly build is available for Linux. I think your issue should be fixed with the nightly build. I will close this issue, but feel free to reopen it if the issue persists. |
Well, updating to this nightly version and to tf 2.1rc1 did fix the segfault issue.
|
@zaccharieramzi A '/' prefix is needed because the HDF5 dataset namespace can be nested, e.g., [...]. The following should work (with the '/' prefix):
|
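As a small illustration of the '/' prefix rule described in the comment above, the sketch below normalizes a user-supplied dataset name into an absolute HDF5-style path. The helper `to_absolute_dataset_path` is hypothetical and not part of tensorflow-io, and the nested name `group//data/` is made up for the example.

```python
import posixpath


def to_absolute_dataset_path(name: str) -> str:
    """Return an absolute HDF5-style path such as '/group/dataset'."""
    if not name or not name.strip("/"):
        raise ValueError("dataset name must not be empty")
    # Force exactly one leading '/', then collapse repeated or trailing slashes.
    return posixpath.normpath("/" + name.lstrip("/"))


print(to_absolute_dataset_path("reconstruction_esc"))
print(to_absolute_dataset_path("/reconstruction_esc"))
print(to_absolute_dataset_path("group//data/"))

# Hypothetical use with the call quoted earlier in this thread
# (left commented out because it needs tensorflow-io and the data file):
# tfio.IODataset.from_hdf5(f, to_absolute_dataset_path("reconstruction_esc"))
```

Normalizing the name up front means callers can pass either `reconstruction_esc` or `/reconstruction_esc` and get the same absolute path.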
Ah I see, OK, thanks! It does work now, although not as I expected. Indeed, in my case (fastMRI dataset), each HDF5 file is not a dataset but rather an example. Therefore, I think I need [...]. However, when I do:
tfio.IOTensor.from_hdf5(f)
I get the following error:
From the debugging I can see that it comes from the [...]. Sorry for my misunderstanding of how [...] |
@zaccharieramzi The issue is the scalar case, which was not taken into consideration before. Let me re-open this issue. |
Cool! Do you need some help fixing this (if it's pure Python I can do it)? |
@zaccharieramzi It is a little involved in C++. I have added PR #708 for scalar support with HDF5. |
Thanks @yongtang, now in eager mode everything works just fine! However, at some point, to get the values of the columns, you used [...]. See the following minimal failing example (you need the file I sent you in your current dir).
import tensorflow as tf
import tensorflow_io as tfio
print(tf.__version__, tfio.__version__)
files_ds = tf.data.Dataset.list_files('./*.h5', seed=0)
hdf5_ds = files_ds.map(tfio.IOTensor.from_hdf5)
Gives the following error:
Do you know if there is a way to decode the column in a graph-friendly way? |
@zaccharieramzi Some additional work is needed in order to support graph mode. I have created a new issue #710 to track this support. |
@yongtang I am unable to do anything meaningful with this using h5 files formatted for TASSEL variants |
The following is from tensorflow/tensorflow#27510