
Huge gz file #44

Open
nilgoyette opened this issue Feb 21, 2019 · 5 comments

Comments

@nilgoyette
Collaborator

We received a big image from the Human Connectome Project. Nothing huge at first, but we needed to resample it to 1x1x1 and now it's 2.3 GB in .nii.gz and 8.0 GB in .nii. It's a 181x218x181x288 f32 image, so we end up allocating 8 227 466 496 bytes and reading from a Gz source, here:

let mut raw_data = vec![0u8; nb_bytes_for_data(header)?];
source.read_exact(&mut raw_data)?;

I tested and it doesn't seem to be a memory issue, in the sense that it does reach the read_exact line, but then it's stuck long enough that I kill the job. 7zip decodes it in ~1m40s and nifti-rs reads the non-gz version in ~10s. For the gz version, it allocates ~3750 MB, then runs indefinitely (the longest we waited was 1 hour) while keeping one CPU core busy, so it's doing something.

We will probably work with HCP images in the future, so we might want to contribute a solution to this problem. I'm not sure how to solve it, though! Do you think a chunked version would work? Something like:

out = image of right dimension
buffer = vec![0; 1024]
while not eof
    read chunk
    reinterpret to input type
    cast to requested type
    linear_transform
    assign to out at the right place
return out
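
A rough Rust sketch of this chunked idea (illustrative only: the chunk size, the little-endian assumption, and the slope/intercept parameters are mine, not the existing nifti-rs API):

use std::io::{self, Read};

// Illustrative chunked reader: pull the voxel data through a small fixed
// buffer instead of one giant read_exact, converting and scaling values
// as they arrive. Assumes little-endian f32 data and a slope/intercept
// linear transform.
fn read_f32_chunked<R: Read>(
    mut source: R,
    n_values: usize,
    slope: f32,
    intercept: f32,
) -> io::Result<Vec<f32>> {
    let mut out = Vec::with_capacity(n_values);
    let mut buffer = [0u8; 4096]; // multiple of 4, so no value is split across reads
    let mut remaining = n_values * 4; // bytes left to read
    while remaining > 0 {
        let want = remaining.min(buffer.len());
        source.read_exact(&mut buffer[..want])?;
        for chunk in buffer[..want].chunks_exact(4) {
            let v = f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
            out.push(v * slope + intercept); // the "cast + linear_transform" step
        }
        remaining -= want;
    }
    Ok(out)
}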

It might slow down the reading of "normal"/smaller images, but we can probably create a different code path for "big" images. What do you think?

@Enet4
Owner

Enet4 commented Feb 21, 2019

Coincidentally, one of the ideas that I've had at the back of my head for a while was a "lazy" NIfTI volume implementation, which would not pull the whole volume from the file into memory. It would be backed by some sort of paged caching mechanism, thus restricting memory usage while preventing an excessive number of file reads. Combined with adaptor methods for retrieving volume slices, I believe that such a volume would solve that particular problem.

On the other hand, it's in fact weird that the program can allocate enough memory for the volume, but fail to read it afterwards. Something else might be at play here, so I would like to look into this as well.
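
To give a feel for the shape this could take, a minimal sketch of a page-cached volume over an uncompressed, seekable file (all names are hypothetical; a GZip-backed version cannot seek and would need to re-decode from the start, as discussed later in this thread):

use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const PAGE_SIZE: u64 = 64 * 1024;

// Hypothetical lazy volume: voxel bytes stay on disk and fixed-size pages
// are read in only when first touched. No eviction or bounds checks here.
struct LazyVolume {
    file: File,
    data_offset: u64,             // where the voxel data starts in the file
    pages: HashMap<u64, Vec<u8>>, // page index -> raw page bytes
}

impl LazyVolume {
    fn byte_at(&mut self, pos: u64) -> io::Result<u8> {
        let page = pos / PAGE_SIZE;
        if !self.pages.contains_key(&page) {
            let mut buf = vec![0u8; PAGE_SIZE as usize];
            self.file
                .seek(SeekFrom::Start(self.data_offset + page * PAGE_SIZE))?;
            let n = self.file.read(&mut buf)?; // the last page may be short
            buf.truncate(n);
            self.pages.insert(page, buf);
        }
        Ok(self.pages[&page][(pos % PAGE_SIZE) as usize])
    }
}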

@fmorency

I've been bitten by this bug while processing huge multishell DWI datasets. The workaround of using uncompressed NIFTI files works at the expense of disk space.

@Enet4
Owner

Enet4 commented Apr 12, 2019

I also agree that this is an important (and fairly challenging) matter. So far, I've thought of two non-exclusive ways to overcome the problem:

  1. As I stated above, one could have a lazy implementation that would keep the file open and consume the stream as values are requested, keeping only a few portions of past data in memory. This comes with a caveat for GZip-compressed volumes: if the user wishes to read a value far back in the volume data, the program would have to reopen the file and decompress the byte stream from the beginning. This could be automated, albeit with some implications for performance predictability.

  2. Recently, I've also been thinking about providing an alternative public API, with no arbitrary indexing capabilities, but still with a means to iterate through slices of the volume. This is easier to implement and should still reflect most use cases without becoming unergonomic.
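
A rough sketch of what that forward-only API could look like (hypothetical types, not the current crate API): an iterator that yields one slice worth of raw bytes at a time from any Read source, such as a GzDecoder, so at most one slice is in memory at once.

use std::io::{self, Read};

// Hypothetical sequential slice reader over any byte source.
struct SliceReader<R: Read> {
    source: R,
    slice_len: usize, // bytes per slice: dim_x * dim_y * bytes_per_voxel
    remaining: usize, // number of slices left in the volume
}

impl<R: Read> Iterator for SliceReader<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let mut buf = vec![0u8; self.slice_len];
        // Propagate read errors to the caller instead of panicking.
        Some(self.source.read_exact(&mut buf).map(|()| buf))
    }
}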

@nilgoyette
Collaborator Author

The ideas in this issue might still be interesting, but the bug has been found and solved in flate2. tl;dr: there was an infinite loop on huge files.

@Enet4
Owner

Enet4 commented Jun 21, 2019

Great to know! I say we keep the issue open nonetheless, as a more memory-efficient solution for reading large volumes may still be useful.

Enet4 mentioned this issue Jun 27, 2019