I found a tar.gz that broke ratarmount after 16 hours ;-; #66

Closed
mr-bo-jangles opened this issue Jul 2, 2021 · 6 comments

@mr-bo-jangles

Currently at position 1664643022848 of 1664666965947 (100.00%). Estimated time remaining with current rate: 0 min 1 s, with average rate: 0 min 0 s.
Creating offset dictionary for /home/download/thefile.tar.gz took 57789.62s

Traceback (most recent call last):
  File "/usr/local/bin/ratarmount", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 2604, in cli
    fuseOperationsObject = TarMount(
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1937, in __init__
    self.mountSources: List[Union[SQLiteIndexedTar, FolderMountSource]] = [
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1938, in <listcomp>
    SQLiteIndexedTar(tarFile, writeIndex=True, **sqliteIndexedTarOptions)
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 504, in __init__
    self._loadOrStoreCompressionOffsets() # store
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1622, in _loadOrStoreCompressionOffsets
    db.execute('INSERT INTO gzipindex VALUES (?)', (file.read(),))
OverflowError: BLOB longer than INT_MAX bytes
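
For context, that OverflowError comes from Python's sqlite3 bindings, which refuse to bind any single parameter longer than INT_MAX (2^31 - 1) bytes; SQLite itself additionally caps blobs at SQLITE_MAX_LENGTH (1,000,000,000 bytes by default). The following minimal, standalone sketch triggers the same error; note that it allocates about 2 GiB purely for the demonstration.

import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE gzipindex ( data BLOB )")

# Python's sqlite3 module rejects any bound parameter larger than
# INT_MAX (2**31 - 1) bytes, independent of SQLite's own blob limit.
# Warning: this allocates ~2 GiB of RAM just to demonstrate the error.
tooLargeBlob = b"\x00" * 2**31
connection.execute("INSERT INTO gzipindex VALUES (?)", (tooLargeBlob,))
# -> OverflowError: BLOB longer than INT_MAX bytes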

@mxmlnkn (Owner) commented Jul 2, 2021

Thank you for reporting this! And sorry about your wasted computation power. I was not aware that there is a maximum blob size, even though I did some quite large benchmarks (100 GB tar.gz files). The limit seems to be around 1 GB for the blob size. I might be able to reproduce this problem myself by choosing a smaller indexed_gzip seek point spacing.

In the meantime, you might be able to work around the issue by increasing the gzip seek point spacing, which reduces the amount of index data to write out. However, it might slow down random seeking a little. One seek point takes up roughly 32 kiB, and the default spacing is 16 MiB, i.e., the index required for gzip seeking is ~0.2% of the original file. As you hit the 1 GB index limit, this means your file must be larger than 512 GB. According to your output log, your file seems to be about 155 GiB. Is that correct? I guess there is some leeway in my calculations somewhere. If you want to give it another try, then please try:

ratarmount --gzip-seek-point-spacing 128 ...

This will increase the gzip seek point spacing to 128 MiB, and ratarmount should work if your tar.gz is smaller than 1 GB / 32 kiB * 128 MiB ≈ 4 TiB. Or, accounting for the ~4x deviation from my estimates, it should work with files smaller than ~1 TB. If your archive is even larger, or close to that, then please choose an appropriately higher seek point spacing with some leeway, because I'm not 100% sure about the 1 GB limit, and the 32 kiB is only a rough estimate.
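
As a rough back-of-the-envelope helper for picking a spacing, here is the same estimate as Python (my own sketch, reusing the ~32 kiB per seek point and ~1 GB blob limit figures from above, which are only rough estimates):

# Approximate largest gzip archive whose seek point index still fits into
# a single SQLite blob, for a given seek point spacing in MiB.
BYTES_PER_SEEK_POINT = 32 * 1024  # ~32 kiB per seek point (rough estimate)
BLOB_LIMIT = 1_000_000_000        # ~1 GB blob limit (rough estimate)

def maxArchiveSizeTiB(spacingMiB):
    spacingBytes = spacingMiB * 1024**2
    return BLOB_LIMIT / BYTES_PER_SEEK_POINT * spacingBytes / 1024**4

print(maxArchiveSizeTiB(16))   # default spacing -> ~0.47 TiB
print(maxArchiveSizeTiB(128))  # -> ~3.7 TiB
print(maxArchiveSizeTiB(512))  # -> ~14.9 TiB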

That SQLite database is essential for ratarmount to work even if it isn't written to disk. However, the gzip seek points are not essential when the database isn't written to disk. I might be able to avoid dumping them into the database if it is never going to be written to disk anyway. Then you would have been able to use --index-file :memory: as a workaround. But then, each subsequent mount would also take 16 hours.

The limit also can't be increased by much, only to the 2 GB limit (the maximum signed 32-bit number). I guess I'll have to split the data into multiple smaller blobs to avoid the limit.
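
A minimal sketch of that chunked approach (my own illustration with a made-up table name, not the actual patch): write the serialized gzip index in slices that stay well below the limit and concatenate them again when loading.

import sqlite3

CHUNK_SIZE = 256 * 1024 * 1024  # 256 MiB per row, well below both limits

def storeGzipIndex(db, data):
    # Store the serialized seek point index as multiple rows instead of
    # one oversized blob.
    db.execute("CREATE TABLE IF NOT EXISTS gzipindexes ( data BLOB )")
    for offset in range(0, len(data), CHUNK_SIZE):
        db.execute("INSERT INTO gzipindexes VALUES (?)",
                   (data[offset:offset + CHUNK_SIZE],))
    db.commit()

def loadGzipIndex(db):
    # Reassemble the chunks in insertion (rowid) order.
    rows = db.execute("SELECT data FROM gzipindexes ORDER BY rowid")
    return b"".join(row[0] for row in rows)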

@mr-bo-jangles (Author) commented Jul 2, 2021

According to your output log, your file seems to be about 155GiB. Is that correct?

The archive itself is 1.51TB, so you're slightly off

The limit also can't be increased by much, only to the 2 GB limit (the maximum signed 32-bit number). I guess I'll have to split the data into multiple smaller blobs to avoid the limit.

This is almost certainly the best choice

@mxmlnkn (Owner) commented Jul 2, 2021

According to your output log, your file seems to be about 155GiB. Is that correct?

The archive itself is 1.51TB, so you're slightly off

I must have forgotten a 0 somewhere. Well, --gzip-seek-point-spacing 128 might still work, but a value of 512 or so might be safer.

Sorry about editing your post. I wanted to quote it, not edit it...

@mxmlnkn (Owner) commented Jul 2, 2021

I pushed a fix. You can try it out with:

pip install git+https://github.com/mxmlnkn/ratarmount.git@fix-gzindex-max-blob-size#egg=ratarmount

@mr-bo-jangles (Author)

I'll be without internet for a few weeks, so hopefully I can try this when I get internet again

@mxmlnkn (Owner) commented Jul 12, 2021

Hopefully fixed in 0.8.1. Please let me know whether it also works for you, and if not, feel free to reopen this issue.

@mxmlnkn closed this as completed Jul 12, 2021
mxmlnkn added a commit that referenced this issue Aug 30, 2024
fusepy/fusepy #66, #67, #101
fusepy/fusepy #100

First test with ratarmount worked!

 - [ ] I only monkey-patched readdir and getattr. The other changed
   methods should also be adjusted and tested, and maybe we can do
   better, e.g., by letting the caller decide which interface they want
   to implement with a member variable as a flag! Or do it via inspection
   like in fusepy/fusepy#101, but the overhead
   might be killer (a rough sketch of the inspection idea follows below).
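
A rough sketch of the inspection variant, assuming a hypothetical extended readdir signature with an extra flags parameter (the real fusepy method signatures may differ); doing the check once at mount time keeps the overhead out of the per-call hot path:

import inspect

def usesExtendedReaddir(operations):
    # Hypothetical check: does this Operations subclass implement an
    # extended readdir (with a 'flags' parameter) or the plain one?
    # Determined once up front so every FUSE callback stays cheap.
    parameters = inspect.signature(operations.readdir).parameters
    return "flags" in parameters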