flush doesn't create or upload the file until the file is closed #484

Open · VOvchinnikov opened this issue Jul 28, 2022 · 3 comments

@VOvchinnikov

fsspec version 2022.5.0
gcsfs version 2022.5.0

Code to reproduce:

import fsspec
fileobj = fsspec.open('gs://<insert-your-bucket-here>/test-write-flush', 'w', auto_mkdirs=True)
f = fileobj.fs.open(fileobj.path, mode=fileobj.mode)
f.write('w' * (2**20))  # guaranteed to be larger than the minimal block size
f.flush()  # does nothing visible - no file is created at the destination
f.close()  # now the file is created and has content

Upon debugging the flush call, it seems that the check self.buffer.tell() < self.blocksize is always True because, the way things are implemented, self.buffer.tell() returns 0.
Furthermore, if I manually call what the fsspec flush implementation does after that check, i.e. this code:

        if self.offset is None:
            # Initialize a multipart upload
            self.offset = 0
            try:
                self._initiate_upload()
            except:  # noqa: E722
                self.closed = True
                raise

        if self._upload_chunk(final=force) is not False:
            self.offset += self.buffer.seek(0, 2)
            self.buffer = io.BytesIO()

the file is still not created, even though the underlying code in _upload_chunk does do something.

@martindurant
Member

There are two ways to write a file ("key") to GCS: a single upload, or a multi-part upload. For the first, it's a one-shot deal, so close and flush are necessarily the same (we do this for small files).
For the latter, an upload container is created on the first flush (if the buffer is big enough; GCS limits how small each piece can be!), and subsequent flushes send more pieces; but in the remote API, the only way to stitch the pieces together at the destination is when you are finally done with the file, i.e., the same as close. Sorry, GCS is not a real file system; we do our best to emulate one, but cannot get around such shortcomings.
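
To make the mechanics concrete, here is a minimal sketch of those two paths. This is not gcsfs's actual implementation; the client object and its simple_upload / initiate_resumable / upload_part / finalize methods are hypothetical stand-ins for the underlying GCS API calls:

import io

class SketchBufferedWriter:
    # Illustrates the single-shot vs. multi-part behaviour described above.

    def __init__(self, client, path, blocksize=5 * 2**20):
        self.client = client       # hypothetical GCS client (assumption)
        self.path = path
        self.blocksize = blocksize
        self.buffer = io.BytesIO()
        self.upload_id = None      # set once a multi-part upload has started

    def write(self, data):
        self.buffer.write(data)

    def flush(self):
        if self.buffer.tell() < self.blocksize:
            return  # piece too small to send on its own; keep buffering
        if self.upload_id is None:
            self.upload_id = self.client.initiate_resumable(self.path)
        self.client.upload_part(self.upload_id, self.buffer.getvalue())
        self.buffer = io.BytesIO()

    def close(self):
        if self.upload_id is None:
            # small file: one-shot upload, so flush and close coincide
            self.client.simple_upload(self.path, self.buffer.getvalue())
        else:
            # send the remainder, then stitch the parts together;
            # only now does the object become visible in the bucket
            self.client.upload_part(self.upload_id, self.buffer.getvalue())
            self.client.finalize(self.upload_id)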

@VOvchinnikov
Author

Thanks a lot!

After reading a bit more into the limitations, I think I understand them now.

I suppose the only "easy" workaround is to re-upload the entire file every time there's a flush, but that doesn't sound all that practical for bigger files.

The workaround I am likely going to use, given the immutability of the final objects, is to upload parts of the stream as separate intermediate files whenever flush is called and then, once the file is closed, perform the analogue of gsutil compose (see the sketch below).

I wonder if this approach could be translated to a more general one.
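
Here is a minimal sketch of that workaround, assuming gcsfs and its GCSFileSystem.merge method (which wraps the GCS compose API); the class name and the ".part-N" naming scheme are invented for illustration:

import gcsfs

class ComposeOnCloseWriter:
    # Each flush uploads a complete intermediate object;
    # close() stitches them together via merge() (GCS compose).

    def __init__(self, fs, path):
        self.fs = fs        # a gcsfs.GCSFileSystem instance
        self.path = path    # final object, e.g. "bucket/key"
        self.parts = []     # intermediate objects uploaded so far
        self.buffer = []

    def write(self, data):
        self.buffer.append(data)

    def flush(self):
        if not self.buffer:
            return
        part = f"{self.path}.part-{len(self.parts)}"
        with self.fs.open(part, "wb") as f:
            f.write(b"".join(self.buffer))
        self.parts.append(part)  # this part is visible in the bucket right away
        self.buffer = []

    def close(self):
        self.flush()
        if self.parts:
            self.fs.merge(self.path, self.parts)  # compose into the final object
            for part in self.parts:
                self.fs.rm(part)

fs = gcsfs.GCSFileSystem()
w = ComposeOnCloseWriter(fs, '<insert-your-bucket-here>/test-compose')
w.write(b'w' * (2**20))
w.flush()   # intermediate object appears immediately
w.close()   # parts are composed into the final object, then cleaned up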

@martindurant
Member

You can use the merge() method for this, but I believe it has limits on how big the pieces can be too.
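
For reference, a minimal merge() call, assuming the pieces were already uploaded (the paths are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()
# concatenate previously uploaded objects into one; uses GCS compose under the hood
fs.merge('<insert-your-bucket-here>/final',
         ['<insert-your-bucket-here>/final.part-0',
          '<insert-your-bucket-here>/final.part-1'])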
