Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-seekable streams can not be uploaded to S3 #860

Closed
jarshwah opened this issue Mar 13, 2020 · 8 comments
Closed

Non-seekable streams can not be uploaded to S3 #860

jarshwah opened this issue Mar 13, 2020 · 8 comments

Comments

@jarshwah
Copy link

jarshwah commented Mar 13, 2020

I have a large (1gb or so) csv file that I want admin users to be able to upload that will then be processed by an offline task. Unfortunately, uploading that file to Django isn't straightforward - because Nginx has reasonable limits to prevent bad users from uploading massive files.

Long story short, I want users to be able to provide a link (dropbox, google drive, etc) which will then be streamed by django to S3 via django-storages.

Turns out that's not so difficult because the underlying stream of urllib3 provides a file-like interface. The following almost works:

import requests
from django.core.files import File

def UrlFile(url):
    response = requests.get(url, stream=True)
    return File(response.raw)

content = UrlFile("https://example.com/large.csv")
my_model.file_field.save("streamed.csv", content)

But these streams are not files and do not support seek, which the s3boto3 backend attempts to do without question:

content.seek(0, os.SEEK_SET)
obj.upload_fileobj(content, ExtraArgs=params)

What it should do (I think..) is:

if content.seekable():
    content.seek(0, os.SEEK_SET)

I've hacked my way around this by implementing seek and no-op the default case of seek(0) when already at the start of the file. Entire class provided for any interested onlookers.

class UrlFile(File):
    def __init__(self, url: str, name: t.Optional[str] = None, **kwargs):
        """
        Provides a streaming URL File interface for Django storage system.

        Use like so:

        >>> file = UrlFile("https://example.com/my_file.csv", timeout=5)
        >>> my_model.file_field.save(file.name, file)

        `**kwargs` are all passed straight through to requests.get

        If the filename can not be determined from the `content-disposition`
        header, a uuid.uuid4 value will be provided.

        """
        resp: requests.Response = requests.get(url, stream=True, **kwargs)
        resp.raise_for_status()
        headers = resp.headers
        self._size = headers["Content-Length"]
        content_disposition = headers["content-disposition"]
        self.name = name
        if not self.name:
            self._try_set_filename(content_disposition)
        self.file = resp.raw

    def _try_set_filename(self, content_disposition: str):
        try:
            filename = re.findall(r'filename="([^"]+)', content_disposition)[0]
            # no paths
            filename = pathlib.Path(filename).name
        except IndexError:
            filename = uuid.uuid4().hex
        self.name = filename

    def seek(self, offset, whence=os.SEEK_SET):
        """
        A streaming file can not seek, but s3boto3 storage will seek(0) everything
        coming in, so we fake it for the default case of we're already at the
        beginning of the stream (a no-op)

        This fakeness might break other things, but the goal for now is to work
        with s3boto3storage
        """
        if self.file.tell() == offset and whence == os.SEEK_SET:
            return offset
        return self.file.seek(offset, whence)
@monokrome
Copy link

Same issue here unfortunately 🙀

@jschneier
Copy link
Owner

Hi, the seekable bugfix is reasonable and I will accept it in PR. I see that several other parts of the code base also use seek for things like compression, are you not using that or is there no issue?

@jarshwah
Copy link
Author

jarshwah commented Jun 2, 2020

Compression would still work, because it does a seek(0), consumes the stream, and returns a real file object. You're right though, there are many other places that would need to do this check. Some of them should raise exceptions if the stream is not seekable:

self.file.seek(0, os.SEEK_END)

This particular section of the code (writing to a file) does not make sense if the file is not seekable though, so it's reasonable to blow up in such circumstances.

@ramast
Copy link

ramast commented May 20, 2021

+1 for fixing this problem. I don't understand the code 100% but maybe adding an if statement like this

if fp.seekable():  // Don't run this code when file is not seekable
    if rewind:
        # caller requests reading from beginning of fp.
        fp.seek(0, os.SEEK_SET)
    else:

In file boto/s3/key.py line 1205

@belkka
Copy link

belkka commented May 28, 2021

BTW, default django file storage does allow saving non-seekable streams (e. g. like requests.get(..., stream=True).raw).

@monokrome
Copy link

@belkka Although a useful workaround, that requires using a different interface and worth noting that it is unfortunately harmful if considered a final suggestion in terms of interoperability with 3rd party applications - which is the case that I had run into

@belkka
Copy link

belkka commented May 31, 2021

@monokrome Did you reply to @ramast? I have not suggested any workarounds, just pointed out that default file storage is less "restrictive" :). Actually, that's why I had not noticed a bug in code that downloads facebook profile picture and saves it to django user's ImageField.

I think an option (e. g. in django settings) like "disable seek" set to False by default would make everyone happy without breaking compatibility.

vainu-arto added a commit to vainu-arto/django-storages that referenced this issue Sep 15, 2021
vainu-arto added a commit to vainu-arto/django-storages that referenced this issue Sep 16, 2021
Also add tests to verify the behavior. This fixes issue jschneier#860.
jschneier pushed a commit that referenced this issue Sep 19, 2021
…1057)

* Seek content to zero only once, before compressing

* Only seek in S3Boto3Storage._save if content is seekable

Also add tests to verify the behavior. This fixes issue #860.
@jschneier
Copy link
Owner

Resolved by #1057

jschneier pushed a commit that referenced this issue Sep 19, 2021
* Seek content to zero only once, before compressing

* Only seek in S3Boto3Storage._save if content is seekable

Also add tests to verify the behavior. This fixes issue #860.
mlazowik pushed a commit to qedsoftware/django-storages that referenced this issue Mar 9, 2022
)

* Seek content to zero only once, before compressing

* Only seek in S3Boto3Storage._save if content is seekable

Also add tests to verify the behavior. This fixes issue jschneier#860.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants