
Chunked stream memory leak #3631

Closed
mvanderkroon opened this issue Feb 28, 2019 · 10 comments

mvanderkroon commented Feb 28, 2019

Long story short

I'm trying to POST a stream of data (potentially hundreds of gigabytes) to an aiohttp server instance; the data is typically not stored in files and can be generated on the fly in client processes. A multipart upload does not fit my needs, and I don't use the data = await request.post() shorthand either (the docs are clear that it will OOM for large bodies).

Instead, I'm using the underlying StreamReader (request._payload) to iterate over the stream line by line. In doing so, the aiohttp server consumes more and more memory until the application OOMs.
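
For reference, the same StreamReader is also exposed through the public request.content attribute, so the pattern does not strictly require the private accessor. A minimal sketch of the handler shape I mean (request.content is the documented public API; the rest is illustrative only):

# Sketch only: line-by-line iteration over the request body via the
# public request.content StreamReader.
from aiohttp import web

async def process_stream(request):
    async for line in request.content:
        pass
    return web.json_response({"STATUS": "OK"})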

Expected behaviour

Processing a stream of data in an aiohttp server should not cause OOMs.

Actual behaviour

aiohttp OOMs on large streams of data

Steps to reproduce

aiohttp server

# server.py
from aiohttp import web

async def process_stream(request):
    async for line in request._payload:
        pass
    return web.json_response({"STATUS": "OK"})

app = web.Application()
app.router.add_post("/", process_stream)
web.run_app(app, port=8080)

requests client

# client.py
import requests

def generate_data():
    while True:
        yield "hello world\n".encode('utf-8')

r = requests.post("http://localhost:8080/", data=generate_data())

Additional info

I found a resource relating to asyncio and StreamReader/StreamWriter backpressure. I have done my best to read through the aiohttp source, but it looks like the fixes mentioned in that document are already in place, so I'm not sure why this is not working.

In fact, I'm not sure whether the memory increase is due to aiohttp (or an underlying lib) holding references to elements in memory, or whether the producing process is simply pushing data into the queue faster than aiohttp is consuming it (the latter would suggest a problem with backpressure).
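
One way to tell the two cases apart is to compare allocation snapshots while the handler is consuming the stream; a rough sketch using the standard library's tracemalloc (the handler body and the reporting interval are illustrative, not part of my actual code):

# Sketch: a steadily growing allocation site inside aiohttp shows up by
# file and line number in the snapshot diff.
import tracemalloc

from aiohttp import web

tracemalloc.start()

async def process_stream(request):
    before = tracemalloc.take_snapshot()
    count = 0
    async for line in request._payload:
        count += 1
        if count % 1_000_000 == 0:
            after = tracemalloc.take_snapshot()
            for stat in after.compare_to(before, "lineno")[:5]:
                print(stat)
    return web.json_response({"STATUS": "OK"})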

Your environment

Server:
aiohttp 3.5.4
Alpine 3.7.0

@aio-libs-bot

GitMate.io thinks the contributors most likely able to help are @fafhrd91, and @asvetlov.

Possibly related issues are #1656 (Memory leak), #271 (Memory leak in request), #2566 (how to get post data), #469 (how to get the post data), and #133 (Memory leak when doing https request).

mvanderkroon (Author) commented Mar 1, 2019

Found what looks like a bona fide bug. Analysis using tracemalloc showed that memory was leaking at streams:238.

This was confusing at first, but closer inspection reveals that at streams:276 the self.total_bytes value is appended to a list (_http_chunk_splits) that grows without bound: hence the memory leak.

As far as I can tell, this _http_chunk_splits list is only consumed when iterating over the data stream in chunks via AsyncStreamReaderMixin.iter_chunks; in that case it also seems to be properly cleaned up (elements are popped). Of the other three ways of iterating over the stream, at least AsyncStreamReaderMixin.__aiter__ appears vulnerable to the bug.
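
To make the growth visible without a full server round-trip, the StreamReader can be driven directly. A rough standalone sketch (the mock protocol and the keyword arguments are my assumptions, and constructor details vary between aiohttp versions, so treat it as illustrative only):

# Sketch: feed many HTTP chunks and drain the stream with readline();
# under the reported bug _http_chunk_splits keeps one entry per chunk.
import asyncio
from unittest import mock

from aiohttp.streams import StreamReader

async def main():
    stream = StreamReader(mock.Mock(), limit=2 ** 16,
                          loop=asyncio.get_running_loop())
    for _ in range(10_000):
        stream.begin_http_chunk_receiving()
        stream.feed_data(b"hello world\n")
        stream.end_http_chunk_receiving()
    stream.feed_eof()

    while await stream.readline():
        pass

    # Expected to be empty after draining; with the leak it prints 10000.
    print(len(stream._http_chunk_splits))

asyncio.run(main())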

My proposed fix would be to ensure the _http_chunk_splits list is properly cleaned up when the readline method is used to iterate over the stream line by line. Concretely I suggest the following:

# streams.py
async def readline(self) -> bytes:
    if self._exception is not None:
        raise self._exception

    line = []
    line_size = 0
    not_enough = True

    while not_enough:
        while self._buffer and not_enough:
            offset = self._buffer_offset
            ichar = self._buffer[0].find(b"\n", offset) + 1
            # Read from current offset to found b'\n' or to the end.
            data = self._read_nowait_chunk(ichar - offset if ichar else -1)
            line.append(data)
            line_size += len(data)
            if ichar:
                not_enough = False

            if line_size > self._high_water:
                raise ValueError("Line is too long")

        if self._eof:
            break

        if not_enough:
            await self._wait("readline")

    # fixes memory leak: https://github.com/aio-libs/aiohttp/issues/3631
    self._http_chunk_splits = []

    return b"".join(line)

@socketpair (Contributor)

@mvanderkroon could you please revert the quote and whitespace changes in your commit?

@mvanderkroon (Author)

@socketpair sure thing - latest commit should be fine

@socketpair (Contributor)

@mvanderkroon I wouldn't say so:

[screenshot]

mvanderkroon (Author) commented Mar 2, 2019

I think somehow my latest commit is not being picked up here: da7bcaa

If all else fails I'll simply redo the process; let me know.

@socketpair (Contributor)

Well, please squash (fixup) the commits and make a PR.

mvanderkroon mentioned this issue Mar 3, 2019
socketpair changed the title from "POST stream data memory leak" to "Chunked stream memory leak" Mar 3, 2019

odysseusmax commented Mar 30, 2019

I'm also experiencing a memory leak with chunked stream responses.

I'm downloading files over HTTP using aiohttp; my code looks somewhat like this:

async with aiohttp.ClientSession() as session:
    async with session.get(url) as r:
        if r.status < 400:
            with open(filename, 'wb') as fd:
                async for chunk in r.content.iter_chunked(1024):
                    fd.write(chunk)

With concurrent downloads I get an out-of-memory error.

@socketpair (Contributor)

@odysseusmax please open a separate issue and prepare a complete program that reproduces the bug. Every detail is important: the Python version, and especially a web server I can use to reproduce it.

socketpair (Contributor) commented Mar 30, 2019

The fix has landed in both the master and 3.5 branches.

lock bot added the outdated label Apr 2, 2020
lock bot locked as resolved and limited conversation to collaborators Apr 2, 2020