
iter_lines is still broken? #5540

Open
vitidev opened this issue Jul 24, 2020 · 9 comments

vitidev commented Jul 24, 2020

I don't understand.

I can't iterate "\r\n" because "\r" and "\n" can be read into different chunks.

But I can't even work around this problem by explicitly setting a single-byte delimiter "\n":

  for line in (_.strip() for _ in resp.iter_lines(delimiter=b"\n")):
      print(line)

because if a chunk ends with "\r\n", then lines = chunk.split(delimiter) inside iter_lines appends an extra empty line (whereas chunk.splitlines() does not).

The same problem occurs even with a plain "\n": if a chunk ends with the delimiter, we get an extra empty line. This forces you to set a large chunk size when the line length is unknown.

So if split(...) adds an extra line, why not remove it?

    if delimiter:
        lines = chunk.split(delimiter)
        if lines and not lines[-1]:
            lines.pop(-1)
    else:
        lines = chunk.splitlines()
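For illustration, the difference between the two splitting methods can be shown in plain Python, without requests at all:

```python
# Why split(delimiter) produces a trailing empty element while
# splitlines() does not.
chunk = b"abc\r\ndef\r\n"

split_lines = chunk.split(b"\n")   # [b'abc\r', b'def\r', b'']
real_lines = chunk.splitlines()    # [b'abc', b'def']

# Dropping the trailing empty element, as proposed above:
fixed = chunk.split(b"\n")
if fixed and not fixed[-1]:
    fixed.pop()                    # [b'abc\r', b'def\r']
```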
@sigmavirus24 (Contributor)

I can't iterate "\r\n" because "\r" and "\n" can be read into different chunks.

I can't even solve this problem

I still don't understand your problem, and I think you have unrealistic expectations of what iter_lines does. Your issue isn't cogent enough for me to make sense of it; it sounds like you're having trouble using the API, which makes this more appropriate for StackOverflow. You haven't identified a clear bug (from what little I can understand in your issue) with the documented API.


vitidev commented Jul 24, 2020

I expect that

with open('src.txt', 'r') as f:
    for line in f:

and with the same file

resp = requests.get('http://127.0.0.1:5000/src.txt', stream=True)
for line in resp.iter_lines(delimiter=b"\n"):

should be equivalent.

The file src.txt contains "\r\n" line endings, and I don't expect the default behaviour to handle them correctly, because the current implementation is broken (#4629, #3894, #4271, #2431).

So I use an explicit single-byte delimiter "\n" and expect everything to be fine.

But in this case the internals switch to chunk.split, which produces empty lines that are not in the original file.

I see this is an old, known issue. The comments in requests/models.py (the proposed v3.0 branch) say:

  # Calling `.split(delimiter)` will always end with whatever text
  # remains beyond the delimiter, or '' if the delimiter is the end
  # of the text.  On the other hand, `.splitlines()` doesn't include
  # a '' if the text ends in a line delimiter.
  #
  # For example:
  #
  #     'abc\ndef\n'.split('\n')  ~> ['abc', 'def', '']
  #     'abc\ndef\n'.splitlines() ~> ['abc', 'def']
  #
  # So if we have a specified delimiter, we always pop the final
  # item and prepend it to the next chunk.
  #
  # If we're using `splitlines()`, we only do this if the chunk
  # ended midway through a line.

So this is both a description of the problem and a question: why not fix this aspect in 2.x? I don't see any breaking changes here; the function would simply stop producing empty strings it should not produce in the first place, and I would be able to use the built-in iter_lines functionality instead of a custom implementation.
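Pending such a fix, a minimal custom implementation along these lines works (a sketch; the name iter_lines_fixed is made up, and it accepts any iterable of byte chunks, such as resp.iter_content(...)):

```python
def iter_lines_fixed(chunks, delimiter=b"\n"):
    """Yield delimiter-separated lines from an iterable of byte chunks,
    buffering any incomplete tail until the next chunk arrives."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        lines = pending.split(delimiter)
        # The last element is either an incomplete line, or b"" if the
        # buffered data ended exactly on the delimiter; keep it for later
        # instead of yielding a spurious empty line.
        pending = lines.pop()
        yield from lines
    if pending:
        yield pending

# The \r left over from CRLF files can then be stripped by the caller:
lines = [l.rstrip(b"\r") for l in iter_lines_fixed([b"abc\r", b"\ndef\r\n"])]
# lines == [b"abc", b"def"] -- no spurious empty strings
```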

@sigmavirus24 (Contributor)

why not fix this aspect for versions 2.x?

3.0 is never happening, so it won't be backported to 2.x

I don't see any breaking changes here.

I'm not surprised. Unfortunately, a lot of code is closely attached to the current behaviour of this library, and even if we can say "well, they shouldn't be relying on that", changing it would still break those folks and would be backwards incompatible as a result. The unfortunate reality is that you and I can't predict every way that someone uses this library. Just because you can't see a reason for using it that way doesn't mean someone isn't, and strong backwards compatibility is important at this stage.

If 3.0 were ever to happen, that would be the time to change this behaviour, but 3.0 isn't happening.


vitidev commented Jul 27, 2020

The unfortunate reality is that you and I can't predict every way that someone uses this library.

I cannot agree.

Changes differ, and this particular change does not break anything.

The method generates spurious empty lines, and only in the special case where a delimiter is specified. The user has only two options:

  • ignore (filter out) such empty lines, given the knowledge that the source should not contain any;
  • not use the method at all and write a custom implementation, because outside the method you cannot distinguish fake empty lines from real ones in the source.

If the method stops generating spurious empty lines, the filtering becomes unnecessary, but nothing breaks.

It is hard to imagine someone basing their logic on a side effect (actually a bug) of a method that randomly generates empty lines. Building logic on a side effect that can (and should) be corrected at any time would be extremely unwise.

And for those who do not use an explicit delimiter, nothing will change at all.


saulpw commented Jan 28, 2023

@CalPeterson ran into this same issue, and filed saulpw/visidata#1704.

I agree with @vitidev that iter_lines() should behave identically to iterating over the exact same file contents on local disk using a standard file pointer. Otherwise it is a subtly wrong function, as it can generate spurious blank lines if the delimiter happens to straddle a chunk boundary.
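The straddling case is easy to reproduce without a network. The following is a simplified sketch of the 2.x splitting logic (chunk contents invented for illustration):

```python
# Mirrors the core splitting logic of requests 2.x iter_lines()
# (simplified: no delimiter, no decoding).
def iter_lines_2x(chunks):
    pending = None
    for chunk in chunks:
        if pending is not None:
            chunk = pending + chunk
        lines = chunk.splitlines()
        # Hold back the last line only if the chunk ended mid-line.
        if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
            pending = lines.pop()
        else:
            pending = None
        yield from lines
    if pending is not None:
        yield pending

# "\r\n" straddles the chunk boundary:
result = list(iter_lines_2x([b"abc\r", b"\ndef"]))
# result == [b"abc", b"", b"def"] -- a blank line that is not in the data,
# whereas b"abc\r\ndef".splitlines() correctly gives [b"abc", b"def"]
```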

Tracking down the various PRs, it seems like the fix in #3984 was eventually merged into a proposed 3.0 branch, but as @sigmavirus24 says above, "3.0 isn't happening". So if I understand correctly, iter_lines() has a fundamental bug, no fix has been allowed into the current major version, and there will be no next major version.

Please allow a fix for this issue to be merged into the mainline requests library. I assure you, more people are depending on the expected behavior than are depending on the broken behavior, and it is only a matter of time until they hit this bug.

@anjakefala

I am curious about the claim that "3.0 is never happening". It contradicts this page, where requests seems open to major releases, just infrequent ones: https://requests.readthedocs.io/en/latest/community/release-process/#major-releases


bemoody commented Nov 9, 2023

I'd be happy to try to develop a fix for this issue.

So long as the issue isn't fixed, it's a trap for the unwary, and the documentation should point out that it's broken. Currently the documentation actively encourages people to use iter_lines together with stream=True.


bemoody commented Nov 9, 2023

The following might work but is untested.

diff --git a/src/requests/models.py b/src/requests/models.py
index 44556394..6683ad92 100644
--- a/src/requests/models.py
+++ b/src/requests/models.py
@@ -861,6 +861,13 @@ class Response:
 
         pending = None
 
+        if decode_unicode:
+            lf = '\n'
+            cr = '\r'
+        else:
+            lf = b'\n'
+            cr = b'\r'
+
         for chunk in self.iter_content(
             chunk_size=chunk_size, decode_unicode=decode_unicode
         ):
@@ -869,18 +876,48 @@ class Response:
 
             if delimiter:
                 lines = chunk.split(delimiter)
-            else:
-                lines = chunk.splitlines()
-
-            if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
+                # If the chunk ends with the delimiter, then lines[-1]
+                # is empty.  Otherwise lines[-1] is not empty.  Either
+                # way, lines[-1] shouldn't be treated as a separate
+                # line, but should be prepended to the following
+                # chunk.
                 pending = lines.pop()
             else:
-                pending = None
+                lines = chunk.splitlines()
+                # If the chunk ends with 'XYZ\n' or 'XYZ\r\n', then
+                # lines[-1] is 'XYZ'.  Either way, the following byte
+                # is the start of a new line.
+                if chunk.endswith(lf):
+                    pending = None
+                # If the chunk ends with 'XYZ\r', then lines[-1] is
+                # 'XYZ'.  We don't know yet if the next byte will be
+                # '\n' (and should be treated as part of this line's
+                # ending) or something else (and is the start of a new
+                # line.)  So leave this line, with its '\r', pending.
+                elif chunk.endswith(cr):
+                    pending = lines.pop() + cr
+                # If the chunk doesn't end with '\r' or '\n', then
+                # lines[-1] isn't a complete line and should be
+                # prepended to the following chunk.
+                else:
+                    pending = lines.pop()
 
             yield from lines
 
         if pending is not None:
-            yield pending
+            if not delimiter:
+                # If the final chunk ends with CR, then CR will have
+                # been included in 'pending' above, and should be
+                # discarded.  Note that when delimiter = None, the
+                # caller wants splitlines() behavior, which means not
+                # generating a final empty string if the response ends
+                # with a line terminator.
+                yield pending.rstrip(cr)
+            else:
+                # When delimiter != None, the caller wants split()
+                # behavior, which means generating a final empty
+                # string if the response ends with the delimiter.
+                yield pending
 
     @property
     def content(self):


bemoody commented Nov 10, 2023

I'll say one more thing on the question of backward compatibility.

I found 82 instances of .iter_lines( in Debian:
https://codesearch.debian.net/search?literal=1&q=.iter_lines(+filetype%3Apython+-package%3Arequests

Based on a very quick assessment:

  • 29 instances are unrelated to Requests.
  • 25 instances are broken in some way because of this bug. (I didn't count how many of them were using stream=True.)
  • 25 instances are not broken because they ignore blank lines.
  • 3 instances I couldn't immediately assess.

So that's 50% of actual users of this method that are currently broken, and 0% that would be rendered broken by fixing this bug.
