
iter_lines is still broken? #5540

Open
vitidev opened this issue Jul 24, 2020 · 9 comments

vitidev commented Jul 24, 2020

I don't understand.

I can't iterate "\r\n" because "\r" and "\n" can be read into different chunks.

But I can't even work around this problem by explicitly setting a single-byte delimiter "\n":

  for line in (_.strip() for _ in resp.iter_lines(delimiter=b"\n")):
      print(line)

because if a chunk ends with "\r\n", then lines = chunk.split(delimiter) inside iter_lines appends an extra empty line (whereas chunk.splitlines() does not).

The same problem occurs even with a plain "\n": if a chunk ends with the delimiter, we get an extra empty line. This forces you to set a large chunk size when the line length is unknown.

So if split(...) adds an extra line, why not remove it?

    if delimiter:
        lines = chunk.split(delimiter)
        if lines and not lines[-1]:
            lines.pop(-1)
    else:
        lines = chunk.splitlines()
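For illustration, the difference between the two splitting methods can be shown in plain Python, without requests at all:

```python
# Why split(delimiter) produces a trailing empty element while
# splitlines() does not.
chunk = b"abc\r\ndef\r\n"

split_lines = chunk.split(b"\n")   # [b'abc\r', b'def\r', b'']
real_lines = chunk.splitlines()    # [b'abc', b'def']

# Dropping the trailing empty element, as proposed above:
fixed = chunk.split(b"\n")
if fixed and not fixed[-1]:
    fixed.pop()                    # [b'abc\r', b'def\r']
```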
@sigmavirus24 (Contributor)

I can't iterate "\r\n" because "\r" and "\n" can be read into different chunks.

I can't even solve this problem

I still don't understand your problem, and I think you have unrealistic expectations of what iter_lines does. Your issue isn't cogent enough for me to make sense of it; it sounds like you're having trouble using the API, which makes this more appropriate for StackOverflow. You haven't identified a clear bug (from what little I can understand in your issue) with the documented API.


vitidev commented Jul 24, 2020

I expect that

with open('src.txt', 'r') as f:
    for line in f:

and with the same file

resp = requests.get('http://127.0.0.1:5000/src.txt', stream=True)
for line in resp.iter_lines(delimiter=b"\n"):

should be equivalent.

The file src.txt contains "\r\n" line endings, and I don't expect the default behaviour to handle them correctly, because the current implementation is broken (#4629, #3894, #4271, #2431).

So I use an explicit single-byte delimiter "\n" and expect everything to be fine.

But in this case the internals switch to chunk.split, which produces empty lines that are not in the original file.

I see this is an old, known issue. The comments in requests/models.py (the proposed v3.0 branch) say:

  # Calling `.split(delimiter)` will always end with whatever text
  # remains beyond the delimiter, or '' if the delimiter is the end
  # of the text.  On the other hand, `.splitlines()` doesn't include
  # a '' if the text ends in a line delimiter.
  #
  # For example:
  #
  #     'abc\ndef\n'.split('\n')  ~> ['abc', 'def', '']
  #     'abc\ndef\n'.splitlines() ~> ['abc', 'def']
  #
  # So if we have a specified delimiter, we always pop the final
  # item and prepend it to the next chunk.
  #
  # If we're using `splitlines()`, we only do this if the chunk
  # ended midway through a line.

So this is both a description of the problem and a question: why not fix this aspect in 2.x? I don't see any breaking changes here; the function would simply stop producing empty strings it should not produce in the first place, and I would be able to use the built-in iter_lines functionality instead of a custom implementation.
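Pending such a fix, a minimal custom implementation along these lines works (a sketch; the name iter_lines_fixed is made up, and it accepts any iterable of byte chunks, such as resp.iter_content(...)):

```python
def iter_lines_fixed(chunks, delimiter=b"\n"):
    """Yield delimiter-separated lines from an iterable of byte chunks,
    buffering any incomplete tail until the next chunk arrives."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        lines = pending.split(delimiter)
        # The last element is either an incomplete line, or b"" if the
        # buffered data ended exactly on the delimiter; keep it for later
        # instead of yielding a spurious empty line.
        pending = lines.pop()
        yield from lines
    if pending:
        yield pending

# The \r left over from CRLF files can then be stripped by the caller:
lines = [l.rstrip(b"\r") for l in iter_lines_fixed([b"abc\r", b"\ndef\r\n"])]
# lines == [b"abc", b"def"] -- no spurious empty strings
```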

@sigmavirus24 (Contributor)

why not fix this aspect for versions 2.x?

3.0 is never happening, so it won't be backported to 2.x

I don't see any breaking changes here.

I'm not surprised. Unfortunately, a lot of code is closely attached to the current behaviour of this library, and even if we can say "well, they shouldn't be relying on that", changing it would still break those folks and would be backwards incompatible as a result. The unfortunate reality is that you and I can't predict every way that someone uses this library. Just because you can't see a reason for using it that way doesn't mean someone isn't, and strong backwards compatibility is important at this stage.

If 3.0 were ever to happen, that would be the time to change this behaviour, but 3.0 isn't happening.


vitidev commented Jul 27, 2020

The unfortunate reality is that you and I can't predict every way that someone uses this library.

I cannot agree.

Changes differ, and this particular change does not break anything.

The method generates spurious empty lines, and only in the special case where a delimiter is specified. The user has only two options:

  • ignore (filter out) such empty lines, given the knowledge that the source should not contain any;
  • not use the method at all and write a custom implementation, because outside the method you cannot distinguish fake empty lines from real ones in the source.

If the method stops generating spurious empty lines, the filtering becomes unnecessary, but nothing breaks.

It is hard to imagine someone basing their logic on a side effect (actually a bug) of a method that randomly generates empty lines. Building logic on a side effect that can (and should) be corrected at any time would be extremely unwise.

And for those who do not use an explicit delimiter, nothing will change at all.


saulpw commented Jan 28, 2023

@CalPeterson ran into this same issue, and filed saulpw/visidata#1704.

I agree with @vitidev that iter_lines() should behave identically to iterating over the exact same file contents on local disk using a standard file pointer. Otherwise it is a subtly wrong function, as it can generate spurious blank lines if the delimiter happens to straddle a chunk boundary.
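The straddling case is easy to reproduce without a network. The following is a simplified sketch of the 2.x splitting logic (chunk contents invented for illustration):

```python
# Mirrors the core splitting logic of requests 2.x iter_lines()
# (simplified: no delimiter, no decoding).
def iter_lines_2x(chunks):
    pending = None
    for chunk in chunks:
        if pending is not None:
            chunk = pending + chunk
        lines = chunk.splitlines()
        # Hold back the last line only if the chunk ended mid-line.
        if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
            pending = lines.pop()
        else:
            pending = None
        yield from lines
    if pending is not None:
        yield pending

# "\r\n" straddles the chunk boundary:
result = list(iter_lines_2x([b"abc\r", b"\ndef"]))
# result == [b"abc", b"", b"def"] -- a blank line that is not in the data,
# whereas b"abc\r\ndef".splitlines() correctly gives [b"abc", b"def"]
```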

Tracking down the various PRs, it seems like the fix in #3984 was eventually merged into a proposed 3.0 branch, but as @sigmavirus24 says above, "3.0 isn't happening". So if I understand correctly, iter_lines() has a fundamental bug, no fix has been allowed into the current major version, and there will be no next major version.

Please allow a fix for this issue to be merged into the mainline requests library. I assure you, more people are depending on the expected behavior than are depending on the broken behavior, and it is only a matter of time until they hit this bug.

@anjakefala

I am curious about the claim that "3.0 is never happening". It contradicts this page, where requests seems open to major releases, just infrequent ones: https://requests.readthedocs.io/en/latest/community/release-process/#major-releases


bemoody commented Nov 9, 2023

I'd be happy to try to develop a fix for this issue.

So long as the issue isn't fixed, it's a trap for the unwary, and the documentation should point out that it's broken. Currently the documentation actively encourages people to use iter_lines together with stream=True.


bemoody commented Nov 9, 2023

The following might work but is untested.

diff --git a/src/requests/models.py b/src/requests/models.py
index 44556394..6683ad92 100644
--- a/src/requests/models.py
+++ b/src/requests/models.py
@@ -861,6 +861,13 @@ class Response:
 
         pending = None
 
+        if decode_unicode:
+            lf = '\n'
+            cr = '\r'
+        else:
+            lf = b'\n'
+            cr = b'\r'
+
         for chunk in self.iter_content(
             chunk_size=chunk_size, decode_unicode=decode_unicode
         ):
@@ -869,18 +876,48 @@ class Response:
 
             if delimiter:
                 lines = chunk.split(delimiter)
-            else:
-                lines = chunk.splitlines()
-
-            if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
+                # If the chunk ends with the delimiter, then lines[-1]
+                # is empty.  Otherwise lines[-1] is not empty.  Either
+                # way, lines[-1] shouldn't be treated as a separate
+                # line, but should be prepended to the following
+                # chunk.
                 pending = lines.pop()
             else:
-                pending = None
+                lines = chunk.splitlines()
+                # If the chunk ends with 'XYZ\n' or 'XYZ\r\n', then
+                # lines[-1] is 'XYZ'.  Either way, the following byte
+                # is the start of a new line.
+                if chunk.endswith(lf):
+                    pending = None
+                # If the chunk ends with 'XYZ\r', then lines[-1] is
+                # 'XYZ'.  We don't know yet if the next byte will be
+                # '\n' (and should be treated as part of this line's
+                # ending) or something else (and is the start of a new
+                # line.)  So leave this line, with its '\r', pending.
+                elif chunk.endswith(cr):
+                    pending = lines.pop() + cr
+                # If the chunk doesn't end with '\r' or '\n', then
+                # lines[-1] isn't a complete line and should be
+                # prepended to the following chunk.
+                else:
+                    pending = lines.pop()
 
             yield from lines
 
         if pending is not None:
-            yield pending
+            if not delimiter:
+                # If the final chunk ends with CR, then CR will have
+                # been included in 'pending' above, and should be
+                # discarded.  Note that when delimiter = None, the
+                # caller wants splitlines() behavior, which means not
+                # generating a final empty string if the response ends
+                # with a line terminator.
+                yield pending.rstrip(cr)
+            else:
+                # When delimiter != None, the caller wants split()
+                # behavior, which means generating a final empty
+                # string if the response ends with the delimiter.
+                yield pending
 
     @property
     def content(self):


bemoody commented Nov 10, 2023

I'll say one more thing on the question of backward compatibility.

I found 82 instances of .iter_lines( in Debian:
https://codesearch.debian.net/search?literal=1&q=.iter_lines(+filetype%3Apython+-package%3Arequests

Based on a very quick assessment:

  • 29 instances are unrelated to Requests.
  • 25 instances are broken in some way because of this bug. (I didn't count how many of them were using stream=True.)
  • 25 instances are not broken because they ignore blank lines.
  • 3 instances I couldn't immediately assess.

So that's 50% of actual users of this method that are currently broken, and 0% that would be rendered broken by fixing this bug.
