Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token lines in the tokenize module always end in new line #104825

Closed
pablogsal opened this issue May 23, 2023 · 7 comments
Closed

Token lines in the tokenize module always end in new line #104825

pablogsal opened this issue May 23, 2023 · 7 comments

Comments

@pablogsal
Copy link
Member

pablogsal commented May 23, 2023

As a consequence of the changes in the Python tokenizer post PEP701, now the line attribute always ends in a new line character. We should remove this as it adds no information and wasn't there before.

Linked PRs

@lysnikolaou
Copy link
Member

Yeah, I ran across that while working on #104798. Are you already working on a fix or shall I take it up?

@pablogsal
Copy link
Member Author

Yeah, I ran across that while working on #104798. Are you already working on a fix or shall I take it up?

I am working on a fix, should be up in half an hour or so

pablogsal added a commit to pablogsal/cpython that referenced this issue May 24, 2023
pablogsal added a commit that referenced this issue May 24, 2023
@pablogsal pablogsal added needs backport to 3.12 bug and security fixes and removed needs backport to 3.12 bug and security fixes labels May 24, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue May 24, 2023
… in tokens emitted in the tokenize module (pythonGH-104846).

(cherry picked from commit c8cf9b4)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
@lysnikolaou
Copy link
Member

#104846 now removes a newline, even when a newline exists in the original source.

❯ ./python.exe
Python 3.13.0a0 (heads/setuptools-peg-generator-dirty:c7503656c2, May 24 2023, 12:24:20) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize
>>> import io
>>> file = io.StringIO('1 + 2 + 3\n')
>>> for token in tokenize.generate_tokens(file.readline):
...     print(token)
... 
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 + 2 + 3')
TokenInfo(type=55 (OP), string='+', start=(1, 2), end=(1, 3), line='1 + 2 + 3')
TokenInfo(type=2 (NUMBER), string='2', start=(1, 4), end=(1, 5), line='1 + 2 + 3')
TokenInfo(type=55 (OP), string='+', start=(1, 6), end=(1, 7), line='1 + 2 + 3')
TokenInfo(type=2 (NUMBER), string='3', start=(1, 8), end=(1, 9), line='1 + 2 + 3')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='1 + 2 + 3')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

pablogsal added a commit that referenced this issue May 24, 2023
…kens emitted in the tokenize module (GH-104846). (#104850)

(cherry picked from commit c8cf9b4)
@lysnikolaou
Copy link
Member

@pablogsal Do you think the above should be fixed? Will you work on it?

@pablogsal
Copy link
Member Author

@pablogsal Do you think the above should be fixed? Will you work on it?

Hummmmm, not sure how easy is going to be for us to distinguish between implicit and explicit newlines at the end. I would say, leave it like is now unless you disagree, in which case let's see if we can fix it easily.

@lysnikolaou
Copy link
Member

My only concern is that tests that rely on that will fail. For example, test_pegen now fails:

def test_repeat_0_simple(self) -> None:
        grammar = """
        start: thing thing* NEWLINE
        thing: NUMBER
        """
        parser_class = make_parser(grammar)
        node = parse_string("1 2 3\n", parser_class)
        self.assertEqual(
            node,
            [
                TokenInfo(NUMBER, string="1", start=(1, 0), end=(1, 1), line="1 2 3\n"),
                [
                    TokenInfo(
                        NUMBER, string="2", start=(1, 2), end=(1, 3), line="1 2 3\n"
                    ),
                    TokenInfo(
                        NUMBER, string="3", start=(1, 4), end=(1, 5), line="1 2 3\n"
                    ),
                ],
                TokenInfo(
                    NEWLINE, string="\n", start=(1, 5), end=(1, 6), line="1 2 3\n"
                ),
            ],
        )
        node = parse_string("1\n", parser_class)
        self.assertEqual(
            node,
            [
                TokenInfo(NUMBER, string="1", start=(1, 0), end=(1, 1), line="1\n"),
                [],
                TokenInfo(NEWLINE, string="\n", start=(1, 1), end=(1, 2), line="1\n"),
            ],
        )

The failure is:

======================================================================
FAIL: test_repeat_0_simple (test.test_peg_generator.test_pegen.TestPegen.test_repeat_0_simple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/lysnikolaou/repos/python/cpython/Lib/test/test_peg_generator/test_pegen.py", line 248, in test_repeat_0_simple
    self.assertEqual(
AssertionError: Lists differ: [Toke[67 chars]1 2 3'), [TokenInfo(type=2 (NUMBER), string='2[201 chars] 3')] != [Toke[67 chars]1 2 3\n'), [TokenInfo(type=2 (NUMBER), string=[209 chars]\n')]

First differing element 0:
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 2 3')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 2 3\n')

Diff is 1018 characters long. Set self.maxDiff to None to see it.

The fix is very simple, so I'm okay with going with this as-is. I will fix test_pegen in my setuptools PR.

terryjreedy added a commit to terryjreedy/cpython that referenced this issue May 24, 2023
Order of events:
Terry merged new idlelib test into main.
Ms. I. made a 3.12 backport; tests passed.
Pablo merged the tokenize change with idlelib test fix into main.
Pablo merged a 3.12 backport without the idle test fix
as the backport of the latter had not yet been been merged.
Terry merged the idlelib test backport.  The new test failed
on at least 4 3.12 buildbots because of the tokenize change.
This PR backports the now needed idlelib test fix.

(cherry picked from commit c8cf9b4)
pablogsal pushed a commit that referenced this issue May 24, 2023
Order of events:
Terry merged new idlelib test into main.
Ms. I. made a 3.12 backport; tests passed.
Pablo merged the tokenize change with idlelib test fix into main.
Pablo merged a 3.12 backport without the idle test fix
as the backport of the latter had not yet been been merged.
Terry merged the idlelib test backport.  The new test failed
on at least 4 3.12 buildbots because of the tokenize change.
This PR backports the now needed idlelib test fix.

(cherry picked from commit c8cf9b4)
@nedbat
Copy link
Member

nedbat commented May 26, 2023

You say:

As a consequence of the changes in the Python tokenizer post PEP701, now the line attribute always ends in a new line character. We should remove this as it adds no information and wasn't there before.

But it looks to me like the lines used to always end in newline, except for ENDMARKER. This uses tokbug.py from #104972:

3.10.9 (main, Dec  7 2022, 07:15:24) [Clang 12.0.0 (clang-1200.0.32.29)]
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='a', start=(2, 0), end=(2, 1), line='a + \\\n')
TokenInfo(type=54 (OP), string='+', start=(2, 2), end=(2, 3), line='a + \\\n')
TokenInfo(type=1 (NAME), string='b', start=(3, 0), end=(3, 1), line='b\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 1), end=(3, 2), line='b\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants