Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer module does not handle backslash characters correctly #90432

Closed
ucodery mannequin opened this issue Jan 5, 2022 · 17 comments
Closed

Tokenizer module does not handle backslash characters correctly #90432

ucodery mannequin opened this issue Jan 5, 2022 · 17 comments
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@ucodery
Copy link
Mannequin

ucodery mannequin commented Jan 5, 2022

BPO 46274
Nosy @lysnikolaou, @ucodery

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-01-05.23:12:47.391>
labels = ['interpreter-core', '3.8', '3.9', '3.10', '3.11', '3.7']
title = 'Tokenizer module does not handle backslash characters correctly'
updated_at = <Date 2022-01-06.13:57:33.135>
user = 'https://github.com/ucodery'

bugs.python.org fields:

activity = <Date 2022-01-06.13:57:33.135>
actor = 'pablogsal'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Parser']
creation = <Date 2022-01-05.23:12:47.391>
creator = 'ucodery'
dependencies = []
files = []
hgrepos = []
issue_num = 46274
keywords = []
message_count = 1.0
messages = ['409811']
nosy_count = 2.0
nosy_names = ['lys.nikolaou', 'ucodery']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue46274'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']

@ucodery
Copy link
Mannequin Author

ucodery mannequin commented Jan 5, 2022

A source of one or more backslash-escaped newlines, and one final newline, is not tokenized the same as a source where those lines are "manually joined".

The source

\
\
\

produces the tokens NEWLINE, ENDMARKER when piped to the tokenize module.

Whereas the source


produces the tokens NL, ENDMARKER.

What I expect is to receive only one NL token from both sources. As per the documentation "Two or more physical lines may be joined into logical lines using backslash characters" ... "A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated)"

And, because these logical lines are not being ignored, if there are spaces/tabs, INDENT and DEDENT tokens are also being unexpectedly produced.

The source

    \

produces the tokens INDENT, NEWLINE, DEDENT, ENDMARKER.

Whereas the source (with spaces)

    

produces the tokens NL, ENDMARKER.

@ucodery ucodery mannequin added 3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jan 5, 2022
@pablogsal pablogsal changed the title backslash creating statement out of nothing Tokenizer module does not handle backslash characters correctly Jan 6, 2022
@pablogsal pablogsal changed the title backslash creating statement out of nothing Tokenizer module does not handle backslash characters correctly Jan 6, 2022
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.9 only security fixes 3.8 (EOL) end of life 3.7 (EOL) end of life labels Oct 7, 2022
@asottile
Copy link
Contributor

asottile commented May 6, 2023

another similar example -- which I believe to be the root cause of hhatto/autopep8#669 is that these are treated differently:

import sys

def f():
    pass
import sys
\

def f():
    pass

the former having a NL token whereas the latter has a NEWLINE

in this particular case it throws off pycodestyle and then autopep8

@asottile
Copy link
Contributor

asottile commented May 6, 2023

this seems to fix it, but I'm not sure if this is the right direction -- it seems overly simple of a fix for such a complicated piece of code

diff --git a/Lib/test/test_tokenize.py b/Lib/test/test_tokenize.py
index 911b53e581..2280374971 100644
--- a/Lib/test/test_tokenize.py
+++ b/Lib/test/test_tokenize.py
@@ -59,6 +59,26 @@ def test_implicit_newline(self):
         self.assertEqual(tokens[-2].type, NEWLINE)
         self.assertEqual(tokens[-1].type, ENDMARKER)
 
+    def test_line_continuation(self):
+        code = dedent("""\
+            import sys
+
+            \\
+
+            import os
+        """)
+
+        self.check_tokenize(code, """\
+    NAME       'import'      (1, 0) (1, 6)
+    NAME       'sys'         (1, 7) (1, 10)
+    NEWLINE    '\\n'          (1, 10) (1, 11)
+    NL         '\\n'          (2, 0) (2, 1)
+    NL         '\\n'          (4, 0) (4, 1)
+    NAME       'import'      (5, 0) (5, 6)
+    NAME       'os'          (5, 7) (5, 9)
+    NEWLINE    '\\n'          (5, 9) (5, 10)
+        """)
+
     def test_basic(self):
         self.check_tokenize("1 + 1", """\
     NUMBER     '1'           (1, 0) (1, 1)
diff --git a/Lib/tokenize.py b/Lib/tokenize.py
index 46d2224f5c..cf66912262 100644
--- a/Lib/tokenize.py
+++ b/Lib/tokenize.py
@@ -593,6 +593,8 @@ def _tokenize(readline, encoding):
                 elif initial.isidentifier():               # ordinary name
                     yield TokenInfo(NAME, token, spos, epos, line)
                 elif initial == '\\':                      # continued stmt
+                    if line == '\\\n':  # ignore an empty escaped line
+                        break
                     continued = 1
                 else:
                     if initial in '([{':

@lysnikolaou
Copy link
Member

This appears to have been fixed in #104323. @asottile Can you check whether that solves the issue for you as well?

@asottile
Copy link
Contributor

@lysnikolaou it seems to fix that -- but also includes a regression(?) (this is 3.12.0b1):

$ diff -u <(python3.11 -mtokenize t.py) <(python3.12 -mtokenize t.py)
--- /dev/fd/63	2023-05-24 08:44:07.429455783 -0400
+++ /dev/fd/62	2023-05-24 08:44:07.429455783 -0400
@@ -2,7 +2,7 @@
 1,0-1,6:            NAME           'import'       
 1,7-1,10:           NAME           'sys'          
 1,10-1,11:          NEWLINE        '\n'           
-3,0-3,1:            NEWLINE        '\n'           
+3,0-3,1:            NL             '\n'           
 4,0-4,3:            NAME           'def'          
 4,4-4,5:            NAME           'f'            
 4,5-4,6:            OP             '('            
@@ -12,5 +12,5 @@
 5,0-5,4:            INDENT         '    '         
 5,4-5,8:            NAME           'pass'         
 5,8-5,9:            NEWLINE        '\n'           
-6,0-6,0:            DEDENT         ''             
+5,9-5,9:            DEDENT         ''             
 6,0-6,0:            ENDMARKER      ''     

@lysnikolaou
Copy link
Member

lysnikolaou commented May 24, 2023

I think that the 3.12 version is the correct one regarding the NEWLINE/NL tokens. Regarding the location of the DEDENT token, I don't have a strong opinion, but feel that it could stay this way. @pablogsal @mgmacias95 What do you think?

@pablogsal
Copy link
Member

pablogsal commented May 24, 2023

This is now covered in the 3.12 what's new:

Aditionally, there may be some minor behavioral changes as a consecuence of the changes required to support PEP 701. Some of these changes include:

Some final DEDENT tokens are now emitted within the bounds of the input. This means that for a file containing 3 lines, the old version of the tokenizer returned a DEDENT token in line 4 whilst the new version returns the token in line 3.
The type attribute of the tokens emitted when tokenizing some invalid Python characters such as ! has changed from ERRORTOKEN to OP.

@pablogsal
Copy link
Member

I think we can close this issue

@lysnikolaou
Copy link
Member

Thanks @pablogsal!

@pablogsal
Copy link
Member

Singe the latest fix, the only different now is the DEDENT:

❯ diff -u <( ../3.11/python.exe -mtokenize lol.py) <(./python.exe -mtokenize lol.py )
--- /dev/fd/11	2023-05-25 17:08:47
+++ /dev/fd/12	2023-05-25 17:08:47
@@ -12,5 +12,5 @@
 4,0-4,4:            INDENT         '    '
 4,4-4,8:            NAME           'pass'
 4,8-4,9:            NEWLINE        '\n'
-5,0-5,0:            DEDENT         ''
+4,9-4,9:            DEDENT         ''
 5,0-5,0:            ENDMARKER      ''

@asottile
Copy link
Contributor

@pablogsal then this should be reopened

@pablogsal
Copy link
Member

pablogsal commented May 25, 2023

@pablogsal then this should be reopened

Why? That is an expected difference as I mentioned in my previous comment.

@asottile
Copy link
Contributor

the NEWLINE token should be a NL if this issue is resolved

@pablogsal
Copy link
Member

pablogsal commented May 25, 2023

the NEWLINE token should be a NL if this issue is resolved

No, that was a bug, see #104866 and #104870

@asottile
Copy link
Contributor

I'm confused -- I think we're talking about different things -- are you using the source here?

#90432 (comment)

@pablogsal
Copy link
Member

pablogsal commented May 25, 2023

are you using the source here?

I am. The difference I was mentioning was between Python 3.11 and 3.12. Just to be clear, given

$cat source1.py
import sys

def f():
    pass

$ cat source2.py
import sys
\

def f():
    pass

with 3.12:

./python -m tokenize < source1.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            NL             '\n'
3,0-3,3:            NAME           'def'
3,4-3,5:            NAME           'f'
3,5-3,6:            OP             '('
3,6-3,7:            OP             ')'
3,7-3,8:            OP             ':'
3,8-3,9:            NEWLINE        '\n'
4,0-4,4:            INDENT         '    '
4,4-4,8:            NAME           'pass'
4,8-4,9:            NEWLINE        '\n'
4,9-4,9:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

./python -m tokenize < source2.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,3:            NAME           'def'
4,4-4,5:            NAME           'f'
4,5-4,6:            OP             '('
4,6-4,7:            OP             ')'
4,7-4,8:            OP             ':'
4,8-4,9:            NEWLINE        '\n'
5,0-5,4:            INDENT         '    '
5,4-5,8:            NAME           'pass'
5,8-5,9:            NEWLINE        '\n'
5,9-5,9:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

With 3.11:

python -m tokenize < source1.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            NL             '\n'
3,0-3,3:            NAME           'def'
3,4-3,5:            NAME           'f'
3,5-3,6:            OP             '('
3,6-3,7:            OP             ')'
3,7-3,8:            OP             ':'
3,8-3,9:            NEWLINE        '\n'
4,0-4,4:            INDENT         '    '
4,4-4,8:            NAME           'pass'
4,8-4,9:            NEWLINE        '\n'
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

python -m tokenize < source2.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
3,0-3,1:            NEWLINE        '\n'
4,0-4,3:            NAME           'def'
4,4-4,5:            NAME           'f'
4,5-4,6:            OP             '('
4,6-4,7:            OP             ')'
4,7-4,8:            OP             ':'
4,8-4,9:            NEWLINE        '\n'
5,0-5,4:            INDENT         '    '
5,4-5,8:            NAME           'pass'
5,8-5,9:            NEWLINE        '\n'
6,0-6,0:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

The difference between 3.11 and 3.12 for the "correct" version is still the location of the last DEDENT token, which is expected.

@asottile
Copy link
Contributor

hmmm ok but this diff doesn't show the NEWLINE -> NL: #90432 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)
Projects
None yet
Development

No branches or pull requests

4 participants