Extended ASCII characters in multiline strings cause "SystemError: Negative size passed to PyUnicode_New" when the encoding is not specified #96611

polprog · 2022-09-06T09:29:26Z

Bug report

In some cases, when a source file that is not UTF-8 encoded contains a multi-line string, Python throws a SystemError: Negative size passed to PyUnicode_New and does not execute any code.

Minimal test case:

print("""
ą""")

This is only a problem if the non-UTF-8 character lies on a later line of the string than the opening quotes (at any point in that line).

A similar single-line test case behaves correctly:

print("""ą""")

It reports an encoding error, which is the expected behavior:

SyntaxError: Non-UTF-8 code starting with '\xb1' in file C:\Users\xxxxx\test.py on line 2, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details
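The '\xb1' in this message matches how "ą" is encoded in ISO-8859-2. The report does not say which encoding the file was saved in, so treat that as an assumption. A minimal sketch of why the byte is rejected as UTF-8, and of how a PEP 263 declaration lets the same bytes compile:

```python
# Assumption: the source file was saved as ISO-8859-2 (Latin-2).
raw = "ą".encode("iso8859_2")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The multi-line test case as raw bytes, preceded by an encoding declaration.
body = b'text = """\n' + raw + b'"""\n'
source_with_decl = b"# -*- coding: iso-8859-2 -*-\n" + body

# compile() accepts bytes and honors the PEP 263 declaration on the first line.
namespace = {}
exec(compile(source_with_decl, "<declared>", "exec"), namespace)

print(raw, is_valid_utf8(raw), repr(namespace["text"]))
```

With the declaration in place the tokenizer never reaches the decoding error path, so this works around the problem rather than exercising the bug.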

Since this is an encoding-related error, both files are attached (as .txt, since GitHub does not allow .py attachments).
test.txt - single line (correct behavior)
test_ml.txt - multi line (bug)

My environment

CPython versions tested on: Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
Operating system and architecture: Windows 10 Pro 21H2 (19044.1826)


mdboom · 2022-09-06T18:25:42Z

#96270 may fix this. Let me confirm.

mdboom · 2022-09-06T18:30:11Z

#96270 may fix this. Let me confirm.

It does not fix this.

mdboom · 2022-09-06T18:32:31Z

Since copy-and-paste doesn't usually preserve broken encodings, this is a convenient way to reproduce the bug:

open("x.py", "wb").write(b'print("""\n\xb1""")')

$ python x.py
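The same reproduction can be scripted so that both variants run side by side. This is a sketch only: the helper name run_source is made up for illustration, and the last stderr line depends on whether the interpreter includes the fix (SystemError before, SyntaxError after).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_source(source: bytes) -> subprocess.CompletedProcess:
    """Write raw bytes to a temporary x.py and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "x.py"
        path.write_bytes(source)  # bytes are written as-is, no re-encoding
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            errors="replace",
        )


single_line = b'print("""\xb1""")'    # reported as behaving correctly
multi_line = b'print("""\n\xb1""")'   # reported as triggering the SystemError

for name, source in [("single line", single_line), ("multi line", multi_line)]:
    result = run_source(source)
    lines = result.stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    print(f"{name}: exit={result.returncode} {last}")
```

Both variants should exit with a non-zero status either way, since neither file is valid UTF-8 and neither declares an encoding.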

eryksun · 2022-09-06T19:41:57Z

In tok_get() in "Parser/tokenizer.c", the following code blindly handles EOF returned from tok_nextc() as if it's the end of the file.

cpython/Parser/tokenizer.c

Lines 1936 to 1948 in 6744490

    
    /* Get rest of string */
    while (end_quote_size != quote_size) {
        c = tok_nextc(tok);
        if (c == EOF || (quote_size == 1 && c == '\n')) {
            assert(tok->multi_line_start != NULL);
            // shift the tok_state's location into
            // the start of string, and report the error
            // from the initial quote character
            tok->cur = (char *)tok->start;
            tok->cur++;
            tok->line_start = tok->multi_line_start;
            int start = tok->lineno;
            tok->lineno = tok->first_lineno;

In this case, however, tok->done is E_DECODE instead of E_EOF. This gets set by error_ret(), which also clears tok->start and tok->cur to NULL. The above code then copies the NULL tok->start into tok->cur and increments it, leaving tok->cur at 1. Subsequently, _syntaxerror_range() tries to decode the text for the syntax error using the negative size 1 - tok->line_start.
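The arithmetic can be illustrated with a toy model. This is not CPython code: pointers are modeled as plain integers, NULL as 0, the buffer address is hypothetical, and only the fields and steps named in the comment above are represented.

```python
from dataclasses import dataclass

NULL = 0  # model a NULL pointer as address 0


@dataclass
class TokState:
    """Toy stand-in for the tokenizer state; pointers are plain integers."""
    start: int
    cur: int
    line_start: int
    multi_line_start: int


def error_ret(tok: TokState) -> None:
    # Per the analysis above: the decode error path clears start and cur.
    tok.start = NULL
    tok.cur = NULL


def unterminated_string_path(tok: TokState) -> None:
    # Mirrors the quoted branch that assumes a plain end of file.
    tok.cur = tok.start
    tok.cur += 1
    tok.line_start = tok.multi_line_start


def error_text_size(tok: TokState) -> int:
    # Size of the text later decoded for the error message.
    return tok.cur - tok.line_start


BUFFER = 0x1000  # hypothetical address of the source buffer
tok = TokState(start=BUFFER, cur=BUFFER + 11, line_start=BUFFER + 10,
               multi_line_start=BUFFER)

error_ret(tok)                 # decoding fails on the second line
unterminated_string_path(tok)  # the failure is treated as if it were EOF
size = error_text_size(tok)
print(size, size < 0)
```

Any real buffer address is far larger than 1, so the computed size is negative, which is what PyUnicode_New rejects.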

mdboom · 2022-09-07T12:25:53Z

Thanks for the report, @polprog, and the diagnostics, @eryksun.

polprog added the type-bug (An unexpected behavior, bug, or error) label Sep 6, 2022

mdboom added the topic-unicode label Sep 6, 2022

mdboom self-assigned this Sep 6, 2022

eryksun added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 6, 2022

mdboom added a commit to mdboom/cpython that referenced this issue Sep 6, 2022

pythongh-96611: Fix error message for invalid UTF-8 in mid-multiline string

709d7e1

bedevere-bot mentioned this issue Sep 6, 2022

gh-96611: Fix error message for invalid UTF-8 in mid-multiline string #96623

Merged

pablogsal pushed a commit that referenced this issue Sep 6, 2022

gh-96611: Fix error message for invalid UTF-8 in mid-multiline string (#96623)

05692c6

bedevere-bot mentioned this issue Sep 6, 2022

[3.11] gh-96611: Fix error message for invalid UTF-8 in mid-multiline string (GH-96623) #96631

Merged

bedevere-bot mentioned this issue Sep 6, 2022

[3.10] gh-96611: Fix error message for invalid UTF-8 in mid-multiline string (GH-96623) #96632

Merged

miss-islington added a commit that referenced this issue Sep 6, 2022

gh-96611: Fix error message for invalid UTF-8 in mid-multiline string (GH-96623) (cherry picked from commit 05692c6) Co-authored-by: Michael Droettboom <mdboom@gmail.com>

b6af933

miss-islington added a commit that referenced this issue Sep 6, 2022

gh-96611: Fix error message for invalid UTF-8 in mid-multiline string (GH-96623) (cherry picked from commit 05692c6) Co-authored-by: Michael Droettboom <mdboom@gmail.com>

bb0dab5

mdboom closed this as completed Sep 7, 2022