
UnicodeDecodeError: 'utf-8' codec can't decode byte ... #18

Closed
simonw opened this issue Jun 19, 2023 · 10 comments


simonw commented Jun 19, 2023

Following on from:

I ran this in a huge directory:

symbex -s function_with_non_pep_0484_annotation

And eventually got this:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/bin/symbex", line 33, in <module>
    sys.exit(load_entry_point('symbex', 'console_scripts', 'symbex')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/Dropbox/Development/symbex/symbex/cli.py", line 91, in cli
    code = file.read_text("utf-8") if hasattr(file, "read_text") else file.read()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 81: invalid start byte
simonw added the bug label Jun 19, 2023

simonw commented Jun 19, 2023

This is the line in symbex/cli.py that fails:

code = file.read_text("utf-8") if hasattr(file, "read_text") else file.read()


simonw commented Jun 19, 2023

The debugger showed me the file is this one:

/Users/simon/Dropbox/Development/library-of-congress/.venv/lib/python3.6/site-packages/IPython/core/tests/nonascii.py

https://github.com/ipython/ipython/blob/2da6fb9870dffd838ba87b8e99ded17acd0d4edb/IPython/core/tests/nonascii.py

# coding: iso-8859-5
# (Unlikely to be the default encoding for most testers.)
# ±¶ÿàáâãäåæçèéêëìíîï <- Cyrillic characters
u = "®âðÄ"
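A minimal check that reproduces the error, assuming a local copy of that file (the path here is illustrative):

from pathlib import Path

# Read the raw bytes of the file shown above (path is hypothetical).
raw = Path("nonascii.py").read_bytes()

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xb1 in position 81: invalid start byte

# Decoding with the declared codec works, and the comment line comes out as Cyrillic.
print(raw.decode("iso-8859-5"))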


simonw commented Jun 19, 2023

Relevant PEP: https://peps.python.org/pep-0263/

The PEP provides this regular expression, which should be matched against each of the first two lines of the file:

^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

So I could use open(file_path, 'r', encoding='utf-8', errors='ignore') to read the first few lines of the file, scan them for that encoding declaration, and if one is found use it to read the rest of the file.
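
For example, against the first line of the file above (a quick illustrative check, not code from the repo):

import re

ENCODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

match = ENCODING_RE.match("# coding: iso-8859-5")
print(match.group(1))  # iso-8859-5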


simonw commented Jun 19, 2023

... and here's exactly that code in pyastgrep (via #17 (comment)):

https://github.com/spookylukey/pyastgrep/blob/5db475f8a2712984fb307e2da64c337563046a0b/src/pyastgrep/files.py#L34-L69

# See https://peps.python.org/pep-0263/
# I couldn't find a stdlib function for this
_ENCODING_RE = re.compile(b"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def get_encoding(python_file_bytes: bytes) -> str:
    # Search in first two lines:
    current_idx = 0
    for line_num in (1, 2):
        # what does a line break character look like
        # if we don't know the encoding? Have to assume '\n' for now
        linebreak_idx = python_file_bytes.find(b"\n", current_idx)
        if linebreak_idx < 0:
            line = python_file_bytes[current_idx:]
        else:
            line = python_file_bytes[current_idx:linebreak_idx]
        coding_match = _ENCODING_RE.match(line)
        if coding_match:
            return coding_match.groups()[0].decode("ascii")
        if linebreak_idx < 0:
            break
        else:
            current_idx = linebreak_idx + 1
    return "utf-8"


def parse_python_file(contents: bytes, filename: str | Path, *, auto_dedent: bool) -> tuple[str, ast.AST]:
    if auto_dedent:
        contents = auto_dedent_code(contents)

    parsed_ast: ast.AST = ast.parse(contents, str(filename))
    # ast.parse does its own encoding detection, which we have to replicate
    # here since we can't assume utf-8
    encoding = get_encoding(contents)
    str_contents = contents.decode(encoding)
    return str_contents, parsed_ast

And its tests: https://github.com/spookylukey/pyastgrep/blob/5db475f8a2712984fb307e2da64c337563046a0b/tests/test_encodings.py
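
Worth noting: the standard library's tokenize.detect_encoding() implements this same PEP 263 detection (plus UTF-8 BOM handling). A minimal illustration:

import io
import tokenize

raw = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(raw).readline)
print(encoding)  # iso-8859-5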


simonw commented Jun 19, 2023

I'm going to go with a slightly simpler implementation that opens in encoding="utf-8", errors="ignore" mode, reads the first 512 bytes (number plucked out of the air), splits on newlines to get the first two lines and runs that regex from the PEP against those lines.
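
Something along these lines (a sketch of what I just described, with an illustrative helper name, not the exact code that landed):

import re
from pathlib import Path

ENCODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def read_file(path: Path) -> str:
    # Read the start of the file lossily so non-UTF-8 bytes can't raise here,
    # then look for a PEP 263 coding declaration in the first two lines.
    with open(path, encoding="utf-8", errors="ignore") as fp:
        head = fp.read(512)
    encoding = "utf-8"
    for line in head.split("\n")[:2]:
        match = ENCODING_RE.match(line)
        if match:
            encoding = match.group(1)
            break
    # Re-read the whole file with whatever encoding was declared.
    return path.read_text(encoding)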

@spookylukey

@simonw The get_encoding code is a bit verbose here because I'm focussing both on correctness and performance - I don't want to end up processing the whole file just to get the encoding from the first two lines. So it uses lower-level idioms than it otherwise would.


simonw commented Jun 19, 2023

Grabbing that data for a test:

>>> import httpx
>>> httpx.get("https://raw.githubusercontent.com/ipython/ipython/2da6fb9870dffd838ba87b8e99ded17acd0d4edb/IPython/core/tests/nonascii.py").content
b'# coding: iso-8859-5\n# (Unlikely to be the default encoding for most testers.)\n# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'
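
A test along these lines could exercise it end to end (a sketch only - the cli import, test name and assertion are assumptions, not the test that was committed):

from click.testing import CliRunner
from symbex.cli import cli

def test_file_with_non_utf8_encoding(tmp_path):
    # Write the IPython fixture bytes fetched above into a temporary directory.
    (tmp_path / "nonascii.py").write_bytes(
        b'# coding: iso-8859-5\n'
        b'# (Unlikely to be the default encoding for most testers.)\n'
        b'# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef'
        b' <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'
    )
    result = CliRunner().invoke(cli, ["-s", "-d", str(tmp_path)])
    assert result.exit_code == 0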


simonw commented Jun 19, 2023

I'm basically using ChatGPT Code Interpreter as a code typing intern at this point. https://chat.openai.com/share/b062955d-3601-4051-b6d9-80cef9228233

simonw closed this as completed in 366760a Jun 19, 2023

simonw commented Jun 19, 2023

And with this change in place, the following command seems to just keep on running forever:

symbex -s -d ~/Dropbox/Development

Not sure if I have the patience to wait for it to finish.


simonw commented Jun 19, 2023

I ran it with time:

symbex -s -d ~/Dropbox/Development  110.60s user 24.30s system 88% cpu 2:32.92 total
grep '# File'  0.15s user 0.03s system 0% cpu 2:32.92 total
tee /tmp/done.txt  0.01s user 0.23s system 0% cpu 2:32.92 total

So 2m32s to finish.

And wc -l /tmp/done.txt reports 307,910 files were processed.

Activity Monitor showed the process topping out at about 230MB of RAM.
