
UnicodeDecodeError: 'utf-8' codec can't decode byte ... #18

Closed
simonw opened this issue Jun 19, 2023 · 10 comments


simonw commented Jun 19, 2023

Following on from:

I ran this in a huge directory:

symbex -s function_with_non_pep_0484_annotation

And eventually got this:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/bin/symbex", line 33, in <module>
    sys.exit(load_entry_point('symbex', 'console_scripts', 'symbex')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/symbex--e1aIHUb/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/Dropbox/Development/symbex/symbex/cli.py", line 91, in cli
    code = file.read_text("utf-8") if hasattr(file, "read_text") else file.read()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 81: invalid start byte
simonw added the bug label Jun 19, 2023

simonw commented Jun 19, 2023

This is the line in symbex/cli.py that fails:

code = file.read_text("utf-8") if hasattr(file, "read_text") else file.read()


simonw commented Jun 19, 2023

The debugger showed me the file is this one:

/Users/simon/Dropbox/Development/library-of-congress/.venv/lib/python3.6/site-packages/IPython/core/tests/nonascii.py

https://github.com/ipython/ipython/blob/2da6fb9870dffd838ba87b8e99ded17acd0d4edb/IPython/core/tests/nonascii.py

# coding: iso-8859-5
# (Unlikely to be the default encoding for most testers.)
# ±¶ÿàáâãäåæçèéêëìíîï <- Cyrillic characters
u = "®âðÄ"
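A minimal check that reproduces the error, assuming a local copy of that file (the path here is illustrative):

from pathlib import Path

# Read the raw bytes of the file shown above (path is hypothetical).
raw = Path("nonascii.py").read_bytes()

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xb1 in position 81: invalid start byte

# Decoding with the declared codec works, and the comment line comes out as Cyrillic.
print(raw.decode("iso-8859-5"))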


simonw commented Jun 19, 2023

Relevant PEP: https://peps.python.org/pep-0263/

The PEP provides this regular expression, which should be matched against each of the first two lines of the file:

^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

So I could use open(file_path, 'r', encoding='utf-8', errors='ignore') to read the first few lines of the file, scan them for that encoding declaration, and if one is found use it to read the rest of the file.
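
For example, against the first line of the file above (a quick illustrative check, not code from the repo):

import re

ENCODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

match = ENCODING_RE.match("# coding: iso-8859-5")
print(match.group(1))  # iso-8859-5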


simonw commented Jun 19, 2023

... and here's exactly that code in pyastgrep (via #17 (comment)):

https://github.com/spookylukey/pyastgrep/blob/5db475f8a2712984fb307e2da64c337563046a0b/src/pyastgrep/files.py#L34-L69

# See https://peps.python.org/pep-0263/
# I couldn't find a stdlib function for this
_ENCODING_RE = re.compile(b"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def get_encoding(python_file_bytes: bytes) -> str:
    # Search in first two lines:
    current_idx = 0
    for line_num in (1, 2):
        # what does a line break character look like
        # if we don't know the encoding? Have to assume '\n' for now
        linebreak_idx = python_file_bytes.find(b"\n", current_idx)
        if linebreak_idx < 0:
            line = python_file_bytes[current_idx:]
        else:
            line = python_file_bytes[current_idx:linebreak_idx]
        coding_match = _ENCODING_RE.match(line)
        if coding_match:
            return coding_match.groups()[0].decode("ascii")
        if linebreak_idx < 0:
            break
        else:
            current_idx = linebreak_idx + 1
    return "utf-8"


def parse_python_file(contents: bytes, filename: str | Path, *, auto_dedent: bool) -> tuple[str, ast.AST]:
    if auto_dedent:
        contents = auto_dedent_code(contents)

    parsed_ast: ast.AST = ast.parse(contents, str(filename))
    # ast.parse does its own encoding detection, which we have to replicate
    # here since we can't assume utf-8
    encoding = get_encoding(contents)
    str_contents = contents.decode(encoding)
    return str_contents, parsed_ast

And its tests: https://github.com/spookylukey/pyastgrep/blob/5db475f8a2712984fb307e2da64c337563046a0b/tests/test_encodings.py
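
Worth noting: the standard library's tokenize.detect_encoding() implements this same PEP 263 detection (plus UTF-8 BOM handling). A minimal illustration:

import io
import tokenize

raw = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(raw).readline)
print(encoding)  # iso-8859-5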


simonw commented Jun 19, 2023

I'm going to go with a slightly simpler implementation that opens in encoding="utf-8", errors="ignore" mode, reads the first 512 bytes (number plucked out of the air), splits on newlines to get the first two lines and runs that regex from the PEP against those lines.
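
Something along these lines (a sketch of what I just described, with an illustrative helper name, not the exact code that landed):

import re
from pathlib import Path

ENCODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def read_file(path: Path) -> str:
    # Read the start of the file lossily so non-UTF-8 bytes can't raise here,
    # then look for a PEP 263 coding declaration in the first two lines.
    with open(path, encoding="utf-8", errors="ignore") as fp:
        head = fp.read(512)
    encoding = "utf-8"
    for line in head.split("\n")[:2]:
        match = ENCODING_RE.match(line)
        if match:
            encoding = match.group(1)
            break
    # Re-read the whole file with whatever encoding was declared.
    return path.read_text(encoding)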

@spookylukey

@simonw The get_encoding code is a bit verbose here because I'm focussing both on correctness and performance - I don't want to end up processing the whole file just to get the encoding from the first two lines. So it uses lower-level idioms than it otherwise would.


simonw commented Jun 19, 2023

Grabbing that data for a test:

>>> import httpx
>>> httpx.get("https://raw.githubusercontent.com/ipython/ipython/2da6fb9870dffd838ba87b8e99ded17acd0d4edb/IPython/core/tests/nonascii.py").content
b'# coding: iso-8859-5\n# (Unlikely to be the default encoding for most testers.)\n# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'
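
A test along these lines could exercise it end to end (a sketch only - the cli import, test name and assertion are assumptions, not the test that was committed):

from click.testing import CliRunner
from symbex.cli import cli

def test_file_with_non_utf8_encoding(tmp_path):
    # Write the IPython fixture bytes fetched above into a temporary directory.
    (tmp_path / "nonascii.py").write_bytes(
        b'# coding: iso-8859-5\n'
        b'# (Unlikely to be the default encoding for most testers.)\n'
        b'# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef'
        b' <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'
    )
    result = CliRunner().invoke(cli, ["-s", "-d", str(tmp_path)])
    assert result.exit_code == 0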


simonw commented Jun 19, 2023

I'm basically using ChatGPT Code Interpreter as a code typing intern at this point. https://chat.openai.com/share/b062955d-3601-4051-b6d9-80cef9228233

simonw closed this as completed in 366760a Jun 19, 2023

simonw commented Jun 19, 2023

And with this change in place, the following command seems to just keep on running forever:

symbex -s -d ~/Dropbox/Development

Not sure if I have the patience to wait for it to finish.


simonw commented Jun 19, 2023

I ran it with time:

symbex -s -d ~/Dropbox/Development  110.60s user 24.30s system 88% cpu 2:32.92 total
grep '# File'  0.15s user 0.03s system 0% cpu 2:32.92 total
tee /tmp/done.txt  0.01s user 0.23s system 0% cpu 2:32.92 total

So 2m32s to finish.

And wc -l /tmp/done.txt reports 307,910 files were processed.

Activity Monitor showed the process topping out at about 230MB of RAM.
