UnicodeDecodeError: 'utf-8' codec can't decode byte ... #18
Line 91 in 4d2f158
The debugger showed me the file is this one:
Relevant PEP: https://peps.python.org/pep-0263/. The PEP provides this regular expression, which should match an encoding declaration in the first two lines:
So I could use
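As a quick sanity check (my own snippet, not from the thread), the PEP 263 pattern pulls the encoding name out of a typical Emacs-style declaration and ignores lines without one:

```python
import re

# The regular expression given in PEP 263
# (https://peps.python.org/pep-0263/)
coding_re = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

match = coding_re.match(b"# -*- coding: iso-8859-5 -*-")
assert match is not None
print(match.group(1).decode("ascii"))  # iso-8859-5

# A line with no declaration does not match
assert coding_re.match(b"print('hello')") is None
```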
... and here's exactly that code in pyastgrep:

```python
# See https://peps.python.org/pep-0263/
# I couldn't find a stdlib function for this
_ENCODING_RE = re.compile(b"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def get_encoding(python_file_bytes: bytes) -> str:
    # Search in first two lines:
    current_idx = 0
    for line_num in (1, 2):
        # what does a line break character look like
        # if we don't know the encoding? Have to assume '\n' for now
        linebreak_idx = python_file_bytes.find(b"\n", current_idx)
        if linebreak_idx < 0:
            line = python_file_bytes[current_idx:]
        else:
            line = python_file_bytes[current_idx:linebreak_idx]
        coding_match = _ENCODING_RE.match(line)
        if coding_match:
            return coding_match.groups()[0].decode("ascii")
        if linebreak_idx < 0:
            break
        else:
            current_idx = linebreak_idx + 1
    return "utf-8"


def parse_python_file(contents: bytes, filename: str | Path, *, auto_dedent: bool) -> tuple[str, ast.AST]:
    if auto_dedent:
        contents = auto_dedent_code(contents)
    parsed_ast: ast.AST = ast.parse(contents, str(filename))
    # ast.parse does its own encoding detection, which we have to replicate
    # here since we can't assume utf-8
    encoding = get_encoding(contents)
    str_contents = contents.decode(encoding)
    return str_contents, parsed_ast
```

And its tests: https://github.com/spookylukey/pyastgrep/blob/5db475f8a2712984fb307e2da64c337563046a0b/tests/test_encodings.py
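Worth noting: the comment above says there's no stdlib function for this, but the standard library's `tokenize.detect_encoding()` implements exactly this PEP 263 lookup, including UTF-8 BOM handling. A minimal sketch:

```python
import io
import tokenize

source = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'

# detect_encoding() reads at most two lines via the readline callable
# and applies the PEP 263 rules (plus a UTF-8 BOM check)
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # iso-8859-5
```

There's also `tokenize.open(path)`, which opens a file in text mode using the detected encoding.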
I'm going to go with a slightly simpler implementation that opens in
@simonw The
Grabbing that data for a test:

```pycon
>>> import httpx
>>> httpx.get("https://raw.githubusercontent.com/ipython/ipython/2da6fb9870dffd838ba87b8e99ded17acd0d4edb/IPython/core/tests/nonascii.py").content
b'# coding: iso-8859-5\n# (Unlikely to be the default encoding for most testers.)\n# \xb1\xb6\xff\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef <- Cyrillic characters\nu = "\xae\xe2\xf0\xc4"\n'
```
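Those bytes reproduce the failure mode directly: decoding them as UTF-8 raises the exact error from the issue title, while the declared ISO-8859-5 encoding succeeds. (My own illustration, trimmed to the first and last lines of the file above.)

```python
data = b'# coding: iso-8859-5\nu = "\xae\xe2\xf0\xc4"\n'

# 0xae is not a valid leading byte in UTF-8, so this raises
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae ...
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# Decoding with the declared encoding works fine
text = data.decode("iso-8859-5")
print(text.splitlines()[0])  # # coding: iso-8859-5
```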
I'm basically using ChatGPT Code Interpreter as a code-typing intern at this point. https://chat.openai.com/share/b062955d-3601-4051-b6d9-80cef9228233
And with this change in place, the following command seems to just keep on running forever:

```shell
symbex -s -d ~/Dropbox/Development
```

Not sure if I have the patience to wait for it to finish.
I ran it with
So 2m32s to finish. And Activity Monitor showed the process topping out at about 230MB of RAM.
Following on from:
I ran this in a huge directory:
And eventually got this: