Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure UTF-8 encoding when communicating with a subprocess #8563

Closed

Conversation

privet-kitty
Copy link

Pull Request Check List

Resolves: #8550

  • Added tests for changed code.
  • Updated documentation for changed code.

Manual tests

  • I checked that the added test test_run_python_script_non_ascii_input failed on windows instances before fixing base_env: https://github.com/privet-kitty/poetry/actions/runs/6591942649
    • This is because the default encoding of sys.stdin is not utf-8 but cp1252 on a Windows instance (when piped) . So I didn't need to prepare a special non-utf-8 environment for testing. If it is better to set up the test environment explicitly, the only way I know is invoking a PowerShell command Set-WinSystemLocale <some locale> before testing in workflows.
  • Unfortunately I didn't come up with a good and simple test case that tests the reported error itself, which is about GET_SYS_TAGS. (I feel mocking this string is much the same as test_run_python_script_non_ascii_input.) As a one-time testing, I modified the master branch to use a non-ascii working directory in CI, saw some test cases failed, and it succeeded after merging this change.

@radoering
Copy link
Member

For my understanding: This fix is only relevant if input is passed to subprocess.run() so that the calls of subprocess.check_output() in base_env.py and env_manager.py with text=True need not be touched?

@dimbleby
Copy link
Contributor

I'm not entirely sure that's true: it seems to me this should also be relevant when dealing with output produced by the child process.

IIUC: by not specifing encoding="utf-8" in more places this merge request may introduce problems where the child python process now is producing UTF8 but the parent is not necessarily expecting that. (Previously this would have worked so long as the parent and child agreed on the encoding, which might or might not have been UTF8).

@privet-kitty
Copy link
Author

I'm not sure either about any unexpected regressions due to this change, but if you guys are really concerned about it, setting encoding not in _run but in run_python_script would be an idea, which is a minimal change and less likely to break other things.

@dimbleby
Copy link
Contributor

There are examples in env_manager.py where a python subprocess is called not via run_python_script(),

My first guess is that encoding="utf-8" should be added more or less anywhere that text=True is currently set.

But I'm not sure I yet understand what is going on to make this necessary: why don't the parent and child pythons already agree on the encoding, regardless of whether it is utf8?

@privet-kitty
Copy link
Author

There are examples in env_manager.py where a python subprocess is called not via run_python_script(),
My first guess is that encoding="utf-8" should be added more or less anywhere that text=True is currently set.

OK. I just guessed that tweaking run_python_script doesn't affect the other type of calling python, because in that case "bytes-level" processing is hidden in the internal implementation of this str-to-str function, but trying fixing other things wouldn't be a bad idea.

But I'm not sure I yet understand what is going on to make this necessary: why don't the parent and child pythons already agree on the encoding, regardless of whether it is utf8?

I'm not an expert of python or character code, but one thing I know is the encoding of source code is a different thing in Python, because Python assumes UTF-8 regardless of the locale:

If no encoding declaration is found, the default encoding is UTF-8. (from https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations)

Let's see how the following script works on a windows instance with the Japanese locale. You can reproduce it on GitHub Actions: https://github.com/privet-kitty/ci-encoding/actions/runs/6599347808.

import subprocess
import sys
import locale
from typing import Optional


def run(
    no: int,
    encoding: Optional[str],
    input: str,
    options: list[str] = [],
) -> None:
    print(no)
    cmd = [sys.executable, "-I"] + options + ["-"]
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            input=input,
            check=True,
            text=True,
            encoding=encoding,
        )
        print("stdout:")
        print(proc.stdout)
    except Exception:
        print("exc_info:")
        print(sys.exc_info()[1])


print(f"{locale.getencoding()=}")
run(1, None, 'print("a")')  # OK
run(2, None, 'print("中")')  # NG
run(3, "utf-8", 'print("中")')  # NG
run(4, "utf-8", 'print("中")', options=["-X", "utf8"])  # OK

In my environment, the output is as follows. (This is much the same as that of the CI above, but the output order is a bit different, probably because an exception is raised in another thread.)

locale.getencoding()='cp932'
1
stdout:
a

2
SyntaxError: Non-UTF-8 code starting with '\x92' in file <stdin> on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details
exc_info:
Command '['C:\\Users\\hugov\\.pyenv\\pyenv-win\\versions\\3.11.1\\python.exe', '-I', '-']' returned non-zero exit status 1.
3
Exception in thread Thread-3 (_readerthread):
Traceback (most recent call last):
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\subprocess.py", line 1552, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
stdout:
None
4
stdout:
中

My interpretation:

case 2:
The preferred encoding of the parent, which is based on the locale, is cp932. So the subprocess encodes in cp932 and passes it to the child Python, but the child tries to decode it in utf-8 because of the rule above and fails.

case 3:
The subprocess encodes "中" in utf-8 according to the encoding and passes it to the child Python. The child decodes it in utf-8 because of the rule above and outputs in cp932 based on the locale, but of course the parent tries to decode it in utf-8 and fails.

case 4:
The subprocess encodes "中" in utf-8 according to the encoding and passes it to the child Python. The child decodes it in utf-8 because of the rule above and outputs in utf-8 because it is in utf-8 mode, and the parent successfully decodes it in utf-8. Congrats.

@dimbleby
Copy link
Contributor

dimbleby commented Oct 21, 2023

so perhaps a simpler / less invasive fix is to put a PEP-263-style explicit encoding at the top of GET_SYS_TAGS, getting the value from locale.getpreferredencoding(False)?

then both parent and child python can use the same locale encoding, which might or might not be utf8 but it doesn't matter because they will again agree.

so far as I can see other python scripts executed by poetry only use ascii characters - this one is the odd one out because it includes the value of __packaging__.file.

@privet-kitty
Copy link
Author

so perhaps a simpler / less invasive fix is to put a PEP-263-style explicit encoding at the top of GET_SYS_TAGS, getting the value from locale.getpreferredencoding(False)?

Possibly. Yet another thing I'm a little worried about is the -I option for calling the child Python. It also ignores the env vars like PYTHONUTF8 or PYTHONIOENCODING which affect the locale. If the parent is affected by them while the child isn't, what the parent assumes may be different from what the child outputs.

@dimbleby
Copy link
Contributor

I wouldn't worry about that. Anyone who is relying on such environment variables to influence the behaviour of poetry can simply be told not to.

@privet-kitty
Copy link
Author

OK. I added the encoding declaration and tried the same non-ascii working directory test but in vain. Again the GET_SYS_TAGS script doesn't work. For example, test_installer_with_pypi_repository test fails due to JSONDecodeError because get_supported_tags gets an empty string '' from run_python_script.

I can observe this phenomenon locally. Python's PEP-263 behavior seems to be weird, when - option is used and non-utf-8 code is declared:

# PowerShell console
> echo "print(1)" | python -  
1
> echo "# encoding: utf-8`nprint(1)" | python -
1
> echo "# encoding: cp1250`nprint(1)" | python -  # no output
> echo "# encoding: ascii`nprint(1)" | python -  # no output
> echo "# encoding: ascii`nassert 1 == 0" | python -  # no output (and exit code is zero)
> echo "# encoding: cp1250`nprint(1)" > source.py
> python source.py
1

So far I've failed to find any document that can explain this behavior.

@dimbleby
Copy link
Contributor

I reckon running python scripts as python -c script, instead of as python - and then using stdin, should sort this out then.

#8565

@radoering
Copy link
Member

Is this superseded by #8565 and can be closed?

@privet-kitty
Copy link
Author

@radoering yes, thanks.

Copy link

github-actions bot commented Mar 3, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 path-chars Encoding Issue during poetry install
3 participants