Ensure UTF-8 encoding when communicating with a subprocess #8563

privet-kitty · 2023-10-21T06:22:39Z

Pull Request Check List

Resolves: #8550

Added tests for changed code.
Updated documentation for changed code.

Manual tests

I checked that the added test test_run_python_script_non_ascii_input failed on windows instances before fixing base_env: https://github.com/privet-kitty/poetry/actions/runs/6591942649
- This is because the default encoding of sys.stdin is not utf-8 but cp1252 on a Windows instance (when piped) . So I didn't need to prepare a special non-utf-8 environment for testing. If it is better to set up the test environment explicitly, the only way I know is invoking a PowerShell command Set-WinSystemLocale <some locale> before testing in workflows.
Unfortunately I didn't come up with a good and simple test case that tests the reported error itself, which is about GET_SYS_TAGS. (I feel mocking this string is much the same as test_run_python_script_non_ascii_input.) As a one-time testing, I modified the master branch to use a non-ascii working directory in CI, saw some test cases failed, and it succeeded after merging this change.

radoering · 2023-10-21T15:04:57Z

For my understanding: This fix is only relevant if input is passed to subprocess.run() so that the calls of subprocess.check_output() in base_env.py and env_manager.py with text=True need not be touched?

dimbleby · 2023-10-21T15:39:25Z

I'm not entirely sure that's true: it seems to me this should also be relevant when dealing with output produced by the child process.

IIUC: by not specifing encoding="utf-8" in more places this merge request may introduce problems where the child python process now is producing UTF8 but the parent is not necessarily expecting that. (Previously this would have worked so long as the parent and child agreed on the encoding, which might or might not have been UTF8).

privet-kitty · 2023-10-21T17:09:37Z

I'm not sure either about any unexpected regressions due to this change, but if you guys are really concerned about it, setting encoding not in _run but in run_python_script would be an idea, which is a minimal change and less likely to break other things.

dimbleby · 2023-10-21T17:20:21Z

There are examples in env_manager.py where a python subprocess is called not via run_python_script(),

My first guess is that encoding="utf-8" should be added more or less anywhere that text=True is currently set.

But I'm not sure I yet understand what is going on to make this necessary: why don't the parent and child pythons already agree on the encoding, regardless of whether it is utf8?

privet-kitty · 2023-10-21T20:13:48Z

There are examples in env_manager.py where a python subprocess is called not via run_python_script(),
My first guess is that encoding="utf-8" should be added more or less anywhere that text=True is currently set.

OK. I just guessed that tweaking run_python_script doesn't affect the other type of calling python, because in that case "bytes-level" processing is hidden in the internal implementation of this str-to-str function, but trying fixing other things wouldn't be a bad idea.

But I'm not sure I yet understand what is going on to make this necessary: why don't the parent and child pythons already agree on the encoding, regardless of whether it is utf8?

I'm not an expert of python or character code, but one thing I know is the encoding of source code is a different thing in Python, because Python assumes UTF-8 regardless of the locale:

If no encoding declaration is found, the default encoding is UTF-8. (from https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations)

Let's see how the following script works on a windows instance with the Japanese locale. You can reproduce it on GitHub Actions: https://github.com/privet-kitty/ci-encoding/actions/runs/6599347808.

import subprocess
import sys
import locale
from typing import Optional


def run(
    no: int,
    encoding: Optional[str],
    input: str,
    options: list[str] = [],
) -> None:
    print(no)
    cmd = [sys.executable, "-I"] + options + ["-"]
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            input=input,
            check=True,
            text=True,
            encoding=encoding,
        )
        print("stdout:")
        print(proc.stdout)
    except Exception:
        print("exc_info:")
        print(sys.exc_info()[1])


print(f"{locale.getencoding()=}")
run(1, None, 'print("a")')  # OK
run(2, None, 'print("中")')  # NG
run(3, "utf-8", 'print("中")')  # NG
run(4, "utf-8", 'print("中")', options=["-X", "utf8"])  # OK

In my environment, the output is as follows. (This is much the same as that of the CI above, but the output order is a bit different, probably because an exception is raised in another thread.)

locale.getencoding()='cp932'
1
stdout:
a

2
SyntaxError: Non-UTF-8 code starting with '\x92' in file <stdin> on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details
exc_info:
Command '['C:\\Users\\hugov\\.pyenv\\pyenv-win\\versions\\3.11.1\\python.exe', '-I', '-']' returned non-zero exit status 1.
3
Exception in thread Thread-3 (_readerthread):
Traceback (most recent call last):
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\hugov\.pyenv\pyenv-win\versions\3.11.1\Lib\subprocess.py", line 1552, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
stdout:
None
4
stdout:
中

My interpretation:

case 2:
The preferred encoding of the parent, which is based on the locale, is cp932. So the subprocess encodes 中 in cp932 and passes it to the child Python, but the child tries to decode it in utf-8 because of the rule above and fails.

case 3:
The subprocess encodes "中" in utf-8 according to the encoding and passes it to the child Python. The child decodes it in utf-8 because of the rule above and outputs 中 in cp932 based on the locale, but of course the parent tries to decode it in utf-8 and fails.

case 4:
The subprocess encodes "中" in utf-8 according to the encoding and passes it to the child Python. The child decodes it in utf-8 because of the rule above and outputs 中 in utf-8 because it is in utf-8 mode, and the parent successfully decodes it in utf-8. Congrats.

dimbleby · 2023-10-21T20:22:11Z

so perhaps a simpler / less invasive fix is to put a PEP-263-style explicit encoding at the top of GET_SYS_TAGS, getting the value from locale.getpreferredencoding(False)?

then both parent and child python can use the same locale encoding, which might or might not be utf8 but it doesn't matter because they will again agree.

so far as I can see other python scripts executed by poetry only use ascii characters - this one is the odd one out because it includes the value of __packaging__.file.

privet-kitty · 2023-10-21T21:04:51Z

so perhaps a simpler / less invasive fix is to put a PEP-263-style explicit encoding at the top of GET_SYS_TAGS, getting the value from locale.getpreferredencoding(False)?

Possibly. Yet another thing I'm a little worried about is the -I option for calling the child Python. It also ignores the env vars like PYTHONUTF8 or PYTHONIOENCODING which affect the locale. If the parent is affected by them while the child isn't, what the parent assumes may be different from what the child outputs.

dimbleby · 2023-10-21T21:12:57Z

I wouldn't worry about that. Anyone who is relying on such environment variables to influence the behaviour of poetry can simply be told not to.

privet-kitty · 2023-10-22T09:08:59Z

OK. I added the encoding declaration and tried the same non-ascii working directory test but in vain. Again the GET_SYS_TAGS script doesn't work. For example, test_installer_with_pypi_repository test fails due to JSONDecodeError because get_supported_tags gets an empty string '' from run_python_script.

I can observe this phenomenon locally. Python's PEP-263 behavior seems to be weird, when - option is used and non-utf-8 code is declared:

# PowerShell console
> echo "print(1)" | python -  
1
> echo "# encoding: utf-8`nprint(1)" | python -
1
> echo "# encoding: cp1250`nprint(1)" | python -  # no output
> echo "# encoding: ascii`nprint(1)" | python -  # no output
> echo "# encoding: ascii`nassert 1 == 0" | python -  # no output (and exit code is zero)
> echo "# encoding: cp1250`nprint(1)" > source.py
> python source.py
1

So far I've failed to find any document that can explain this behavior.

dimbleby · 2023-10-22T12:16:07Z

I reckon running python scripts as python -c script, instead of as python - and then using stdin, should sort this out then.

#8565

radoering · 2023-10-28T15:04:21Z

Is this superseded by #8565 and can be closed?

privet-kitty · 2023-10-28T15:04:57Z

@radoering yes, thanks.

github-actions · 2024-03-03T19:15:09Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

privet-kitty added 2 commits October 22, 2023 21:47

Add test for non-ascii code

d35dd4e

Force run_python_script and the child python to use UTF-8

2dd04cf

privet-kitty force-pushed the force-python-to-use-utf8 branch from 4b69108 to 2dd04cf Compare October 22, 2023 12:51

privet-kitty closed this Oct 28, 2023

github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure UTF-8 encoding when communicating with a subprocess #8563

Ensure UTF-8 encoding when communicating with a subprocess #8563

privet-kitty commented Oct 21, 2023

radoering commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023 •

edited

Loading

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 22, 2023

dimbleby commented Oct 22, 2023

radoering commented Oct 28, 2023

privet-kitty commented Oct 28, 2023

github-actions bot commented Mar 3, 2024

Ensure UTF-8 encoding when communicating with a subprocess #8563

Ensure UTF-8 encoding when communicating with a subprocess #8563

Conversation

privet-kitty commented Oct 21, 2023

Pull Request Check List

Manual tests

radoering commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023 • edited Loading

privet-kitty commented Oct 21, 2023

dimbleby commented Oct 21, 2023

privet-kitty commented Oct 22, 2023

dimbleby commented Oct 22, 2023

radoering commented Oct 28, 2023

privet-kitty commented Oct 28, 2023

github-actions bot commented Mar 3, 2024

dimbleby commented Oct 21, 2023 •

edited

Loading