Fix encoding used to read our own log files #2390

AndreLouisCaron · 2022-04-01T21:57:14Z

When running Tox on a French Canadian Windows computer under an
account with a username that contains diacritics, Tox crashes with a
UnicodeDecodeError because it tries to read its log files with a
hard-coded encoding of UTF-8. These log files contain output from
python -m virtualenv and python -m pip, whose output contains
references to the username (such as the contents of the APPDATA
environment variable). The problem is that the subprocesses which
open log files without any explicit encoding. In that case, the
built-in open() uses locale.getpreferredencoding(False), which is
"cp1252". A username containing an "Latin Small Letter E with
Acute" will be encoded as \xe9, which is not valid UTF-8 (the valid
UTF-8 sequence is \xc3\xa9).

One workaround is to run Tox with Python in "UTF-8 Mode". This can be
achieved by setting the PYTHONUTF8=1 environment variable or by
calling with python -Xutf8 -m tox. Unfortunately, this is not in
the Tox documentation and can be quite confusing. I spent more than
two hours troubleshooting this (I develop with Python full time and
I'd never heard of the "UTF-8 mode" before).

Of course, we could document the workaround, but it just seems good to
re-open the log files using the same encoding they were opened with in
the first place.

With this change, Tox works nicely with or without the "UTF-8 Mode".

Thanks for contributing a pull request!

If you are contributing for the first time or provide a trivial fix don't worry too
much about the checklist - we will help you get started.

Contribution checklist:

(also see CONTRIBUTING.rst for details)

wrote descriptive pull request text
added/updated test(s)
updated/extended the documentation
added relevant issue keyword
in message body
added news fragment in changelog folder
- fragment name: <issue number>.<type>.rst for example (588.bugfix.rst)
- <type> is must be one of bugfix, feature, deprecation, breaking, doc, misc
- if PR has no issue: consider creating one first or change it to the PR number after creating the PR
- "sign" fragment with -- by :user:`<your username>`.
- please, use full sentences with correct case and punctuation, for example:
```
Fixed an issue with non-ascii contents in doctest text files -- by :user:`superuser`.
```
- also see examples
added yourself to CONTRIBUTORS (preserving alphabetical order)

When running Tox on a French Canadian Windows computer under an account with a username that contains diacritics, Tox crashes with a `UnicodeDecodeError` because it tries to read its log files with a hard-coded encoding of `UTF-8`. These log files contain output from `python -m virtualenv` and `python -m pip`, whose output contains references to the username (such as the contents of the ``APPDATA`` environment variable). The problem is that the subprocesses which open log files without any explicit encoding. In that case, the built-in `open()` uses `locale.getpreferredencoding(False)`, which is `"cp1252"`. A username containing an "Latin Small Letter E with Acute" will be encoded as `\xe9`, which is not valid UTF-8 (the valid UTF-8 sequence is `\xc3\xa9`). One workaround is to run Tox with Python in "UTF-8 Mode". This can be achieved by setting the `PYTHONUTF8=1` environment variable or by calling with `python -Xutf8 -m tox`. Unfortunately, this is not in the Tox documentation and can be quite confusing. I spent more than two hours troubleshooting this (I develop with Python full time and I'd never heard of the "UTF-8 mode" before). Of course, we could document the workaround, but it just seems good to re-open the log files using the same encoding they were opened with in the first place. With this change, Tox works nicely with or without the "UTF-8 Mode".

asottile · 2022-04-01T21:59:40Z

this'll break for tools which write (correctly) UTF-8 output -- the real fix should be to treat the output as opaque bytes and pass it through unaltered without decoding it

AndreLouisCaron · 2022-04-01T22:02:57Z

This fixes issues #1550. I was just bit by this today when setting up a new Windows workstation. I spent more than two hours trying to figure out why Tox was working on my old workstation, but failing on the new one (despite using the same versions of Python and Tox on both).

It seems like the workaround is to enable the "UTF-8" mode in Python, but that's undocumented and potentially dangerous. It's not safe to assume that all test run will behave correctly when using this (there's a reason it's not on by default).

The fix seems simple: re-open the files using the same encoding they were encoded to. It also seems innocuous because locale.getpreferredencoding(False) will return "utf-8" on all systems where the code was already working.

AndreLouisCaron · 2022-04-01T22:05:48Z

@asottile

this'll break for tools which write (correctly) UTF-8 output -- the real fix should be to treat the output as opaque bytes and pass it through unaltered without decoding it

I'm not sure I understand. This encoding was implicit before #1237. Tox was running with this encoding for years without complains (because it's the "right thing to do on Windows", even though Microsoft's choice is debatable). I'm just reverting back to the original encoding.

asottile · 2022-04-01T22:09:10Z

@asottile

this'll break for tools which write (correctly) UTF-8 output -- the real fix should be to treat the output as opaque bytes and pass it through unaltered without decoding it

I'm not sure I understand. This encoding was implicit before #1237. Tox was running with this encoding for years without complains (because it's the "right thing to do on Windows", even though Microsoft's choice is debatable). I'm just reverting back to the original encoding.

tox "just runs tools" -- those tools can produce output in whatever encoding they want (they might not even be python!) and tox shouldn't attempt to decode the output at all (because there's no agreed upon output encoding) -- neither the previous behaviour or the current behaviour is the correct solution (though I'd argue choosing UTF-8 is usually better than falling back to cp1252 or US-ASCII, even on windows)

AndreLouisCaron · 2022-04-02T05:09:07Z

tox "just runs tools"

Would you agree that testing Python code is the primary use case for Tox and the tox.ini below is a pretty standard use case? All it does is run virtualenv, pip and pytest.

[tox]
skipsdist = True
envlist = py39

[testenv]
deps = pytest
commands =
  pytest --help

Yet, running Tox with this tox.ini on my machine fails to create the virtualenv and produces this stack trace:

C:\dev\tox-bug> py -3.9 -m tox
py39 create: C:\dev\tox-bug\.tox\py39
__________________________________________________________________________________________________________________ summary __________________________________________________________________________________________________________________
  py39: commands succeeded
  congratulations :)
Traceback (most recent call last):
  File "C:\Software\Python\3.9.12-64\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Software\Python\3.9.12-64\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\__main__.py", line 4, in <module>
    tox.cmdline()
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\session\__init__.py", line 44, in cmdline
    main(args)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\session\__init__.py", line 69, in main
    exit_code = session.runcommand()
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\session\__init__.py", line 197, in runcommand
    return self.subcommand_test()
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\session\__init__.py", line 225, in subcommand_test
    run_sequential(self.config, self.venv_dict)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\session\commands\run\sequential.py", line 9, in run_sequential
    if venv.setupenv():
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\venv.py", line 633, in setupenv
    status = self.update(action=action)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\venv.py", line 273, in update
    self.hook.tox_testenv_create(action=action, venv=self)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\pluggy\_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
    return outcome.get_result()
  File "C:\Software\Python\3.9.12-64\lib\site-packages\pluggy\_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "C:\Software\Python\3.9.12-64\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\venv.py", line 727, in tox_testenv_create
    venv._pcall(
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\venv.py", line 603, in _pcall
    return action.popen(
  File "C:\Software\Python\3.9.12-64\lib\site-packages\tox\action.py", line 128, in popen
    lines = out_path.read_text("UTF-8").split("\n")
  File "C:\Software\Python\3.9.12-64\lib\site-packages\py\_path\common.py", line 171, in read_text
    return f.read()
  File "C:\Software\Python\3.9.12-64\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 472: invalid continuation byte

those tools can produce output in whatever encoding they want ... (because there's no agreed upon output encoding)

But there is an agreed upon output encoding for command-line tools on Windows. It's the current code page, which Python conveniently exposes as locale.getpreferredencoding(False). That's what all Python programs use by default, with good reason. If they didn't, the output would display as mojibake in my terminal.

Also, see the documentation for the open() built-in. You can also see this in action with logging. Consider this minimal repro of what Tox is doing:

# Virtualenv and pip do this (IIUC, Tox actually opens the file on their behalf to redirect the output to the file)
import os, logging
logging.basicConfig(filename='repro.log')  # implicit encoding is `locale.getpreferredlocale(False)`, via `open()`.
logging.warning(os.environ['APPDATA'])
logging.shutdown()
# Then, Tox does this. On a French Canadian system with non-ASCII username, this line fails with `UnicodeDecodeError`, like Tox!
print(pathlib.Path('repro.log').read_text('utf-8'))

Of course, I can configure Tox to run misbehaving tools that output in encodings that don't match my system's locale / current code page. In practice, all command-line tools on my Windows machine write to standard output in the current code page. This is precisely what virtualenv and pip are doing.

they might not even be python!

Indeed. The only workaround I currently have to prevent Tox from crashing is to set PYTHONUTF8=1, but that only affects Python programs. Anything else I run from Tox will continue to output cp1252. I've used Tox to run CMake in the past (to build C++ extensions). If CMake also prints contents of APPDATA, how will I work around that?

the real fix should be to treat the output as opaque bytes and pass it through unaltered without decoding it

I guess I agree that this would be better. To be honest, I don't know why Tox tries to read its own log files. I assume it has good reason to do so.

In any case, I would like to get Tox working on my machine without having to refactor its core machinery, without pinning to some old version of Tox, and without having to set PYTHONUTF8=1, which seems dangerous (if it weren't, Python's new UTF-8 mode would be on by default).

I'm aware that my current change is currently failing the build. Some test cases in CI are deliberately writing UTF-8 data in terminals where locale.getpreferredencoding(False) returns ascii. Assuming I can make this change pass all CI jobs without glaring hacks, would you consider this submission?

AndreLouisCaron · 2022-04-02T06:05:19Z

After further investigation, I found another approach that you may consider less risky. See #2391.

asottile · 2022-04-02T07:55:52Z

like I said above, what I'd accept as a fix is handling this opaquely as bytes and not decoding at all -- anything else is really just a guess at the encoding which will fail in some situations (perhaps fewer, but it will still have issues)

forcing python's UTF-8 mode of subprocesses isn't a fix either -- it puts (potentially surprising) constraints on the system-under-test

and either way you'll need to write a test and a changelog entry for this to be considered and to demonstrate your change. note also that you're committing to the legacy implementation of tox and that it will soon be replaced with tox 4 which is a full rewrite (it's possible your problem is already fixed with the released alpha)

AndreLouisCaron · 2022-04-02T12:46:30Z

Just so we're clear, I totally agree on principle with everything you've said about how things should work. I was exploring other ideas with this sketch because I don't have the time/ressources to invest in refactoring some core internals of Tox, especially as a new contributor, and I was (naively) hoping we could find a quick fix by following Python & Windows conventions.

note also that you're committing to the legacy implementation of tox and that it will soon be replaced with tox 4 which is a full rewrite (it's possible your problem is already fixed with the released alpha)

If you'd said this first, I would have immediately dropped this whole issue. I won't waste my time trying to fix something that's about to become irrelavant. I hit this problem on a new computer that I'm only starting to set up. At this point, it's far easier for me to create a new account without the accent in my first name than it is to fix this issue.

AndreLouisCaron requested review from gaborbernat, asottile and obestwalter as code owners April 1, 2022 21:57

Only use current code page as a fallback

cd507d5

AndreLouisCaron mentioned this pull request Apr 2, 2022

Force virtualenv and pip to output UTF-8 #2391

Closed

6 tasks

AndreLouisCaron closed this Apr 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix encoding used to read our own log files #2390

Fix encoding used to read our own log files #2390

AndreLouisCaron commented Apr 1, 2022 •

edited

Loading

asottile commented Apr 1, 2022

AndreLouisCaron commented Apr 1, 2022

AndreLouisCaron commented Apr 1, 2022 •

edited

Loading

asottile commented Apr 1, 2022

AndreLouisCaron commented Apr 2, 2022

AndreLouisCaron commented Apr 2, 2022

asottile commented Apr 2, 2022

AndreLouisCaron commented Apr 2, 2022

Fix encoding used to read our own log files #2390

Fix encoding used to read our own log files #2390

Conversation

AndreLouisCaron commented Apr 1, 2022 • edited Loading

Thanks for contributing a pull request!

Contribution checklist:

asottile commented Apr 1, 2022

AndreLouisCaron commented Apr 1, 2022

AndreLouisCaron commented Apr 1, 2022 • edited Loading

asottile commented Apr 1, 2022

AndreLouisCaron commented Apr 2, 2022

AndreLouisCaron commented Apr 2, 2022

asottile commented Apr 2, 2022

AndreLouisCaron commented Apr 2, 2022

AndreLouisCaron commented Apr 1, 2022 •

edited

Loading

AndreLouisCaron commented Apr 1, 2022 •

edited

Loading