Fix encoding used to read our own log files #2390
When running Tox on a French Canadian Windows computer under an account with a username that contains diacritics, Tox crashes with a `UnicodeDecodeError` because it tries to read its log files with a hard-coded encoding of `UTF-8`. These log files contain output from `python -m virtualenv` and `python -m pip`, whose output contains references to the username (such as the contents of the `APPDATA` environment variable). The problem is that the subprocesses open log files without any explicit encoding. In that case, the built-in `open()` uses `locale.getpreferredencoding(False)`, which is `"cp1252"` on such a system. A username containing a "Latin Small Letter E with Acute" will be encoded as `\xe9`, which is not valid UTF-8 (the valid UTF-8 sequence is `\xc3\xa9`).

One workaround is to run Tox with Python in "UTF-8 Mode". This can be achieved by setting the `PYTHONUTF8=1` environment variable or by calling Tox with `python -Xutf8 -m tox`. Unfortunately, this is not in the Tox documentation and can be quite confusing. I spent more than two hours troubleshooting this (I develop with Python full time and I'd never heard of the "UTF-8 mode" before).

Of course, we could document the workaround, but it seems better to re-open the log files using the same encoding they were opened with in the first place.

With this change, Tox works nicely with or without the "UTF-8 Mode".
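The byte-level mismatch described above can be reproduced without Tox or Windows. This is a minimal sketch; the helper name `try_decode_utf8` is made up for illustration and is not part of Tox.

```python
# LATIN SMALL LETTER E WITH ACUTE, written as an escape so the example
# does not depend on the encoding of this source file.
e_acute = "\u00e9"

# What ends up in a log file opened with the cp1252 default.
cp1252_bytes = e_acute.encode("cp1252")
# What a reader that assumes UTF-8 expects for the same character.
utf8_bytes = e_acute.encode("utf-8")


def try_decode_utf8(data: bytes) -> str:
    """Decode as UTF-8; on failure, return a short description instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


print(cp1252_bytes)                   # the single byte written under cp1252
print(utf8_bytes)                     # the two-byte UTF-8 sequence
print(try_decode_utf8(cp1252_bytes))  # decoding the cp1252 byte as UTF-8 fails
```

Decoding the same byte with `"cp1252"` round-trips to the original character, which is why reading the log with the encoding it was written in avoids the crash.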
this'll break for tools which write (correctly) UTF-8 output -- the real fix should be to treat the output as opaque bytes and pass it through unaltered without decoding it
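A rough sketch of what "opaque bytes" handling could look like, assuming the log only needs to be replayed to the terminal. The function name `replay_log` and its signature are hypothetical; this is not how Tox is actually structured.

```python
import sys
from pathlib import Path


def replay_log(path: Path, out=None) -> int:
    """Copy a log file to a binary stream without ever decoding it.

    `out` defaults to the raw byte stream behind sys.stdout, so whatever
    encoding the tool used is passed through unaltered.
    Returns the number of bytes copied.
    """
    if out is None:
        out = sys.stdout.buffer
    data = path.read_bytes()  # no encoding involved, so no UnicodeDecodeError
    out.write(data)
    out.flush()
    return len(data)
```

Because nothing is decoded, a log that mixes cp1252 and UTF-8 output is replayed byte for byte; how the terminal renders those bytes is then outside the tool's control.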
This fixes issue #1550. I was just bitten by this today when setting up a new Windows workstation. I spent more than two hours trying to figure out why Tox was working on my old workstation, but failing on the new one (despite using the same versions of Python and Tox on both). It seems like the workaround is to enable the "UTF-8" mode in Python, but that's undocumented and potentially dangerous. It's not safe to assume that all test runs will behave correctly when using this (there's a reason it's not on by default). The fix seems simple: re-open the files using the same encoding they were encoded to. It also seems innocuous because
I'm not sure I understand. This encoding was implicit before #1237. Tox was running with this encoding for years without complaints (because it's the "right thing to do on Windows", even though Microsoft's choice is debatable). I'm just reverting to the original encoding.
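For illustration only, reading a log back with the implicit encoding could look roughly like this. It is a sketch of the idea, not the actual diff in this pull request, and the function name is hypothetical.

```python
import locale
from pathlib import Path


def read_log_with_locale_encoding(path: Path) -> str:
    """Read a text file using the locale's preferred encoding.

    This is the encoding the comment above calls "implicit": the one
    open() falls back to when no encoding is given. In Python's UTF-8
    mode it reports UTF-8.
    """
    encoding = locale.getpreferredencoding(False)
    return path.read_text(encoding=encoding)
```

The caveat raised earlier in the thread still applies: a tool that writes UTF-8 regardless of the locale would be mis-decoded by this approach.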
tox "just runs tools" -- those tools can produce output in whatever encoding they want (they might not even be
Would you agree that testing Python code is the primary use case for Tox and the
[tox]
skipsdist = True
envlist = py39
[testenv]
deps = pytest
commands =
    pytest --help
Yet, running Tox with this
But there is an agreed-upon output encoding for command-line tools on Windows. It's the current code page, which Python conveniently exposes as
Also, see the documentation for the
# Virtualenv and pip do this (IIUC, Tox actually opens the file on their behalf to redirect the output to the file)
import logging
import os
import pathlib
logging.basicConfig(filename='repro.log')  # implicit encoding is `locale.getpreferredencoding(False)`, via `open()`.
logging.warning(os.environ['APPDATA'])
logging.shutdown()
# Then, Tox does this. On a French Canadian system with a non-ASCII username, this line fails with `UnicodeDecodeError`, like Tox!
print(pathlib.Path('repro.log').read_text('utf-8'))
Of course, I can configure Tox to run misbehaving tools that output in encodings that don't match my system's locale / current code page. In practice, all command-line tools on my Windows machine write to standard output in the current code page. This is precisely what
Indeed. The only workaround I currently have to prevent Tox from crashing is to set
I guess I agree that this would be better. To be honest, I don't know why Tox tries to read its own log files. I assume it has a good reason to do so. In any case, I would like to get Tox working on my machine without having to refactor its core machinery, without pinning to some old version of Tox, and without having to set
I'm aware that my change is currently failing the build. Some test cases in CI are deliberately writing UTF-8 data in terminals where
After further investigation, I found another approach that you may consider less risky. See #2391.
like I said above, what I'd accept as a fix is handling this opaquely, as forcing python's UTF-8 mode of subprocesses isn't a fix either -- it puts (potentially surprising) constraints on the system-under-test. either way you'll need to write a test and a changelog entry for this to be considered and to demonstrate your change. note also that you're committing to the legacy implementation of tox and that it will soon be replaced with tox 4, which is a full rewrite (it's possible your problem is already fixed with the released alpha)
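For readers unfamiliar with it, "forcing python's UTF-8 mode of subprocesses" amounts to something like the sketch below. It only illustrates the mechanism via the `PYTHONUTF8` environment variable; the helper `run_python_in_utf8_mode` is hypothetical and is not taken from #2391.

```python
import os
import subprocess
import sys


def run_python_in_utf8_mode(args):
    """Run a child Python interpreter with UTF-8 mode switched on."""
    # Copy the parent's environment and add PYTHONUTF8=1 for the child only.
    env = dict(os.environ, PYTHONUTF8="1")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        check=True,
    )


# Ask the child whether UTF-8 mode is active.
result = run_python_in_utf8_mode(["-c", "import sys; print(sys.flags.utf8_mode)"])
print(result.stdout.decode("ascii").strip())
```

This is exactly the constraint objected to above: the child's default text encoding changes whether or not the code under test expects it.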
Just so we're clear, I totally agree in principle with everything you've said about how things should work. I was exploring other ideas with this sketch because I don't have the time/resources to invest in refactoring some core internals of Tox, especially as a new contributor, and I was (naively) hoping we could find a quick fix by following Python & Windows conventions.
If you'd said this first, I would have immediately dropped this whole issue. I won't waste my time trying to fix something that's about to become irrelevant. I hit this problem on a new computer that I'm only starting to set up. At this point, it's far easier for me to create a new account without the accent in my first name than it is to fix this issue.
Fixes #1550.