target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

aiuto · 2020-06-16T23:48:17Z

If we have a source file with a non Latin1 character in the file name (e.g. λ.file)

and it is referenced in a glob(), let's say to attribute srcs (of type label_list)
and you evaluate attr.srcs[n].label.name
the length of the string is 7 on linux and macos
but it is 6 on windows

Hilarity ensues while trying to figure out how to write these strings as part of an action.
Discovered while trying to fix #10174 More details to follow.

alandonovan · 2020-06-17T02:32:31Z

This seems like a dup of #374.

aiuto · 2020-06-17T02:44:41Z

It is related to it, but it concerns a small detail - that reading the same file name on Windows and Linux can result in different strings. FIxing #374 would most likely fix this, but we could fix this discrepancy on it's own without doing all the work needed for the other issue.

Thinking about it differently, 374 could possibly be a deep starlark change. We might be able to fix this problem at the the file system API layer.

alandonovan · 2020-06-17T14:00:37Z

The fix to #374 requires changing the decoding applied as soon as input is obtained (from read and readdir) and the encoding applied immediately before strings are output (by print, FileWriteAction, Forge RPCs, and a few other places). It also requires changing the JVM-wide file.encoding property. It took me a week to prepare a CL to implement as an immediate breaking change, but it would be a project to put it behind a flag.

aiuto · 2020-06-17T19:44:06Z

I'm not sure where you are going with putting it behind a flag, but is your point that this can't be fixed without fixing #374 as well? If so, yes, this is a duplicate.

alandonovan · 2020-06-19T14:35:57Z

the length of the string is 7 on linux and macos but it is 6 on windows

On second thoughts, I'm surprised about the platform variation. The only parts that vary by platform in #374 are the functions used within glob to read the directory. On Unix we end up in NativePosixFiles.readdir; on Windows we end up (I assume) in java.io.WinNTFileSystem.list. It would be good to know what you get from the lowest levels of the awfully messy glob machinery when you read a directory entry named λ.

Please do provide more details.

aiuto · 2020-06-19T16:42:45Z

I created what should be a repo here: https://github.com/tonyaiuto/bazel/tree/master/utf8_encode
bazel build :*
more bazel-bin/*.txt

But.... I do not have access to a windows machine, so I can not be certain of it. It should match the pattern where I found the problem trying to write regression tests for ctx.actions.write(), which was why I filed this issue. I expect that on linux we will see

bazel-bin/filename.txt

|AÂ©ä¸�ð��¿.file|
starlark length: 15
n_chars: 9
n_utf8_bytes: 15
# Expected: glob targets parsed as Latin1

and on windows we will see

|AÂ©ä¸�ð��¿.file|
starlark length: 9
n_chars: 9
n_utf8_bytes: 15
# Unexpected: glob targets parsed as UTF-8

…UILD and .bzl input parsing. Internally BUILD and .bzl files are read as uninterpreted raw byte strings and not as UTF-8, so we should write content back out the same way. For example, if a BUILD file contains: ``` ... notice = "Copyright © 2019 Acme LLC", ``` Bazel will store the copyright symbol as two distinct octets ([0xc2, 0xa9]). When writing that with an action, the same two octets should be emitted. The behavior fixed by this change is to not interpret each character as a standalone Unicode code point to be encoded as UTF-8. While this change is appropriate for strings read from BUILD files, there is still a discrepancy in handling file paths with non Latin1 characters on Windows file systems. That is described in #11602. While this change does produce an observable behavior change, I am considering it a bug fix rather than an incompatible change because the previous behavior was both wrong and also (as described in #11602) inconsistent across different OSes. Closes #10174 RELNOTES: Fix behavior of ctx.actions.write so content is written without an incorrect encoding to UTF-8. See #10174 for details. PiperOrigin-RevId: 318056290

alandonovan · 2020-08-18T21:06:15Z

After further investigation:

On Windows, Bazel's functions for reading directory entries (e.g. JavaIoFileSystem.getDirectoryEntries) trivially decodes the UTF-16 file names correctly to Java strings, which is inconsistent with the POSIX behavior in which we treat directory entries and file contents as Latin1. If such strings are later used as file names, all is well, because the reverse trivial encoding (UTF16->UTF-16) applies. Starlark code can however observe that the string elements are UTF-16 codes, not UTF-8 codes, and the lengths differ. By contrast, if such strings are later used as file contents, they will be incorrect, because the UTF-16 codes will be treated as UTF-8 codes and the lower-8 bits written out as bytes. Non-ASCII will become garbage.

So, we have two choices. The first is to make getDirectoryEntries on Windows misinterpret file names just as it does in POSIX. I started this, but I no longer thing it is viable: we would need to do this in every place in the file system API that accepts one of Bazel's mis-encoded strings, otherwise, for example, glob() won't be able to see files with non-ASCII names because the mangled filenames it gets from the directory can't be statted. There are a large number of such places. (This is easy for POSIX, and already done, twice over in fact: once in the native file system wrappers by making explicit calls to JNI Latin1 functions instead of UTF8; and once in the JavaIoFileSystem when used on POSIX because of Java's existing -Dfile.encoding=latin1 mechanism. This mechanism doesn't exist on Windows because it isn't needed---filenames aren't bytes.) So this is a lot of work to introduce bugs to make Windows as bad as POSIX.

The other choice is to fix the behavior on POSIX. That is issue #374. We know how to do it. It's an observable change to Starlark code and external compilers, and could thus be considered a breaking change, but I will argue that if a Starlark program or external compiler relies on anything more complex than that non-ASCII string literals and file names transit through Blaze, it is probably already subtly broken in many ways.

I am not going to waste any more time on encoding issues unless it is a complete fix to #374.

github-actions · 2024-05-19T01:31:19Z

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.

aiuto · 2024-05-19T04:11:54Z

We need a way to turn issues into archive documents for the person who eventually wants to fix the unicode character handline

fmeum · 2024-08-29T17:25:51Z

@tetromino Is the plan to switch to representing strings as UTF-8 internally still alive? This seems separate from the much more complicated to solve #374, so maybe there is a way to get rid of the Latin-1 hack without also having to think about an escaping scheme for labels (#23335 (comment)).

I started to work on a PR that maps between the internal Latin-1 representation and the external world on the filesystem level, similar to what @alandonovan described above: #23443
Such a change would not preclude any later changes in this area, it would just fix issues affecting Windows users at the price of slightly higher complexity in JavaIoFileSystem. If this sounds acceptable, I could finish it and send it out for review.

CC @meteorcloudy @tjgq for their input as they have been reviewing most of my Unicode fixes so far.

aiuto · 2024-08-29T19:25:49Z

IIRC, we do represent strings as UTF-8 internally, we just call everything Latin-1.
That is, we read your BUILD file (or the results of readdir(1)) and get back a blob of UTF-8 encoded text. It happens to be UTF-8 because that is what people's text editors saved. Then we like to our decode stack and say it is Latin-1 so that no translation happens. Then we write those strings directly without re-encoding and everything just sort of works.

The specific of this bug is that linux, windows, and macos all encode write non-Latin-1 characters in different bit patterns that are all UTF-8 compliant. Everything works as long as the host and exec platforms are the same. If you mix them, things can break.

tetromino · 2024-08-29T21:17:06Z

@fmeum - the plan is there, but delayed to probably next year (so not in Bazel 8.0).

fmeum · 2024-08-31T09:00:49Z

Everything works as long as the host and exec platforms are the same. If you mix them, things can break.

This is not true as far as I can tell. Remote execution correctly translates between the internal encoding and UTF-8 since d91e05e, so mixing different host and exec platforms should actually not be a problem. Subtle issues arise when Bazel uses the Java (N)IO APIs to interact with the filesystem, such as in JavaIoFileSystem or WindowsFileSystem. Java uses the encoding specified by `sun.jnu.encoding` for that and this can't be influenced reliably (it's always UTF-8 on macOS and can only be forced to Latin-1 on Linux if a latin-1 locale is installed, which isn't usually the case). That's why I think that introducing a reencoding step to the filesystem layer would help squash the remaining issues. It's not too intrusive, there is already a central method for interop with Java APIs that we could extend.

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale. This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass. Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK. * Replace ad-hoc conversion logic with the new consistent set of helper functions. * Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior. * Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line. * Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters. * Read the downloader config using Bazel's filesystem implementation. * Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this). * Correctly reencode path strings for `LocalDiffAwareness`. * Correctly reencode the value of `user.dir`. * Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775. * Fix encoding issues in `TargetCompleteEvents`. * Fix encoding issues in `SubprocessFactory` implementations. * Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now. * Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths. Fixes bazelbuild#1775. Fixes bazelbuild#11602. Fixes bazelbuild#18293. Work towards #374. Work towards bazelbuild#23859. Closes bazelbuild#24010. PiperOrigin-RevId: 694114597 Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale. This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass. Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK. * Replace ad-hoc conversion logic with the new consistent set of helper functions. * Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior. * Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line. * Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters. * Read the downloader config using Bazel's filesystem implementation. * Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this). * Correctly reencode path strings for `LocalDiffAwareness`. * Correctly reencode the value of `user.dir`. * Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775. * Fix encoding issues in `TargetCompleteEvents`. * Fix encoding issues in `SubprocessFactory` implementations. * Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now. * Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths. Fixes bazelbuild#1775. Fixes bazelbuild#11602. Fixes bazelbuild#18293. Work towards bazelbuild#374. Work towards bazelbuild#23859. Closes bazelbuild#24010. PiperOrigin-RevId: 694114597 Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

aiuto added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc untriaged labels Jun 22, 2020

haxorz removed the untriaged label Aug 14, 2020

haxorz assigned alandonovan Aug 14, 2020

haxorz added the P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) label Aug 14, 2020

janakdr unassigned alandonovan Jan 23, 2021

janakdr added the type: bug label Jan 23, 2021

aiuto added this to the unicode milestone Apr 20, 2021

brandjon added team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols and removed team-Core Skyframe, bazel query, BEP, options parsing, bazelrc labels Mar 15, 2023

github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 19, 2024

github-actions bot removed the stale Issues or PRs that are stale (no activity for 30 days) label May 20, 2024

fmeum mentioned this issue Oct 28, 2024

Fix most Unicode encoding bugs #24010

Closed

copybara-service bot closed this as completed in a58fe3f Nov 7, 2024

iancha1992 mentioned this issue Nov 8, 2024

[8.0.0] Fix most Unicode encoding bugs #24260

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

aiuto commented Jun 16, 2020 •

edited

Loading

alandonovan commented Jun 17, 2020

aiuto commented Jun 17, 2020

alandonovan commented Jun 17, 2020

aiuto commented Jun 17, 2020

alandonovan commented Jun 19, 2020 •

edited

Loading

aiuto commented Jun 19, 2020

alandonovan commented Aug 18, 2020

github-actions bot commented May 19, 2024

aiuto commented May 19, 2024

fmeum commented Aug 29, 2024

aiuto commented Aug 29, 2024

tetromino commented Aug 29, 2024

fmeum commented Aug 31, 2024 via email •

edited

Loading

target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

Comments

aiuto commented Jun 16, 2020 • edited Loading

alandonovan commented Jun 17, 2020

aiuto commented Jun 17, 2020

alandonovan commented Jun 17, 2020

aiuto commented Jun 17, 2020

alandonovan commented Jun 19, 2020 • edited Loading

aiuto commented Jun 19, 2020

alandonovan commented Aug 18, 2020

github-actions bot commented May 19, 2024

aiuto commented May 19, 2024

fmeum commented Aug 29, 2024

aiuto commented Aug 29, 2024

tetromino commented Aug 29, 2024

fmeum commented Aug 31, 2024 via email • edited Loading

aiuto commented Jun 16, 2020 •

edited

Loading

alandonovan commented Jun 19, 2020 •

edited

Loading

fmeum commented Aug 31, 2024 via email •

edited

Loading