Allow any characters in filenames / labels #374
|
In POSIX, filenames are "bags of bytes"--there is no encoding; however, |
Well, I think we can probably require valid UTF-8 file names and strongly recommend that people use UTF-8 for their file system. For labels / BUILD files, we probably need an escaping scheme, at least for the control characters. If there's a file that isn't valid UTF-8, we give an error message? |
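The "require valid UTF-8 and report an error otherwise" idea could look roughly like the sketch below. It is only an illustration in Python, not Bazel code; the function name `check_utf8_filename` and the error wording are made up for this example.

```python
def check_utf8_filename(raw: bytes) -> str:
    """Return the decoded file name, or raise ValueError if it is not valid UTF-8.

    Hypothetical helper: POSIX hands us raw bytes, so validation has to
    happen on the byte sequence, not on an already-decoded string.
    """
    try:
        # bytes.decode is strict by default and rejects malformed sequences.
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ValueError(
            f"file name {raw!r} is not valid UTF-8 (bad byte at offset {e.start})"
        ) from e
```

A Latin-1 encoded name such as `b"caf\xe9.txt"` would be rejected here, while the UTF-8 encoding of the same name would pass.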
Our company codes mainly in C++, but our frontend uses a lot of JS and nodejs modules which have all sorts of characters in the filenames--for example, -, #, @, (, and ). Right now this is a major blocker for getting all our codebase under one build system since we can't reference files with semi-special characters. I don't think Bazel should decide what characters are acceptable in file names, as that reduces file names to those that fit both (1) supported languages and (2) supported platforms. This seems unnecessarily restrictive, and is becoming a major pain point for us. |
Agreed. Unfortunately, it's a bit tricky to fix, as a lot of code assumes that the mapping from labels to file names (and vice versa) is trivial, and doesn't require escaping. Any suggestions on an escaping scheme? |
URL based? |
You mean a custom URI scheme? Sounds good. |
I mean replacing special characters with %XX, where XX is the hexadecimal value of each UTF-8 byte. |
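For illustration, a percent-escaping scheme of this kind might be sketched as follows. This is not what Bazel implements; the set of characters treated as safe (`SAFE`) and the function names `escape_path` and `unescape_path` are assumptions chosen for the example.

```python
# Hypothetical set of bytes that may appear unescaped in a label.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._-/"
)


def escape_path(path: str) -> str:
    """Replace every byte outside SAFE with %XX (uppercase hex of the UTF-8 byte)."""
    out = []
    for byte in path.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def unescape_path(label: str) -> str:
    """Invert escape_path; assumes the input was produced by escape_path."""
    raw = bytearray()
    i = 0
    while i < len(label):
        if label[i] == "%":
            raw.append(int(label[i + 1 : i + 3], 16))
            i += 3
        else:
            raw.append(ord(label[i]))
            i += 1
    return raw.decode("utf-8")
```

Because `%` itself is not in the safe set, it is escaped as `%25`, so the mapping stays reversible. For example, `escape_path("a#b (1).js")` gives `a%23b%20%281%29.js`.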
Sorry, I won't be able to work on this. @philwo had an interest, maybe he can make some progress here. :-/ |
This blocks our Bazel deployment as well. |
This is blocking us. We have a templating system where we need to build our template files. The filenames themselves contain template variables (e.g. |
I totally agree that this is important and should be done, and I want it myself. However, I don't have the time to work on it in the coming months, so I have to unassign it. |
Here is my proposal:
|
Plain ASCII (and even that only partially) makes this feel like we are in the early 90s. There are reasonable ways to handle that. If my project is C/C++, and it is cross-platform, and I have problems handling Unicode, then I will not use Unicode in file names. And the fact that bazel "explodes" is not such a problem. Even better would be to allow for a character-set option in the project file. I did not move one project to bazel because its unit tests check that Unicode file names work. |
@fmeum I worked around this by adding the test directories to an For reference, the error I saw was:
Sorry if this is unrelated to the issue in the original report; it sounded relevant while I was debugging. |
This change patches the app manifest of the `java.exe` launcher in the embedded JDK to always use the UTF-8 codepage on Windows 1903 and later. This is necessary because the launcher sets `sun.jnu.encoding` to the system code page, which by default is a legacy code page such as Cp1252 on Windows. This causes the JVM to be unable to interact with files whose paths contain Unicode characters not representable in the system code page, as well as command-line arguments and environment variables containing such characters.

The Windows VMs in CI are not running Windows 1903 or later yet, so this change can currently only be tested locally by running `bazel info character-encoding` and verifying that it prints `sun.jnu.encoding = UTF-8`.

Work towards #374
Work towards #18293
Work towards #23859
Closes #24172.

PiperOrigin-RevId: 693466466
Change-Id: I4914c21e846493a8880ac8c6f5e1afa9fae87366
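To see why a legacy code page is a problem for paths, the following small Python sketch checks whether a path can be represented in a given encoding. It only illustrates the limitation described in the commit message and has nothing to do with the actual launcher patch; `representable` is a name invented for this example.

```python
def representable(path: str, code_page: str) -> bool:
    """Return True if every character of `path` can be encoded in `code_page`."""
    try:
        path.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False
```

With Cp1252, `café.txt` is representable but a Cyrillic name such as `файл.txt` is not, whereas both are representable in UTF-8.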
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target.

Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775.
Fixes #11602.
Fixes #18293.
Work towards #374.
Work towards #23859.
Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
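The distinction between *internal* and *Unicode* strings can be illustrated with a short Python sketch. This is not the `StringEncoding` API; the function names below are hypothetical, and the sketch assumes the internal form stores each UTF-8 byte of a path as one Latin-1 character.

```python
def unicode_to_internal(s: str) -> str:
    """Store the UTF-8 bytes of `s` as one Latin-1 character per byte."""
    return s.encode("utf-8").decode("latin-1")


def internal_to_unicode(s: str) -> str:
    """Recover the Unicode string; assumes the stored bytes are valid UTF-8."""
    return s.encode("latin-1").decode("utf-8")


def bytes_to_internal(raw: bytes) -> str:
    """Latin-1 maps all 256 byte values, so any byte sequence round-trips."""
    return raw.decode("latin-1")
```

Under these assumptions, `unicode_to_internal("ü")` is a two-character string (`"\u00c3\u00bc"`), and even a byte sequence that is not valid UTF-8, such as `b"\xff\xfe"`, survives a round trip through the internal form unchanged. That is the property that lets a "bag of bytes" path be carried in a string type.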
After a series of patches, Bazel 8 should now support the following on Linux, macOS and Windows (Windows build 1903 and higher):
You can try this with Bazelisk by setting Please file a separate issue if you run into any problems with the above. |
Thanks a lot! Is there some documentation about this? Specifically:
|
The current situation is that all Starlark and other Bazel files (e.g. Starlark strings behave as UTF-8 byte arrays, with no notion of Unicode characters. Very few methods have bugs (e.g. I wasn't aware of that docs statement, I will update it. |
Thanks! |
Stardoc assumes Latin-1 for docstrings, though. Encoding a Starlark file in UTF-8 will result in double-encoding, cf. https://github.com/phst/rules_elisp/blob/master/docs/generate.py#L236-L239 |
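The double-encoding effect described here can be reproduced in a few lines of Python. This only demonstrates the general mechanism (UTF-8 bytes misread as Latin-1 and then written out as UTF-8 again); it is not Stardoc's code, and the sample text is arbitrary.

```python
original = "café"

# A UTF-8 encoded source file contains these bytes.
utf8_bytes = original.encode("utf-8")

# A reader that assumes Latin-1 turns each byte into a separate character.
misread = utf8_bytes.decode("latin-1")

# Writing that string out as UTF-8 encodes the accented letter a second time.
double_encoded = misread.encode("utf-8")

# The damage is reversible as long as nothing else touched the data.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Here `utf8_bytes` is `b"caf\xc3\xa9"`, while `double_encoded` is `b"caf\xc3\x83\xc2\xa9"`, which is what shows up as mojibake in the generated docs.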
If Starlark files are now assumed to be UTF-8, then I guess for Stardoc https://github.com/bazelbuild/bazel/blob/8.0.0/src/main/java/com/google/devtools/build/lib/starlarkdocextract/RuleInfoExtractor.java#L65 and similar occurrences (basically wherever a string proto field in the Stardoc proto is set) need to be fixed. |
Yes, all output files produced by Bazel should use UTF-8 and
Thanks for pointing that out, I sent #24935 to fix this. |
See bazelbuild/bazel#374 (comment): > all output files produced by Bazel should use UTF-8 and \n line endings on > all platforms, including Windows.
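For tools that generate files, following the "UTF-8 and `\n` line endings on all platforms" rule could look like this in Python. It is a minimal sketch; `write_output` is a hypothetical helper name, not part of any Bazel or runfiles library.

```python
def write_output(path: str, text: str) -> None:
    """Write `text` as UTF-8 with \\n line endings, regardless of platform.

    Passing newline="\\n" disables the default translation of "\\n" to
    os.linesep, which would otherwise produce "\\r\\n" on Windows.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```

Setting `encoding` explicitly matters as well, since the default text encoding in Python can depend on the platform locale.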
Here's another doc that I guess is outdated now: https://bazel.build/concepts/labels
|
OK, then the runfiles libraries also need to be adapted.
|
Thanks for sending the fix for Python!
Microsoft now recommends using the |
Ultimately any character can be part of a filename. We should probably allow that.
Some mangling to generate the corresponding label should probably be done.
Original report on the mailing-list:
https://groups.google.com/d/msgid/bazel-discuss/CAN0GiO3__5jXo5rZqroSj0mFxpqCzUZZVkY%3DSNsJK1%2BZ1BdJLg%40mail.gmail.com