Allow any characters in filenames / labels #374

Open
damienmg opened this issue Aug 13, 2015 · 92 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob type: feature request
Comments

@damienmg
Contributor

Ultimately any character can be part of a filename. We should probably allow that.

Some mangling to generate the corresponding label should probably be done.

Original report on the mailing-list:
https://groups.google.com/d/msgid/bazel-discuss/CAN0GiO3__5jXo5rZqroSj0mFxpqCzUZZVkY%3DSNsJK1%2BZ1BdJLg%40mail.gmail.com

@abergmeier
Contributor

  1. So are we talking Unicode?
  2. Where does any character stop?
  3. How do you treat characters that are not allowed on certain platforms?
  4. When using mangling, how do you handle collisions?

@kayasoze

In POSIX, filenames are "bags of bytes"--there is no encoding; however, NUL and / are not allowed. Windows has a few more restrictions. Perhaps the BUILD file should be parsed in the encoding of the system locale, usually UTF-8, and filenames run through a ValidForCurrentPlatform() function which checks for disallowed characters. However, opting for strict platform neutrality in this way means that Bazel would have to represent filenames as a bag of bytes and not a Unicode string, as there is no guarantee that the filename will roundtrip through Unicode correctly. The problem can probably be simplified by restricting filenames to be UTF-8 or UTF-16, which should cover most people's needs even though that's not strict POSIX.
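
A minimal sketch of what such a check could look like, in Python for brevity. The function name `valid_for_platform` and the two character sets are illustrative assumptions, not Bazel code; the Windows set covers only reserved characters and ignores reserved device names and trailing dots or spaces.

```python
# Characters that may not appear in a single file name component.
POSIX_FORBIDDEN = {"\0", "/"}
WINDOWS_FORBIDDEN = set('<>:"/\\|?*') | {chr(c) for c in range(32)}


def valid_for_platform(name: str, platform: str) -> bool:
    """Return True if `name` contains no character the platform disallows.

    `platform` is a hypothetical selector: "windows" or anything else (POSIX).
    """
    forbidden = WINDOWS_FORBIDDEN if platform == "windows" else POSIX_FORBIDDEN
    return bool(name) and not any(ch in forbidden for ch in name)
```

With these sets, a name such as `a:b.txt` passes on POSIX but not on Windows, which is the platform-neutrality problem described above.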

@ulfjack
Contributor

ulfjack commented Oct 22, 2015

Well, I think we can probably require valid UTF-8 file names and strongly recommend that people use UTF-8 for their file system. For labels / BUILD files, we probably need an escaping scheme, at least for the control characters. If there's a file that isn't valid UTF-8, we give an error message?

@btelles

btelles commented Nov 18, 2015

Our company codes mainly in C++, but our frontend uses a lot of JS and nodejs modules which have all sorts of characters in the filenames--for example, -, #, @, (, and ).

Right now this is a major blocker for getting all our codebase under one build system since we can't reference files with semi-special characters. I don't think Bazel should decide what characters are acceptable in file names, as that reduces file names to those that fit both (1) supported languages and (2) supported platforms. This seems unnecessarily restrictive, and is becoming a major pain point for us.

@ulfjack
Contributor

ulfjack commented Nov 19, 2015

Agreed. Unfortunately, it's a bit tricky to fix, as a lot of code assumes that the mapping from labels to file names (and vice versa) is trivial, and doesn't require escaping. Any suggestions on an escaping scheme?

@damienmg
Contributor Author

URL based?

@abergmeier
Contributor

You mean an own URI scheme? Sounds good.

@damienmg
Contributor Author

I mean replacing special characters with %XX sequences, one per UTF-8 byte, where XX is the byte's value in hexadecimal.
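
For illustration, this is how URL-style escaping behaves with Python's standard `urllib.parse` module. It is only a sketch of the idea, not a proposal for which characters Bazel should escape; the wrapper names `escape_name` and `unescape_name` are made up here.

```python
from urllib.parse import quote, unquote


def escape_name(name: str) -> str:
    # Percent-encode every UTF-8 byte that is not an ASCII letter, digit,
    # or one of "_.-~". Passing safe="" means "/" is escaped as well.
    return quote(name, safe="")


def unescape_name(escaped: str) -> str:
    # Reverse the escaping; the bytes are decoded as UTF-8 again.
    return unquote(escaped)
```

For example, `escape_name("a@b")` gives `a%40b`, and `unescape_name` restores the original name.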

ulfjack assigned philwo and unassigned ulfjack Mar 2, 2016
@ulfjack
Contributor

ulfjack commented Mar 2, 2016

Sorry, I won't be able to work on this. @philwo had an interest, maybe he can make some progress here. :-/

@kayasoze

This blocks our Bazel deployment as well.

@RonnieAtOracle

This is blocking us. We have a templating system where we need to build our template files. The filenames themselves contain template variables (e.g. ${ServiceName}.java). Bazel does not support $, {, or } in file names.

@philwo
Member

philwo commented May 25, 2016

I totally agree that this is important and should be done, and I want it myself. However, I don't have the time to work on it in the coming months, so I have to unassign myself.

@DemiMarie

Here is my proposal:

  • Metacharacters (:, %, =, any others) must be %-encoded
  • All other characters are allowed. This includes Unicode characters.
  • Non-UTF8 names are not allowed, even if escaped. This is because there is no good way to handle them cross-platform, nor to display them to the user.
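
A rough Python sketch of how these three rules could fit together. The metacharacter set is just the three characters named above (the "any others" are left out), and the function names are hypothetical. Because % itself is always escaped, two different file names can never map to the same label, which addresses the collision question raised earlier.

```python
import re

# Assumed metacharacter set: only the three characters named in the proposal.
METACHARACTERS = ":%="


def encode_label_name(raw: bytes) -> str:
    # Rule 3: non-UTF-8 names are rejected (decode raises UnicodeDecodeError).
    text = raw.decode("utf-8")
    # Rules 1 and 2: escape metacharacters, keep everything else as-is.
    return "".join(
        f"%{ord(ch):02X}" if ch in METACHARACTERS else ch for ch in text
    )


def decode_label_name(label: str) -> bytes:
    # Replace each %XX with the character it stands for, then re-encode.
    text = re.sub(
        r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), label
    )
    return text.encode("utf-8")
```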

@mihnita

mihnita commented Oct 6, 2016

Plain ASCII (and even that only partially) makes this feel like we are in the early 90s.

There are reasonable ways to handle that.
For POSIX using the default locale would be good enough.

If my project is C/C++, and it is cross-platform, and I have problems handling Unicode, then I will not use Unicode in file names. And the fact that bazel "explodes" is not such a problem.
But if I use something like Java, then "it just works", and bazel would work too.

Even better would be to allow for a character-set option in the project file.
This is what maven does. And what Java does with -Dfile.encoding=UTF-8
So if it is there, use it. If not, then take the system charset.

I did not move one project to bazel because test units check that Unicode file names work.
So the files are there, make it through git, work with maven and ant and gradle and java.
But bazel fails because there is an "@" in the file name... Which is supported on all OSes.

@snakethatlovesstaticlibs

snakethatlovesstaticlibs commented Apr 11, 2024

@fmeum I worked around this by adding the test directories to an exclude in a filegroup, and it seems to be working now (because as you mentioned, the files are part of tests, so I don't need them)

For reference, the error I saw was:

ERROR: .../bazel_build/external/erlang/BUILD.bazel:10:15: Foreign Cc - Configure: Building erlang failed: error reading file '@@erlang//:erts/test/erlc_SUITE_data/src/😀/erl_test_unicode.erl': .../bazel_build/external/erlang/erts/test/erlc_SUITE_data/src/😀/erl_test_unicode.erl (No such file or directory)

Sorry if this is unrelated to the issue in the original report; it sounded relevant while I was debugging.

copybara-service bot pushed a commit that referenced this issue Nov 5, 2024
This change patches the app manifest of the `java.exe` launcher in the embedded JDK to always use the UTF-8 codepage on Windows 1903 and later.

This is necessary because the launcher sets sun.jnu.encoding to the system code page, which by default is a legacy code page such as Cp1252 on Windows. This causes the JVM to be unable to interact with files whose paths contain Unicode characters not representable in the system code page, as well as command-line arguments and environment variables containing such characters.

The Windows VMs in CI are not running Windows 1903 or later yet, so this change can currently only be tested locally by running `bazel info character-encoding` and verifying that it prints `sun.jnu.encoding = UTF-8`.

Work towards #374
Work towards #18293
Work towards #23859

Closes #24172.

PiperOrigin-RevId: 693466466
Change-Id: I4914c21e846493a8880ac8c6f5e1afa9fae87366
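
To illustrate the code page problem this commit describes: a legacy code page such as Cp1252 can represent some non-ASCII characters but not others. The Python sketch below only demonstrates that limitation; the helper name `representable` is made up and is unrelated to how the JVM performs the conversion.

```python
def representable(name: str, code_page: str) -> bool:
    """Return True if every character of `name` exists in `code_page`."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False
```

With this helper, `é.txt` is representable in `cp1252`, while `☃.txt` is representable only in `utf-8`.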
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Nov 6, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 7, 2024

Commit
7bb8d2b

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
copybara-service bot pushed a commit that referenced this issue Nov 7, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.
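
As a rough illustration of the *internal* versus *Unicode* distinction, here is a Python model. It assumes that an internal string holds the UTF-8 bytes of a path, one byte per Latin-1 character. The function names are hypothetical, *platform* strings are not modelled, and the real helpers live in the Java `StringEncoding` class.

```python
def unicode_to_internal(text: str) -> str:
    # Each UTF-8 byte becomes one Latin-1 character (code points 0-255).
    return text.encode("utf-8").decode("latin-1")


def internal_to_unicode(internal: str) -> str:
    # Recover the raw bytes, then interpret them as UTF-8.
    return internal.encode("latin-1").decode("utf-8")
```

Under this model, the internal form of `☃.txt` has seven characters (three for the snowman's UTF-8 bytes plus four ASCII characters), and converting back yields the original string.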

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.
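
The `GetStringUTF` point above is easy to trip over, so here is a small Python sketch of the "modified UTF-8" format that JNI uses. It differs from standard UTF-8 in two ways: U+0000 is written as two bytes, and characters outside the Basic Multilingual Plane are written as a surrogate pair of three bytes each. The function name is made up for this example.

```python
def modified_utf8(text: str) -> bytes:
    """Encode `text` the way JNI's modified UTF-8 does."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is encoded as an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, 3 bytes per surrogate.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)
```

For a path containing 😀 the two encodings disagree (six bytes instead of four), so a consumer expecting real UTF-8 sees an invalid sequence.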

Fixes #1775.

Fixes #11602.

Fixes #18293.

Work towards #374.

Work towards #23859.

Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
@fmeum
Collaborator

fmeum commented Nov 20, 2024

After a series of patches, Bazel 8 should now support the following on Linux, macOS and Windows (Windows build 1903 and higher):

  • all Unicode characters except control characters, :, \ and, in some positions, ., in file names that are also labels (i.e., can be referenced in BUILD files)
  • all Unicode characters in other file names
  • all Unicode characters in runfiles (including spaces and newlines)
  • all Unicode characters in the output base or workspace path (including spaces)
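
A rough Python approximation of the first rule in the list above, for readers who want to pre-check file names. Two things are assumptions: "control characters" is read as Unicode category Cc, and the position-dependent restriction on . is not modelled at all. The function name is hypothetical and this is not Bazel's label validation.

```python
import unicodedata


def may_be_label_name(name: str) -> bool:
    """Approximate check: no control characters, no ':' and no backslash."""
    if not name:
        return False
    return not any(
        ch in ":\\" or unicodedata.category(ch) == "Cc" for ch in name
    )
```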

You can try this with Bazelisk by setting USE_BAZEL_VERSION to the hash of the latest commit on the release branch.

Please file a separate issue if you run into any problems with the above.

ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this issue Dec 18, 2024
ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this issue Dec 18, 2024
@phst
Contributor

phst commented Jan 15, 2025

> After a series of patches, Bazel 8 should now support the following on Linux, macOS and Windows (Windows build 1903 and higher):

Thanks a lot! Is there some documentation about this? Specifically:
Assume I have a file whose name appears as ☃.txt in a graphical file explorer (Windows Explorer on Windows, Finder on macOS, Nautilus or similar on GNU/Linux systems). This filename will be encoded as UTF-16 on Windows, UTF-8 on macOS (actually UTF-16 NFD on HFS+ AFAIK, but the macOS POSIX API will present it as UTF-8 NFD), and (typically) UTF-8 on GNU/Linux (assuming LC_CTYPE is set to something with UTF-8).

  • In a BUILD file, how do I reference such a file in a cross-platform way? Is it sufficient to encode the BUILD file as UTF-8, despite https://bazel.build/versions/8.0.0/concepts/build-files saying the BUILD files are interpreted as Latin-1?
  • In a Starlark file, how does that filename appear in a string? What are the string's elements? (My assumption is that the string's elements are the UTF-8 code units as stored in the BUILD file.)
  • How will that filename be encoded in runfiles manifests? As UTF-8, even on Windows? Is there any difference between host and exec configs?
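
To make the encoding differences mentioned above concrete, this Python sketch prints the byte sequences for one name under the three representations. It only shows the encodings themselves; which one a given OS API actually returns is as described in the comment, not something this code verifies. The function name is made up.

```python
import unicodedata


def encodings(name: str) -> dict:
    """Return the bytes of `name` under three common representations."""
    return {
        "utf-8 (NFC)": unicodedata.normalize("NFC", name).encode("utf-8"),
        "utf-8 (NFD)": unicodedata.normalize("NFD", name).encode("utf-8"),
        "utf-16-le": name.encode("utf-16-le"),
    }


for label, data in encodings("\u00e9.txt").items():
    print(f"{label:12} {data.hex(' ')}")
```

Note that ☃ has no canonical decomposition, so NFC and NFD agree for it; a name such as `é.txt` is where the macOS NFD form differs byte for byte.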

@fmeum
Collaborator

fmeum commented Jan 15, 2025

The current situation is that all Starlark and other Bazel files (e.g. .bazelrc, manifests, ...) are assumed (but not enforced) to be encoded in UTF-8. File system paths are reencoded into the appropriate OS/FS-dependent encoding when Bazel interfaces with the OS/FS.

Starlark strings behave as UTF-8 byte arrays, with no notion of Unicode characters. Very few methods have bugs (e.g. trim can trim individual bytes from UTF-8 characters that would be whitespace in Latin-1).
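
The kind of bug described here can be reproduced in plain Python by modelling a string as UTF-8 bytes stored one per Latin-1 character. This is an analogy only: it uses Python's own `str.strip`, not Bazel's Starlark implementation, and `as_byte_string` is a made-up helper.

```python
def as_byte_string(text: str) -> str:
    # Model: one Latin-1 character per UTF-8 byte, no notion of code points.
    return text.encode("utf-8").decode("latin-1")


s = as_byte_string("\u00e0")        # "à" is the two bytes C3 A0 in UTF-8
stripped = s.strip()                # byte A0 looks like a no-break space
remaining = stripped.encode("latin-1")
print(len(s), remaining)
```

The string has length 2 rather than 1, and stripping removes the trailing A0 byte because it reads as whitespace in Latin-1, leaving a lone C3 that is no longer valid UTF-8.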

I wasn't aware of that docs statement, I will update it.

@phst
Contributor

phst commented Jan 15, 2025

Thanks!
What about the runfiles manifest on Windows? It can't be UTF-16 (that would be the appropriate OS encoding, but not ASCII-compatible), so is it UTF-8 even on Windows?

@phst
Contributor

phst commented Jan 15, 2025

> Starlark and other Bazel files (e.g. .bazelrc, manifests, ...) are assumed (but not enforced) to be encoded in UTF-8

Stardoc assumes Latin-1 for docstrings, though. Encoding a Starlark file in UTF-8 will result in double-encoding, cf. https://github.com/phst/rules_elisp/blob/master/docs/generate.py#L236-L239
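
For readers unfamiliar with the double-encoding effect, this is roughly what happens to the bytes, shown in Python. The function names are illustrative and this is not the code from the linked script.

```python
def double_encode(text: str) -> bytes:
    # UTF-8 bytes are misread as Latin-1 and then encoded as UTF-8 again.
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair(data: bytes) -> str:
    # Undo the extra layer: decode, map back to raw bytes, decode again.
    return data.decode("utf-8").encode("latin-1").decode("utf-8")
```

A docstring containing `é` (bytes C3 A9) ends up as C3 83 C2 A9, which renders as `Ã©` until the extra layer is removed.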

@phst
Contributor

phst commented Jan 15, 2025

If Starlark files are now assumed to be UTF-8, then I guess for Stardoc https://github.com/bazelbuild/bazel/blob/8.0.0/src/main/java/com/google/devtools/build/lib/starlarkdocextract/RuleInfoExtractor.java#L65 and similar occurrences (basically wherever a string proto field in the Stardoc proto is set) need to be fixed.

@fmeum
Collaborator

fmeum commented Jan 16, 2025

> Thanks! What about the runfiles manifest on Windows? It can't be UTF-16 (that would be the appropriate OS encoding, but not ASCII-compatible), so is it UTF-8 even on Windows?

Yes, all output files produced by Bazel should use UTF-8 and \n line endings on all platforms, including Windows.

> If Starlark files are now assumed to be UTF-8, then I guess for Stardoc https://github.com/bazelbuild/bazel/blob/8.0.0/src/main/java/com/google/devtools/build/lib/starlarkdocextract/RuleInfoExtractor.java#L65 and similar occurrences (basically wherever a string proto field in the Stardoc proto is set) need to be fixed

Thanks for pointing that out, I sent #24935 to fix this.

phst added a commit to phst/rules_python that referenced this issue Jan 20, 2025
See bazelbuild/bazel#374 (comment):

> all output files produced by Bazel should use UTF-8 and \n line endings on
> all platforms, including Windows.
@phst
Contributor

phst commented Jan 22, 2025

> I wasn't aware of that docs statement, I will update it.

Here's another doc that I guess is outdated now: https://bazel.build/concepts/labels

  • Repository names: No documentation, but I'd assume that even with this change repository names (both canonical and apparent) are ASCII-only, with some more restrictions (no newlines, spaces, slashes, ...)
  • Package names: Since package names correspond to directory names, these can now also contain non-ASCII characters?
  • Target names: Definitely can contain non-ASCII characters now.

@phst
Contributor

phst commented Jan 22, 2025

> Yes, all output files produced by Bazel should use UTF-8 and \n line endings on all platforms, including Windows.

OK, then the runfiles libraries also need to be adapted.

  • Go: Probably no change needed, Go uses WTF-8 for filenames on Windows
  • Python: Probably only need to make sure that files are opened with the correct encoding (fix: Fix encoding of runfiles manifest and repository mapping files. rules_python#2568)
  • C++: A lot more changes required. The use of narrow strings throughout makes manifest files and directories with non-ASCII characters nonportable/impossible. We need at least overloads for Create and Rlocation with std::wstring on Windows.
  • Others: ???
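
For the Python case, "opened with the correct encoding" could look like the sketch below. It assumes the simple manifest layout of one `key value` pair per line separated by the first space, and it does not handle any escaping of spaces or newlines. `read_manifest` is a made-up name, not the rules_python API.

```python
def read_manifest(path: str) -> dict:
    """Parse a runfiles manifest as UTF-8 with \\n line endings."""
    entries = {}
    # Explicit encoding and newline avoid depending on the platform defaults
    # (for example a legacy code page or \r\n translation on Windows).
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition(" ")
            entries[key] = value
    return entries
```

Without the explicit `encoding="utf-8"`, Python falls back to the locale's preferred encoding, which is how non-ASCII entries can get garbled on Windows.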

phst added a commit to phst/rules_python that referenced this issue Jan 22, 2025
@fmeum
Collaborator

fmeum commented Jan 22, 2025

Thanks for sending the fix for Python!

> C++: A lot more changes required. The use of narrow strings throughout makes manifest files and directories with non-ASCII characters nonportable/impossible. We need at least overloads for Create and Rlocation with std::wstring on Windows.

Microsoft now recommends using the -A variety of functions with the UTF-8 code page (forced via an app manifest) instead of the wide string functions for new software. Existing software probably already has its own conversion functions to and from UTF-8, so I would personally lean against complicating the API for everyone by adding more overloads.
