new_http_archive can't handle archives containing unicode-encoded filenames #1653

Closed
nelhage opened this issue Aug 16, 2016 · 12 comments
Labels
P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

@nelhage
nelhage commented Aug 16, 2016

I'm attempting to add libgit2 to a project as a bazel external, like so:

new_http_archive(
  name = "com_github_libgit2",
  url = "https://github.com/libgit2/libgit2/archive/v0.24.1.tar.gz",
  strip_prefix = "libgit2-0.24.1",
  sha256 = "60198cbb34066b9b5c1613d15c0479f6cd25f4aef42f7ec515cd1cc13a77fede",
  build_file = "BUILD.libgit2",
)

Unfortunately, bazel build @com_github_libgit2//... fails with

Unhandled exception thrown during build; message: Unrecoverable error while evaluating node 'REPOSITORY_DIRECTORY:@com_github_libgit2' (requested by nodes 'REPOSITORY:@com_github_libgit2')
INFO: Elapsed time: 3.331s
java.lang.RuntimeException: Unrecoverable error while evaluating node 'REPOSITORY_DIRECTORY:@com_github_libgit2' (requested by nodes 'REPOSITORY:@com_github_libgit2')
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1070)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:474)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /home/nelhage/.cache/bazel/_bazel_nelhage/63183be0b56fd73a3d972912805a7bbe/external/com_github_libgit2/tests/resources/status/这
        at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
        at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
        at java.io.File.toPath(File.java:2234)
        at com.google.devtools.build.lib.bazel.repository.CompressedTarFunction.decompress(CompressedTarFunction.java:69)
        at com.google.devtools.build.lib.bazel.repository.DecompressorValue.decompress(DecompressorValue.java:76)
        at com.google.devtools.build.lib.bazel.repository.NewHttpArchiveFunction.fetch(NewHttpArchiveFunction.java:70)
        at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:155)
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1016)
        ... 4 more
java.lang.RuntimeException: Unrecoverable error while evaluating node 'REPOSITORY_DIRECTORY:@com_github_libgit2' (requested by nodes 'REPOSITORY:@com_github_libgit2')
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1070)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:474)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /home/nelhage/.cache/bazel/_bazel_nelhage/63183be0b56fd73a3d972912805a7bbe/external/com_github_libgit2/tests/resources/status/这
        at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
        at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
        at java.io.File.toPath(File.java:2234)
        at com.google.devtools.build.lib.bazel.repository.CompressedTarFunction.decompress(CompressedTarFunction.java:69)
        at com.google.devtools.build.lib.bazel.repository.DecompressorValue.decompress(DecompressorValue.java:76)
        at com.google.devtools.build.lib.bazel.repository.NewHttpArchiveFunction.fetch(NewHttpArchiveFunction.java:70)
        at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:155)
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1016)
        ... 4 more

The proximate cause seems to be that Bazel forces itself into a Latin-1 locale:

bazel/src/main/cpp/blaze.cc

Lines 1718 to 1724 in 936c2c2

// Make the JVM use ISO-8859-1 for parsing its command line because "blaze
// run" doesn't handle non-ASCII command line arguments. This is apparently
// the most reliable way to select the platform default encoding.
setenv("LANG", "en_US.ISO-8859-1", 1);
setenv("LANGUAGE", "en_US.ISO-8859-1", 1);
setenv("LC_ALL", "en_US.ISO-8859-1", 1);
setenv("LC_CTYPE", "en_US.ISO-8859-1", 1);

UnixPath.encode will happily encode UTF-8 paths in a UTF-8 locale.

I think this is distinct from #374 in that I don't even need to reference the problematic files in a rule; bazel can't even unpack the tarball, even though I don't care about the files in question.
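
The encoding failure itself is easy to reproduce outside Bazel. Below is a minimal Python sketch, not Bazel's code: `can_encode` is a hypothetical helper introduced only for illustration. It shows that the character from the failing path has no ISO-8859-1 representation, while UTF-8 handles it fine:

```python
def can_encode(name: str, encoding: str) -> bool:
    """Return True if every character in `name` is representable in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The last path component from the stack trace above.
filename = "这"

print(can_encode(filename, "utf-8"))       # True
print(can_encode(filename, "iso-8859-1"))  # False
```

This only mirrors the character-set mismatch; it says nothing about where in Bazel the path should be encoded differently.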

@philwo philwo added type: bug P2 We'll consider working on this in future. (Assignee optional) labels Aug 16, 2016
@damienmg damienmg modified the milestones: 0.6, 0.7 Dec 12, 2016
@bscarlet

I just ran into this too.

@jfaust

jfaust commented Nov 18, 2017

This is causing problems for me with libvips, whose release archive contains a filename with Russian characters.

@dslomov dslomov added P1 I'll work on this now. (Assignee required) external-repos-triaged and removed P2 We'll consider working on this in future. (Assignee optional) labels Jan 9, 2018
bazel-io pushed a commit that referenced this issue Jan 31, 2018
Add a test verifying that http_archive can extract a tar archive
containing unicode characters. While such files cannot be referred
to by labels, it is still important that the archive can be extracted.
Also fix that use case on Darwin, by appropriately reencoding the string,
so that the Files java standard library can encode it back to what we
had in the first place.

Work-around for #1653, showing that http_archive from @bazel_tools can
be used; however, the issue still remains for zip archives.

Change-Id: If944203bf618c21705af676347d8591ab015d559
PiperOrigin-RevId: 183987726
bazel-io pushed a commit that referenced this issue Feb 1, 2018
*** Reason for rollback ***

Breaks on our CI Linux machines (but works on our work desktop Linux machines); apparently, even our own Linux machines are too different from each other...

Fixes #4557

*** Original change description ***

http_archive: verify that unicode characters are OK in tar archives

Add a test verifying that http_archive can extract a tar archive
containing unicode characters. While such files cannot be referred
to by labels, it is still important that the archive can be extracted.
Also fix that use case on Darwin, by appropriately reencoding the string,
so that the Files java standard library can encode it back to what we
had in the first place.

Work-around for #1653, showing that http_archive from @bazel_tools can
be used; however, the issue still remains for zip archives.

***

PiperOrigin-RevId: 184132385
katre pushed a commit that referenced this issue Feb 20, 2018
katre pushed a commit that referenced this issue Feb 28, 2018
philwo pushed a commit that referenced this issue Mar 6, 2018
@steeve
Contributor

steeve commented May 21, 2018

This is still an issue as of Bazel 0.13 with

    native.new_git_repository(
        name = "com_github_mosra_corrade",
        remote = "https://github.com/mosra/corrade.git",
        commit = "10a9abeca9938091edbf8a7fdf7cd6a944ce01c4",
        build_file = "//:third_party/com_github_mosra_corrade/BUILD.bazel.in",
    )

@katre
Member

katre commented May 22, 2018

We are deprecating the native versions of http_archive and git_repository. Have you tried the Skylark-based versions? Do they produce the same error?

To use the Skylark new_git_repository, just add this to your WORKSPACE:

load(
    "@bazel_tools//tools/build_defs/repo:git.bzl",
    "git_repository",
    "new_git_repository",
)

@steeve
Contributor

steeve commented May 22, 2018 via email

@nelhage
Author

nelhage commented Sep 2, 2018

This no longer crashes with the Starlark rules, but it still fails during extraction:

ERROR: Malformed input or input contains unmappable characters: /home/nelhage/.cache/bazel/_bazel_nelhage/9608ccc15fa131b938e0090f75bd00b5/external/com_github_libgit2/tests/resources/status/这

I thought I could work around it with patch_cmds = ["rm -rf tests"], but extraction appears to fail before that command ever runs.

@EricCousineau-TRI
Contributor

I'm also affected, using the Starlark http_archive with Bazel 0.16.1 to fetch sphinx:

ERROR: Analysis of target '//tools/workspace/sphinx:sphinx_build' failed; build aborted: no such package '@sphinx//': Malformed input or input contains unmappable characters: .../external/sphinx/tests/roots/test-image-glob/testimäge.png

Same as above; using patch_cmds does not work.

I may try writing a workaround Python script, run via repository_ctx, that manually scrubs these filenames during extraction.
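
A rough sketch of what such a script could look like, assuming a tar archive and that simply dropping the affected files is acceptable. `extract_ascii_only` is a hypothetical helper name, and wiring it into a repository rule is left out. Only Python's standard `tarfile` module is used, and `str.isascii` needs Python 3.7 or newer:

```python
import tarfile


def extract_ascii_only(archive_path: str, dest_dir: str) -> list:
    """Extract `archive_path` into `dest_dir`, skipping non-ASCII member names.

    Only member names are checked (not link targets). Returns the names of
    the members that were skipped.
    """
    skipped = []
    with tarfile.open(archive_path) as tar:
        kept = []
        for member in tar.getmembers():
            if member.name.isascii():
                kept.append(member)
            else:
                skipped.append(member.name)

        # Use the safer "data" extraction filter where this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, members=kept, **kwargs)
    return skipped
```

Calling `extract_ascii_only("v0.24.1.tar.gz", "out")` would unpack everything except entries whose path contains a non-ASCII character; because the check is on the full member path, files under a non-ASCII directory are skipped as well. This only helps when the skipped files are not needed by the build.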

@ashi009
Contributor

ashi009 commented Jan 15, 2019

Is there any progress on fixing this? It's understandable that this is not an issue inside Google, but it is apparently a huge bummer for many community users.

Given that this is a 1-year-old P1 bug, shall we provide some workarounds for now? E.g., adding an http_repository attribute to skip non-ASCII-named files or to skip certain directories.

@dslomov dslomov added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed category: extensibility > external repositories labels Mar 21, 2019
@dslomov
Contributor

dslomov commented Mar 21, 2019

Let's make sure extract and download_and_extract work.

@aehlig
Contributor

aehlig commented Mar 25, 2019

Interestingly enough, the problem is very sensitive to tiny changes in the environment; https://bazel-review.googlesource.com/c/bazel/+/93754 passes on my corp desktop, but fails on our CI machines.

@benjaminp
Collaborator

Yes, see #7757 for a bit of an explanation.

@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Jun 15, 2020
@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Nov 29, 2021