`new_http_archive` can't handle archives containing unicode-encoded filenames #1653

nelhage · 2016-08-16T04:49:37Z

I'm attempting to add libgit2 to a project as a bazel external, like so:

new_http_archive(
  name = "com_github_libgit2",
  url = "https://github.com/libgit2/libgit2/archive/v0.24.1.tar.gz",
  strip_prefix = "libgit2-0.24.1",
  sha256 = "60198cbb34066b9b5c1613d15c0479f6cd25f4aef42f7ec515cd1cc13a77fede",
  build_file = "BUILD.libgit2",
)

Unfortunately, bazel build @com_github_libgit2//... fails with

Unhandled exception thrown during build; message: Unrecoverable error while evaluating node 'REPOSITORY_DIRECTORY:@com_github_libgit2' (requested by nodes 'REPOSITORY:@com_github_libgit2')
INFO: Elapsed time: 3.331s
java.lang.RuntimeException: Unrecoverable error while evaluating node 'REPOSITORY_DIRECTORY:@com_github_libgit2' (requested by nodes 'REPOSITORY:@com_github_libgit2')
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1070)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:474)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /home/nelhage/.cache/bazel/_bazel_nelhage/63183be0b56fd73a3d972912805a7bbe/external/com_github_libgit2/tests/resources/status/这
        at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
        at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
        at java.io.File.toPath(File.java:2234)
        at com.google.devtools.build.lib.bazel.repository.CompressedTarFunction.decompress(CompressedTarFunction.java:69)
        at com.google.devtools.build.lib.bazel.repository.DecompressorValue.decompress(DecompressorValue.java:76)
        at com.google.devtools.build.lib.bazel.repository.NewHttpArchiveFunction.fetch(NewHttpArchiveFunction.java:70)
        at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:155)
        at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:1016)
        ... 4 more

The problem seems to be proximally caused by the fact that bazel forces itself into a latin-1 locale:

bazel/src/main/cpp/blaze.cc, lines 1718 to 1724 in 936c2c2:

    // Make the JVM use ISO-8859-1 for parsing its command line because "blaze
    // run" doesn't handle non-ASCII command line arguments. This is apparently
    // the most reliable way to select the platform default encoding.
    setenv("LANG", "en_US.ISO-8859-1", 1);
    setenv("LANGUAGE", "en_US.ISO-8859-1", 1);
    setenv("LC_ALL", "en_US.ISO-8859-1", 1);
    setenv("LC_CTYPE", "en_US.ISO-8859-1", 1);

UnixPath.encode will happily encode UTF-8 paths in a UTF-8 locale.
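To illustrate the encoding point, here is a minimal sketch in Python rather than Java. It assumes Python's codecs are a fair stand-in for the charset the JVM picks up from the locale, which is an analogy and not a trace of what UnixPath.encode does internally. The helper name `can_encode` is made up for this example.

```python
def can_encode(name: str, charset: str) -> bool:
    """Return True if every character of `name` exists in `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Last path component from the InvalidPathException message above.
filename = "这"

for charset in ("iso-8859-1", "utf-8"):
    # ISO-8859-1 only covers code points U+0000..U+00FF, so the CJK
    # character cannot be mapped; UTF-8 can represent any code point.
    print(charset, can_encode(filename, charset))
```

The same check fails for ISO-8859-1 on any filename with a character outside the Latin-1 range, regardless of whether a rule ever refers to that file.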

I think this is distinct from #374 in that I don't even need to reference the problematic files in a rule; bazel can't even unpack the tarball, even though I don't care about the files in question.


bscarlet · 2017-09-16T08:41:48Z

I just ran into this too.

jfaust · 2017-11-18T00:15:15Z

This is causing problems for me with libvips, whose release archive contains a filename with Russian characters.

Add a test verifying that http_archive can extract a tar archive containing unicode characters. While such files cannot be referred to by labels, it is still important that the archive can be extracted.

Also fix that use case on Darwin, by appropriately reencoding the string, so that the Files java standard library can encode it back to what we had in the first place.

Work-around for #1653, showing that http_archive from @bazel_tools can be used; however, the issue still remains for zip archives.

Change-Id: If944203bf618c21705af676347d8591ab015d559
PiperOrigin-RevId: 183987726

*** Reason for rollback ***

Breaks on our CI Linux machines (but works on our work desktop Linux machines); apparently, even our own Linux machines are too different from each other...

Fixes #4557

*** Original change description ***

http_archive: verify that unicode characters are OK in tar archives

Add a test verifying that http_archive can extract a tar archive containing unicode characters. While such files cannot be referred to by labels, it is still important that the archive can be extracted.

Also fix that use case on Darwin, by appropriately reencoding the string, so that the Files java standard library can encode it back to what we had in the first place.

Work-around for #1653, showing that http_archive from @bazel_tools can be used; however, the issue still remains for zip archives.

***

PiperOrigin-RevId: 184132385

steeve · 2018-05-21T19:03:15Z

This is still an issue as of Bazel 0.13 with:

    native.new_git_repository(
        name = "com_github_mosra_corrade",
        remote = "https://github.com/mosra/corrade.git",
        commit = "10a9abeca9938091edbf8a7fdf7cd6a944ce01c4",
        build_file = "//:third_party/com_github_mosra_corrade/BUILD.bazel.in",
    )

katre · 2018-05-22T09:24:53Z

We are deprecating the native versions of http_archive and git_repository. Have you tried the Skylark-based versions? Do they have the same error?

To use the Skylark new_git_repository, just add this to your WORKSPACE:

load(
    "@bazel_tools//tools/build_defs/repo:git.bzl",
    "git_repository",
    "new_git_repository",
)

steeve · 2018-05-22T09:26:12Z

Yes, I originally tried it with the Skylark version.


nelhage · 2018-09-02T02:31:53Z

This no longer crashes with the Starlark rules, but it still fails during extraction:

ERROR: Malformed input or input contains unmappable characters: /home/nelhage/.cache/bazel/_bazel_nelhage/9608ccc15fa131b938e0090f75bd00b5/external/com_github_libgit2/tests/resources/status/这

I thought I could work around it with patch_cmds = ["rm -rf tests"], but it appears not to even get as far as running that command.

EricCousineau-TRI · 2018-10-11T18:37:47Z

Also affected, using Starlark http_archive with Bazel 0.16.1, trying to clone from sphinx:

ERROR: Analysis of target '//tools/workspace/sphinx:sphinx_build' failed; build aborted: no such package '@sphinx//': Malformed input or input contains unmappable characters: .../external/sphinx/tests/roots/test-image-glob/testimäge.png

Same as above; using patch_cmds does not work.

I may try writing a workaround Python script, run via repository_ctx, that manually scrubs these filenames when extracting.

EricCousineau-TRI · 2018-10-11T20:12:18Z

Here's a simple workaround using a Python script:
https://github.com/RobotLocomotion/drake/blob/d47cec4896b6278873698de80b6dffb3ff842c81/tools/workspace/python_extract.py
https://github.com/RobotLocomotion/drake/blob/d47cec4896b6278873698de80b6dffb3ff842c81/tools/workspace/http_archive_python.bzl

Example usage:
https://github.com/RobotLocomotion/drake/blob/d47cec4896b6278873698de80b6dffb3ff842c81/tools/workspace/sphinx/repository.bzl#L8-L21
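For readers who cannot follow the links, here is a rough sketch of what an extraction script of this kind might look like. It is not the Drake script linked above; the function names `is_ascii` and `extract_ascii_only` are hypothetical, and skipping non-ASCII members entirely (rather than renaming them) is an assumption chosen to keep the example short. It also does nothing to guard against archives with hostile paths.

```python
import tarfile


def is_ascii(name: str) -> bool:
    """Return True if `name` contains only ASCII characters."""
    try:
        name.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def extract_ascii_only(archive_path, dest_dir):
    """Extract members with ASCII-only names and return the skipped names.

    Hypothetical helper: members whose names contain non-ASCII characters
    are never written to disk, so the caller never has to map them to a
    path in a restricted locale.
    """
    skipped = []
    with tarfile.open(archive_path) as tar:
        keep = []
        for member in tar.getmembers():
            if is_ascii(member.name):
                keep.append(member)
            else:
                skipped.append(member.name)
        # Only the filtered members are extracted.
        tar.extractall(dest_dir, members=keep)
    return skipped
```

A repository rule could presumably download the archive first and then run a script like this through repository_ctx.execute, so that Bazel's own extractor never sees the problematic names.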

ashi009 · 2019-01-15T09:29:25Z

Is there any progress on fixing this? It's understandable that this is not an issue inside Google, but it is apparently a huge bummer for many community users.

Given that this is a 1-year-old P1 bug, shall we provide some workarounds for now? E.g., adding an http_repository attribute to skip files with non-ASCII names or to skip certain directories.

dslomov · 2019-03-21T17:58:52Z

Let's make sure extract and download_and_extract work.

aehlig · 2019-03-25T14:26:52Z

Interestingly enough, the problem is very sensitive to tiny changes in the environment; https://bazel-review.googlesource.com/c/bazel/+/93754 passes on my corp desktop, but fails on our CI machines.

benjaminp · 2019-03-25T15:16:49Z

Yes, see #7757 for a bit of an explanation.

philwo added type: bug P2 We'll consider working on this in future. (Assignee optional) labels Aug 16, 2016

damienmg modified the milestones: 0.6, 0.7 Dec 12, 2016

damienmg added the category: extensibility > external repositories label Dec 12, 2016

j3parker mentioned this issue Nov 14, 2017

Unicode in external repo filenames causes uncaught exception #4084

Closed

dslomov added P1 I'll work on this now. (Assignee required) external-repos-triaged and removed P2 We'll consider working on this in future. (Assignee optional) labels Jan 9, 2018

dslomov assigned aehlig Jan 22, 2018

drigz mentioned this issue May 15, 2018

direct_run changes locale #5203

Closed

EricCousineau-TRI mentioned this issue Oct 11, 2018

[DNM] Attempt at Sphinx 1.8.0 inclusion via source RobotLocomotion/drake#9654

Closed

nelhage mentioned this issue Feb 17, 2019

switch to boost 1.69 nelhage/rules_boost#109

Closed

dslomov added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed category: extensibility > external repositories labels Mar 21, 2019

dslomov removed the external-repos-triaged label Mar 21, 2019

bazel-io closed this as completed in 191c7bd Mar 28, 2019

philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020

philwo removed the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Nov 29, 2021

novas0x2a mentioned this issue Oct 11, 2023

Malformed input or input contains unmappable characters (regression in 6.4.0rc1) #19798

Closed
