Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-utf-8 file names #210

Closed
bluss opened this issue Nov 1, 2016 · 4 comments
Closed

Support non-utf-8 file names #210

bluss opened this issue Nov 1, 2016 · 4 comments
Labels
bug A bug.

Comments

@bluss
Copy link

bluss commented Nov 1, 2016

File names on linux are just bags of bytes. Regular shell tools work with them no problem:

$ cat broken� 
hi
$ ls -d broken�
broken?
$ grep "hi" -n broken� 
1:hi

The main annoyance is that it poisons the whole directory with ripgrep:

$ rg hi *
Argument 'broken�' is not valid UTF-8. Use hex escape sequences to match arbitrary bytes in a pattern (e.g., \xFF).

Doing the right thing should be easy - an argument is always an OsStr when it enters the rust program, which is losslessly converted to a Path.

@BurntSushi
Copy link
Owner

BurntSushi commented Nov 1, 2016

Yup, this is a silly bug. Note that if you run rg hi instead of rg hi *, then it works. The specific problem is this very ill-advised code in src/args.rs:

    pub fn parse() -> Result<Args> {
        // Get all of the arguments, being careful to require valid UTF-8.
        let mut argv = vec![];
        for arg in env::args_os() {
            match arg.into_string() {
                Ok(s) => argv.push(s),
                Err(s) => {
                    errored!("Argument '{}' is not valid UTF-8. \
                              Use hex escape sequences to match arbitrary \
                              bytes in a pattern (e.g., \\xFF).",
                              s.to_string_lossy());
                }
            }
        }
        let mut raw: RawArgs =
            Docopt::new(USAGE)
                .and_then(|d| d.argv(argv).version(Some(version())).decode())
                .unwrap_or_else(|e| e.exit());

The specific reason why I did this was to catch invalid UTF-8 in patterns, but of course, this catches invalid UTF-8 everywhere.

Finally, Docopt has this:

    fn get_argv() -> Vec<String> {
        // Hmm, we should probably handle a Unicode decode error here... ---AG
        ::std::env::args().skip(1).collect()
    }

... and the argv parser is built around &str instead of &OsStr.

Sigh. I should probably fix this inside of Docopt, but I also want to switch to clap. See #136.

@BurntSushi BurntSushi added the bug A bug. label Nov 1, 2016
@bluss
Copy link
Author

bluss commented Nov 1, 2016

Yes, clap makes it possible to do this correctly, it's nice that way.

@bluss
Copy link
Author

bluss commented Nov 1, 2016

The reason I got to this point was that I'm trying to make a way to list matches in rg in latest-modified-first order. First trying to just feed it files in that order.

@BurntSushi
Copy link
Owner

Oh interesting. You'll also need to pass -j1. If you do that, then the results should come back in the order you gave them.

BurntSushi added a commit that referenced this issue Nov 13, 2016
There were two important reasons for the switch:

1. Performance. Docopt does poorly when the argv becomes large, which is
   a reasonable common use case for search tools. (e.g., use with xargs)
2. Better failure modes. Clap knows a lot more about how a particular
   argv might be invalid, and can therefore provide much clearer error
   messages.

While both were important, (1) made it urgent.

Note that since Clap requires at least Rust 1.11, this will in turn
increase the minimum Rust version supported by ripgrep from Rust 1.9 to
Rust 1.11. It is therefore a breaking change, so the soonest release of
ripgrep with Clap will have to be 0.3.

There is also at least one subtle breaking change in real usage.
Previous to this commit, this used to work:

    rg -e -foo

Where this would cause ripgrep to search for the string `-foo`. Clap
currently has problems supporting this use case
(see: clap-rs/clap#742),
but it can be worked around by using this instead:

    rg -e [-]foo

or even

    rg [-]foo

and this still works:

    rg -- -foo

This commit also adds Bash, Fish and PowerShell completion files to the
release, fixes a bug that prevented ripgrep from working on file
paths containing invalid UTF-8 and shows short descriptions in the
output of `-h` but longer descriptions in the output of `--help`.

Fixes #136, #189, #210, #230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

No branches or pull requests

2 participants