add support for other text encodings #1

Closed
BurntSushi opened this issue Sep 9, 2016 · 30 comments
Labels
enhancement An enhancement to the functionality of the software. question An issue that is lacking clarity on one or more points.

Comments

@BurntSushi
Owner

Right now, ripgrep only supports reading UTF-8 encoded text (even if some of it is invalid). In my estimation, this is probably good enough for the vast majority of use cases.

However, it may be useful to search other encodings. I don't think I'd be willing to, say, modify the regex engine itself to support other encodings, but if it were easy to do transcoding on the fly, then I think it wouldn't add too much complexity. The encoding_rs project in particular appears to support this type of text decoding.

Some thoughts/concerns:

  • Transcoding would require using the incremental searcher as opposed to the memory map searcher. (Which is fine.)
  • Transcoding requires picking a source encoding, and doing this seems non-trivial. You might imagine the CLI growing a new flag that specifies a text encoding, but what happens if you want to search directories containing files with many different types of text encodings? Do we need a more elaborate system? I'm inclined towards not, since I think the juice probably isn't worth the squeeze.
@BurntSushi BurntSushi added enhancement An enhancement to the functionality of the software. question An issue that is lacking clarity on one or more points. labels Sep 9, 2016
@nabijaczleweli

TBH I'd just ignore (and/or die violently on) anything that doesn't map to UTF-8, which is the de facto standard in the Rust world.
Trying to support non-UTF-8 encodings seems more trouble than it's worth, IMO.

@BurntSushi
Owner Author

BurntSushi commented Sep 22, 2016

@nabijaczleweli ripgrep isn't just for Rust people, though. There are plenty of non-English-speaking countries where UTF-8 isn't quite everywhere. Enough folks care about this that at least a couple of search tools implement searching files that aren't UTF-8; The Platinum Searcher and highway come to mind. Of course, both grep and git grep support other encodings as well. I'm also told that Windows has lots of UTF-16 data.

Anyway, it's totally possible that this may never happen. But it's also a somewhat feasible thing to support without sacrificing performance for the UTF-8 case (at the cost of some code complexity). It is honestly somewhat low on my list of things I'd like to do with ripgrep.

@BurntSushi
Owner Author

(To be clear, the limit to what I'd be willing to support are text encodings that can be feasibly transcoded to UTF-8.)

@alexandernst

How do you plan to detect the encoding of the text that is being ripgrep-ed in order to correctly decode it (or transcode it to utf8)?

@BurntSushi
Owner Author

@alexandernst That is precisely one of the problems described in this issue. ;-)

@lambda
Contributor

lambda commented Sep 24, 2016

Just yesterday, I was given a dump of log files from a Windows server and spent a while figuring out why I couldn't find a string that someone else reported as being in one of the files when I searched with grep -r. It took me a bit to work out that while most of the files were in some ASCII-compatible encoding, a few of them were in UTF-16.

For a tool that does recursive grepping, a lot of times what you have is some big directory tree, and you don't really know what's in it, and so you're looking for some needle in that big haystack. Not knowing what's in it can include not knowing the encoding.

Right now, what rg does is skip UTF-16 files as binary, and not report anything about them, so you wouldn't notice if you missed files due to this issue.

I think you do want something that applies a heuristic to determine the encoding, though I'm not sure you want that by default.

I think the way I would go about supporting this is by reporting files that could be text but are not valid UTF-8: that is, files it searches that contain invalid UTF-8, and files it detects as binary but which could be UTF-16, UTF-16BE, or UTF-16LE, or possibly other encodings. If you don't at least report something, you won't know that you're missing matches due to an encoding issue, but I'm not sure you'd want to do any kind of automatic encoding detection by default.

Then you could add a flag that will do heuristic encoding detection, based on BOM for UTF-16, or even/odd nulls hack for UTF-16 BE or LE, or maybe even character frequency based detection for other encodings.

@BurntSushi
Owner Author

@lambda I agree we need to do something, but we are definitely limited. Things like "report if a file isn't valid UTF-8" are probably off the table, because that would be way too expensive. We're probably limited to scanning the first few hundred or thousand bytes or so. Right now, it's a simple memchr for the NUL byte to determine whether a file is "binary" or not, to give you an idea of the scale. Reporting the existence of every binary file seems a bit... extreme. Hmm. I'll noodle on it.
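To make the scale concrete, here is a minimal sketch of that kind of check (an illustration only, not ripgrep's actual code), assuming the memchr crate: scan a bounded prefix of the file and call it "binary" if a NUL byte shows up.

    use memchr::memchr;

    /// Hypothetical heuristic: treat a file as binary if a NUL byte appears
    /// in the first few KB of its contents. Cheap, but note that UTF-16 text
    /// is full of NUL bytes, so it gets classified as binary and skipped.
    fn looks_binary(prefix: &[u8]) -> bool {
        let n = prefix.len().min(8 * 1024);
        memchr(0, &prefix[..n]).is_some()
    }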

@rr-

rr- commented Sep 24, 2016

To me the inability to rg --binary "\xDE\xAD\xBE\xEF" is an absolute deal breaker. I work with huge binary files a lot and I often want to search for raw byte strings. I don't want to use two tools at once (rg for text, ag for binary).

Also grepping for anything in Japanese companies, be it source code or text documents, almost always means you deal with Shift-JIS.

TBH I wouldn't expect rg to be able to detect encoding on its own, I'd be plenty happy with being able to manually guide it with --binary, --encoding=sjis etc.

@BurntSushi
Owner Author

BurntSushi commented Sep 24, 2016

@rr- You can do that; it's just not documented in ripgrep (it's buried inside the regex engine documentation, which obviously isn't discoverable to end users; I need to work on that). Try this instead:

rg '(?-u)\xDE\xAD\xBE\xEF'

If you did try this:

rg '\xDE\xAD\xBE\xEF'

Then each hex escape is treated as a Unicode codepoint with its corresponding UTF-8 encoding. But in the former case, we disable Unicode support and get raw byte support. Unicode can actually be selectively re-enabled, so if you wanted a Unicode-aware \w in the middle of some bytes somewhere, this would work:

rg '(?-u)\xDE\xAD(?u:\w+)\xBE\xEF'
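For completeness, the same escape hatch is available from Rust code through the regex crate's byte-oriented API; a small sketch of my own (not ripgrep's internals), assuming the regex crate:

    use regex::bytes::Regex;

    fn main() {
        // `(?-u)` disables Unicode mode, so `\xDE` matches the raw byte 0xDE
        // rather than the UTF-8 encoding of U+00DE; `(?u:...)` re-enables
        // Unicode just for the inner group.
        let re = Regex::new(r"(?-u)\xDE\xAD(?u:\w+)\xBE\xEF").unwrap();
        let haystack = b"\xDE\xADfoo\xBE\xEF";
        assert!(re.is_match(haystack));
    }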

@BurntSushi
Owner Author

BurntSushi commented Sep 24, 2016

But yes, I think everyone understands that only supporting UTF-8 (errm, ASCII compatible text encodings, technically) might be good enough for the vast majority of users, but is definitely going to be a deal breaker for many as well. This issue really doesn't need more motivation, it needs feasible ideas that can be implemented, and those need to take into account UX, performance and actual correctness. (Thanks @lambda for getting that started.)

@rr-

rr- commented Sep 24, 2016

OK how about this:

  1. Expose --encoding, which defaults to utf-8.
  2. With --encoding=utf16le, the pattern test is internally encoded as t\0e\0s\0t\0 (see the sketch below).
  3. rg treats the file as binary and searches for this literal pattern as a sequence of bytes.

There are obviously going to be problems around UTF-8's ability to represent ö either as LATIN SMALL LETTER O WITH DIAERESIS or as LATIN SMALL LETTER O + COMBINING DIAERESIS, as well as inconsistencies in the Shift-JIS encoding of characters such as REVERSE SOLIDUS, but maybe it's good enough?
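A minimal sketch of the pattern-encoding step in (2), using only the standard library (a hypothetical helper, not something ripgrep provides):

    /// Re-encode a literal pattern as UTF-16LE bytes, e.g. "test" -> t\0e\0s\0t\0.
    fn to_utf16le_bytes(pattern: &str) -> Vec<u8> {
        pattern
            .encode_utf16()
            .flat_map(|unit| unit.to_le_bytes())
            .collect()
    }

    fn main() {
        assert_eq!(to_utf16le_bytes("test"), b"t\0e\0s\0t\0".to_vec());
    }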

@BurntSushi
Owner Author

Well, the huge problem with that is that things like \w won't work. This is why I made the stipulation of transcoding from X to UTF-8 and then matching. I mean, an explicit --encoding flag is basically an answer to this problem: it's exactly what I suggested way above. But it has downsides: what if your directory tree has some UTF-16 and some UTF-8?

Certainly, we can throw our hands in the air, require specific instruction of encoding by the end user and call it a day. But it doesn't really help getting at the unknown-unknowns that @lambda mentioned. Making unknown-unknowns discoverable is an important feature IMO. But perhaps it is orthogonal to text encoding support.

@Nephyrin

There's some similar discussion of UTF-16 for Ag @ ggreer/the_silver_searcher#914

Using byte order marks to detect other UTF encodings may work. At some point, if there's no metadata to indicate the encoding, a tool can only be so magical.

@puetzk

puetzk commented Sep 26, 2016

As far as specifying encoding: .gitattributes supports

pattern encoding=name

to control its supporting tools, and you're already parsing git-style patterns for ignore. So that seems like a good way of supplying the encoding in complex cases...
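For example (hypothetical patterns and encoding labels, assuming a tool that honors such an attribute), a .gitattributes file might contain lines like:

    *.log      encoding=UTF-16LE
    legacy/**  encoding=Shift_JIS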

That would make an --encoding=name command-line switch equivalent to appending a "* encoding=name" line (though you might also want a --default-encoding that "prepends" instead).

I'm not sure how best to combine this with recognizing BYTE ORDER MARK, though, and I agree that would also be very nice to have.

@emk
Contributor

emk commented Oct 8, 2016

I'm not sure whether it would be useful or not, but I do maintain a Rust wrapper around uchardet, which detects the most common character encodings. It was extracted from Mozilla. In my experience it works quite well for identifying UTF-8 versus UTF-16 (either endianness) and some other common character encodings, but it gets a little dodgier as you go further afield.

Any such character detection library is obviously statistical in nature, and it will fail on some files.

My Rust wrapper includes a copy of the uchardet source code that hasn't been synced with upstream in a while, but I'd be happy to do so (it's just a git submodule resolved at build time), and to make any necessary feature improvements.

@BurntSushi
Owner Author

@emk Sorry for the late response. That's a neat idea! But I'd like to discourage using C dependencies beyond what we get with libc. In any case, it will be some time before I get to this, and I'd probably like to at least start without automatic detection. After that, depending on how much complexity it adds, I might be willing to expose it under an optional cargo feature.

(My hope is to move most of the search code out into a separate crate maintained in this repo. Once that's done, we can start talking about APIs.)

@alejandro5042
Contributor

alejandro5042 commented Oct 28, 2016

Another vote for this: PowerShell on Windows unfortunately writes UTF-16 when piping to a file (Out-File) [1]. That means a common case like this fails:

"hi" > test.txt
rg hi

Matches nothing.

[1] In a classic case of inconsistency, it writes ASCII using Set-Content 😢

@emk
Contributor

emk commented Oct 28, 2016

The UTF-16LE encoding used on Windows (or rather, UCS-2LE, if that's still a thing) is an especially important use-case, and one of the easiest to detect. The typical steps can be seen in this uchardet source file:

  1. Check for Unicode Byte Order Marks, which will normally work well for Windows files.
  2. Check for alternating (mostly-)ASCII and 0 bytes, which will detect a large fraction of the remaining files.
  3. Fall back to more sophisticated character set probers and heuristics to deal with harder cases and other encodings.

@Bobo1239 is currently working to update my Rust uchardet bindings to build against the latest uchardet and to support Windows. But it wouldn't take much to implement a pure-Rust solution that could detect UTF-8 and UTF-16.
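As a rough, hand-rolled illustration of steps 1 and 2 (not the uchardet code, and not anything ripgrep ships), the sniffing could look roughly like this:

    /// Guess whether a file prefix looks like UTF-16: first via a BOM, then
    /// via the "every other byte is NUL" shape of mostly-ASCII UTF-16 text.
    fn sniff_utf16(prefix: &[u8]) -> Option<&'static str> {
        if prefix.starts_with(&[0xFF, 0xFE]) {
            return Some("UTF-16LE");
        }
        if prefix.starts_with(&[0xFE, 0xFF]) {
            return Some("UTF-16BE");
        }
        let pairs = prefix.len() / 2;
        let even_nuls = prefix.iter().step_by(2).filter(|&&b| b == 0).count();
        let odd_nuls = prefix.iter().skip(1).step_by(2).filter(|&&b| b == 0).count();
        // Arbitrary thresholds for illustration: nearly all NULs on one side,
        // none on the other, over a reasonably sized sample.
        if pairs >= 16 && even_nuls == 0 && odd_nuls * 10 >= pairs * 9 {
            return Some("UTF-16LE");
        }
        if pairs >= 16 && odd_nuls == 0 && even_nuls * 10 >= pairs * 9 {
            return Some("UTF-16BE");
        }
        None
    }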

@emk
Contributor

emk commented Oct 29, 2016

@BurntSushi mentioned this issue elsewhere (#7) and wrote:

The problem then comes with the output though. Do you output in the same encoding as the input? If so, that sounds pretty hairy. (Possibly even intractable.)

Is there any reason you couldn't just use Rust's standard output facilities? For a first version, detect at least UTF-16LE/BE, transcode to UTF-8 for searching, and allow Rust to output it the same way as existing strings? You pretty much need to choose a single output encoding in any case, and it might as well be whatever Rust generates for you.

@BurntSushi
Owner Author

@emk Rust's standard output facilities are encoding agnostic, but that's orthogonal. Here's my concern more explicitly stated:

  1. With respect to UX: should the match output of ripgrep correspond to the original encoding or to the transcoded UTF-8?
  2. If the answer to (1) is "the original encoding", how do you do it?

@emk
Contributor

emk commented Oct 29, 2016

@BurntSushi Good questions, and tricky ones. :-)

As far as I can see, the fundamental problem with output in the original encoding is figuring out what to do if you find matches in files that use several different encodings. In that case, you pretty much need to choose a single encoding for output. This seems like a major force pushing towards keeping all output in UTF-8. I suppose the counter argument is that if the input from stdin is in UTF-16LE, then it's arguably nicer to use UTF-16LE for stdout as well. If you want to do that, the encoding crate can potentially do a fair bit of heavy lifting here, especially if you want to support a wide range of encodings.

I originally made Rust bindings for uchardet for use with substudy, which has to deal with quite a few encodings. In my experience, it's possible to detect and transcode the most common encodings automatically, and it works fairly well.

Now that I'm thinking about substudy, it provides an interesting data point: I have foreign-language subtitle files in multiple legacy encodings. In a perfectly ideal world, I would love ripgrep to correctly identify ISO Latin-1 or Windows-1252, but transcode everything to UTF-8 for matching and output, because my terminal is configured to display UTF-8, not Windows-1252. Similarly, if I write a smart quote in my pattern, then I would want it to match smart quotes in both Windows-1252 and UTF-8 files. But this would be fairly ambitious.

Overall, it seems like UTF-16LE is the most important case to handle (for Windows support), and it's one of the easiest to implement (at least the basics).

@BurntSushi
Owner Author

As far as I can see, the fundamental problem with output in the original encoding is figuring out what to do if you find matches in files that use several different encodings. In that case, you pretty much need to choose a single encoding for output. This seems like a major force pushing towards keeping all output in UTF-8. I suppose the counter argument is that if the input from stdin is in UTF-16LE, then it's arguably nicer to use UTF-16LE for stdout as well. If you want to do that, the encoding crate can potentially do a fair bit of heavy lifting here, especially if you want to support a wide range of encodings.

I'd also like to evaluate encoding_rs.

I don't disagree with anything you've said. Although I'd like to continue to emphasize that we should attack this problem without any C dependencies first. For example, one could imagine a solution to this problem that has no automatic detection at all. That seems like a decent place to start.

Similarly, if I write a smart quote in my pattern, then I would want it to match smart quotes in both Windows-1252 and UTF-8 files. But this would be fairly ambitious.

So long as the smart quote is input as UTF-8 in the regex (which must be UTF-8) and transcoding is correct, then this should actually happen for free.

@emk
Contributor

emk commented Oct 29, 2016

I'd also like to evaluate encoding_rs.

Ah, I hadn't seen that yet! Thank you.

I don't disagree with anything you've said. Although I'd like to continue to emphasize that we should attack this problem without any C dependencies first.

If it helps, I'm happy to provide a small Rust crate that translates the essential bits of the main UTF-8 / UTF-16LE / UTF-16BE detector here. This will at least handle most modern text files on Linux/Windows/Mac, which is probably 98% of what people actually want. The lack of UTF-16LE searching, in particular, is a big limitation on Windows.

@BurntSushi
Owner Author

@emk That sounds like a great next step after we have the ability to search transcoded data. :-)

@emk
Contributor

emk commented Oct 29, 2016

Fair enough. :-)

Yes, I agree it would be good to implement the transcoding machinery before the detection code! If there's no auto-detection, what should the UI for selecting an input encoding look like?

As for automatic detection, looking at the source to the encoding detectors for the first time in a couple of years, the logic to detect the major encodings is surprisingly straightforward:

  1. If it has a BOM, use the BOM to identify it as UTF-8, UTF-16 or UTF-32. This is the only code path that will ever identify a file as UTF-16 or UTF-32, interestingly enough.
  2. If it contains only ASCII characters, identify it as ASCII.
  3. If it contains ASCII plus 0xA0 (nbsp), supposedly a semi-common case, identify it as ISO Latin 1.
  4. If none of the above match, run a UTF-8 prober. When working on substudy, I found this was usually overkill, because UTF-8 has a very distinctive and strict structure, and you can usually identify it simply by asking: "Does this contain at least one multi-byte UTF-8 sequence and does it parse as valid UTF-8?" If true, it's generally reasonable to assume UTF-8.

This will generally produce decent results for ASCII and the various UTF encodings. Given your "pure Rust" requirement, it's probably wisest not to try for anything more right away. All the necessary language models, etc., are available in uchardet, but there's a fair amount of fiddly code to handle all the weird encodings out there.
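A minimal sketch of steps 2 through 4 (setting the BOM case aside), leaning on the standard library's UTF-8 validation instead of a full prober; the enum and the whole function are made up for illustration:

    #[derive(Debug, PartialEq)]
    enum Guess {
        Ascii,
        Latin1,
        Utf8,
        Unknown,
    }

    fn guess_encoding(prefix: &[u8]) -> Guess {
        // Step 2: nothing but ASCII bytes.
        if prefix.iter().all(|&b| b < 0x80) {
            return Guess::Ascii;
        }
        // Step 3: ASCII plus 0xA0 (NO-BREAK SPACE) only -> likely ISO Latin 1.
        if prefix.iter().all(|&b| b < 0x80 || b == 0xA0) {
            return Guess::Latin1;
        }
        // Step 4: we already know there is at least one non-ASCII byte, so
        // "does it parse as valid UTF-8?" is usually a strong enough signal.
        // (A real version would tolerate a multi-byte sequence cut off at the
        // end of the prefix.)
        if std::str::from_utf8(prefix).is_ok() {
            return Guess::Utf8;
        }
        Guess::Unknown
    }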

@BurntSushi
Owner Author

I don't know what the UI should look like, but we need one. I would probably start by looking at iconv.


BurntSushi added a commit that referenced this issue Mar 12, 2017
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and
Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the `encoding_rs` crate supports, which is in
turn tied to the Encoding Standard. The full list of available encodings
can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be
automatically detected on a best-effort basis. Currently, the only
support for this is checking for a UTF-16 BOM. In all other cases, a
text encoding of `auto` (the default) implies a UTF-8 or ASCII
compatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character. In this case, ripgrep's output is
guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they
are printed).

In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected. In this
case, ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. The happy path ("UTF-8 everywhere") remains happy. I have not been
   able to observe any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively
   low. The cost here is transcoding itself. A potentially superior
   implementation might build decoding of any encoding into the regex
   engine itself. In particular, the fundamental problem with
   transcoding everything first is that literal optimizations are nearly
   negated.

Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX
might permit end users to specify multiple text encodings,
although this seems hard to pull off in an ergonomic way.

Fixes #1
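For readers wondering what that transcoding step looks like in miniature, here is a sketch using encoding_rs directly (ripgrep's actual implementation streams through a reader rather than decoding whole buffers; this helper is only illustrative):

    use encoding_rs::{Encoding, UTF_16LE};

    /// Decode raw bytes to UTF-8, honoring a BOM if present and otherwise
    /// falling back to `fallback`. Malformed sequences become U+FFFD, so the
    /// result is always valid UTF-8.
    fn to_utf8(raw: &[u8], fallback: &'static Encoding) -> String {
        let (text, _encoding_used, _had_errors) = fallback.decode(raw);
        text.into_owned()
    }

    fn main() {
        // "hi\n" as UTF-16LE with a BOM, the shape PowerShell's Out-File produces.
        let raw = [0xFF, 0xFE, b'h', 0x00, b'i', 0x00, b'\n', 0x00];
        assert_eq!(to_utf8(&raw, UTF_16LE), "hi\n");
    }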
@ssokolow

The Whirlwind tour section of the README still needs to be updated:

One last thing before we get started: ripgrep assumes UTF-8 everywhere. It can still search files that are invalid UTF-8 (like, say, latin-1), but it will simply not work on UTF-16 encoded files or other more exotic encodings. Support for other encodings may happen.

BurntSushi added a commit that referenced this issue Mar 14, 2017
@BurntSushi
Owner Author

@ssokolow Fixed! Thanks.

@batisteo

@BurntSushi You might want to edit your anti-pitch on your blog post.

If you need to search files with text encodings other than UTF-8 (like UTF-16), then ripgrep won’t work.

@BurntSushi
Owner Author

@batisteo Hmm right thanks, I always forget to update the blog.

BurntSushi added a commit that referenced this issue Mar 23, 2021
This increases the initial buffer size from 8KB to 64KB. This actually
leads to a reasonably noticeable improvement in at least one workload,
and is unlikely to regress in any other case. Also, since Rust programs
(at least on Linux) seem to always use a minimum of 6-8MB of memory,
adding an extra 56KB is negligible.

Before:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      2.109 s ±  0.012 s    [User: 565.5 ms, System: 1541.6 ms]
      Range (min … max):    2.094 s …  2.128 s    10 runs

After:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      1.802 s ±  0.006 s    [User: 462.3 ms, System: 1337.9 ms]
      Range (min … max):    1.795 s …  1.814 s    10 runs