UTF-16 Support #914

Nephyrin · 2016-05-30T17:44:32Z

Encoding a file as UTF-16, either little or big endian, with or without a BOM, results in AG treating it as binary.

$ echo Hello > test.txt
$ ag Hello test.txt
1:Hello
$ emacs test.txt --batch --eval "(progn (set-buffer-file-coding-system 'utf-16) (save-buffer))"
$ ag Hello test.txt
$ ag --search-binary H.e.l.l.o test.txt
Binary file test.txt matches.
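
For anyone wondering why H.e.l.l.o matches while Hello does not: UTF-16 stores each ASCII character as two bytes, one of which is NUL. Below is a minimal sketch of that byte layout. It uses Python's re module as a stand-in for the PCRE engine ag uses, so it is an illustration under that assumption, not ag's actual code path.

```python
import re

text = "Hello\n"
le = text.encode("utf-16-le")  # b'H\x00e\x00l\x00l\x00o\x00\n\x00'
be = text.encode("utf-16-be")  # b'\x00H\x00e\x00l\x00l\x00o\x00\n'

plain = re.compile(rb"Hello")       # expects the letters to be adjacent
dotted = re.compile(rb"H.e.l.l.o")  # '.' absorbs the interleaved NUL bytes

plain_hits = [bool(plain.search(b)) for b in (le, be)]
dotted_hits = [bool(dotted.search(b)) for b in (le, be)]
print(plain_hits, dotted_hits)  # [False, False] [True, True]
```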

I've uploaded a test utf16.txt file here:
https://nemu.pointysoftware.net/sink/utf16.txt

Nephyrin · 2016-05-30T17:47:20Z

... And here's one in little endian with a byte order mark (specifically, utf-16le-with-signature-unix in emacs):
https://nemu.pointysoftware.net/sink/utf-16le-with-signature-unix.txt

bmalehorn · 2016-07-13T06:37:51Z

If we support UTF-16, someone will want UTF-16LE, then Windows-8859, and before long someone's asking for KOI8-R. Now we need some function, convert_any_encoding_to_utf8(), which might exist in some library but would slow ag down a lot. About 30% of ag's time is already spent just trying to detect UTF-8, so detecting 10 different encodings would certainly make it much slower.

Now say you get a match. Better print out the line that matches, right? It would be more accurate to print out the raw UTF-16 bytes, but that will probably break your terminal, since terminals almost universally expect UTF-8.

And say you run

ag "$(cat search-utf16.txt)" utf-16.txt

Well, search-utf16.txt is UTF-16, but that should be fine, right? Wrong. UTF-16 text can contain \0 bytes, so you can't read it out of your char **argv, whose strings are null-terminated.
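
A small sketch of that truncation. The helper as_c_string is hypothetical and only simulates how C code reading a NUL-terminated argv entry would see the bytes; it is not taken from ag.

```python
def as_c_string(raw: bytes) -> bytes:
    """Return the bytes a C caller sees before the first NUL terminator."""
    return raw.split(b"\x00", 1)[0]

pattern_le = "Hello".encode("utf-16-le")  # b'H\x00e\x00l\x00l\x00o\x00'
pattern_be = "Hello".encode("utf-16-be")  # b'\x00H\x00e\x00l\x00l\x00o'

seen_le = as_c_string(pattern_le)  # only the first letter survives
seen_be = as_c_string(pattern_be)  # nothing survives: the first byte is NUL
print(seen_le, seen_be)
```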

I would say a reasonable feature for ag would be to see a UTF-16 BOM and then assume the file is UTF-16. But that would require pulling in some Unicode library, so it's probably not worth it.
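
The BOM check itself is just a prefix comparison. Here is a minimal sketch, assuming a hypothetical sniff_bom helper and using the BOM constants from Python's standard codecs module. It is not ag's code, and it does nothing for files saved without a BOM.

```python
import codecs

# Check the longest BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_LE + "Hello\n".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
decoded = sample[skip:].decode(encoding)
print(encoding, repr(decoded))
```

The same check could also back the warning requested later in the thread, since it tells the tool that a file is probably UTF-16 even if it then chooses to skip it.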

tl;dr - Encodings other than UTF-8 are fundamentally anti-unix. Just save your files as UTF-8.

Nephyrin · 2016-08-26T23:11:34Z

I'm not sure that POSIX terminal encoding being UTF-8 discounts the existence of use-cases for UTF-16 files, though they're definitely rare. I mean, sure, supporting another Unicode encoding is a slippery slope to a Navajo translation layer, but UTF-16 files are the preferred format for non-English localization files, which happen to be part of our codebase. Not all use-cases for grep-like tools are on small git repositories you have full control over :(

It would be nice, at least, if encountering a BOM emitted a warning that the file was skipped instead of silently searching it as UTF-8. When I originally made this bug it was because I spent some time confused as to why I couldn't locate a string that I knew was there before I back-filled that it was in a localization file and ag would silently ignore it.

I do agree that it is a pretty low priority request.

JFLarvoire · 2017-03-07T13:20:04Z

> Now we need some function, convert_any_encoding_to_utf8()

I don't think this is the right way to do that. The performance hit would be huge.
Instead, it'd be better to do it the other way around: every time a new file encoding is encountered, transcode the search string into that new encoding, compile the regexp, and run it against the new file. Thereafter, we'd reuse whichever compiled regexp matches the encoding of each subsequent file.
This would be particularly useful for the Windows version of ag, which is likely to encounter 3 different text file encodings: ANSI, UTF-8 and UTF-16. Requiring Windows users to only use UTF-8 is not realistic.
Now UTF-16 is a particular case that will cause further problems of its own: to run 16-bit regular expressions, we need the 16-bit version of PCRE. So ag would have to be linked with both the 8-bit and 16-bit versions of PCRE. This will require using prefixes to avoid name collisions.
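
The per-encoding idea above can be sketched roughly as follows. This assumes a hypothetical PatternCache class and handles literal search strings only. It sidesteps the 16-bit PCRE question by matching on raw bytes with Python's re module, so treat it as an illustration of the caching scheme, not a design for ag.

```python
import re

class PatternCache:
    """Compile the search string once per encoding and reuse the result."""

    def __init__(self, needle: str):
        self.needle = needle
        self._compiled = {}  # encoding name -> compiled bytes pattern

    def for_encoding(self, encoding: str):
        if encoding not in self._compiled:
            raw = self.needle.encode(encoding)  # transcode the needle, not the file
            self._compiled[encoding] = re.compile(re.escape(raw))
        return self._compiled[encoding]

    def search(self, data: bytes, encoding: str, unit: int = 1) -> bool:
        """Report a match that starts on a code-unit boundary (unit=2 for UTF-16)."""
        pattern = self.for_encoding(encoding)
        pos = 0
        while (m := pattern.search(data, pos)) is not None:
            if m.start() % unit == 0:
                return True
            pos = m.start() + 1  # misaligned hit, keep looking
        return False

cache = PatternCache("Hello")
hit_utf8 = cache.search(b"say Hello\n", "utf-8")
hit_le = cache.search("say Hello\n".encode("utf-16-le"), "utf-16-le", unit=2)
# The LE byte pattern does occur in BE data, but only at an odd offset.
hit_wrong = cache.search("say Hello\n".encode("utf-16-be"), "utf-16-le", unit=2)
print(hit_utf8, hit_le, hit_wrong)  # True True False
```

The alignment check matters because a byte-level match inside UTF-16 data can straddle two code units. Real regular expressions (character classes, quantifiers) cannot be transcoded this simply, which is where the 16-bit PCRE build mentioned above would come in.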

Nephyrin mentioned this issue Sep 24, 2016

add support for other text encodings BurntSushi/ripgrep#1

Closed

AnrDaemon mentioned this issue May 23, 2017

Relax definition of a "text file" for standard types. #1085

Open
