UTF-16 Support #914

Open
Nephyrin opened this issue May 30, 2016 · 4 comments

Comments

@Nephyrin

Encoding a file as UTF-16, either little or big endian, with or without a BOM, results in AG treating it as binary.

$ echo Hello > test.txt
$ ag Hello test.txt
1:Hello
$ emacs test.txt --batch --eval "(progn (set-buffer-file-coding-system 'utf-16) (save-buffer))"
$ ag Hello test.txt
$ ag --search-binary H.e.l.l.o test.txt
Binary file test.txt matches.

I've uploaded a test utf16.txt file here:
https://nemu.pointysoftware.net/sink/utf16.txt
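
A likely explanation for the behaviour above, sketched in Python: UTF-16 stores every ASCII character with an accompanying zero byte, so the byte sequence `Hello` never appears contiguously, while `H.e.l.l.o` matches because `.` absorbs the zero bytes. The `looks_binary` helper is a hypothetical stand-in heuristic (any NUL byte means binary); ag's actual detection logic is not shown in this thread.

```python
import re

text = "Hello\n"
utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")  # b'H\x00e\x00l\x00l\x00o\x00\n\x00'
utf16be = text.encode("utf-16-be")  # b'\x00H\x00e\x00l\x00l\x00o\x00\n'


def looks_binary(data: bytes) -> bool:
    """Hypothetical heuristic: treat any buffer containing a NUL byte as binary."""
    return b"\x00" in data


# The plain pattern only matches the UTF-8 bytes.
plain_hits = [re.search(rb"Hello", d) is not None for d in (utf8, utf16le, utf16be)]

# The dotted pattern matches both UTF-16 variants, since '.' matches the NUL bytes.
dotted_hits = [re.search(rb"H.e.l.l.o", d) is not None for d in (utf16le, utf16be)]

print(plain_hits, dotted_hits, looks_binary(utf16le))
```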

@Nephyrin
Author

Nephyrin commented May 30, 2016

... And here's one in little endian with a byte order mark (Specifically, utf-16le-with-signature-unix in emacs):
https://nemu.pointysoftware.net/sink/utf-16le-with-signature-unix.txt

@bmalehorn

If we support UTF-16, someone will want UTF-16LE, then Windows-8859, and before long someone is asking for KOI8-R. Now we need some function, convert_any_encoding_to_utf8(), which might exist in some library but would slow ag down a lot. About 30% of ag's time is already spent trying to detect just UTF-8, so detecting 10 different encodings would certainly make it much slower.

Now say you get a match. Better print out the line that matches, right? The most accurate thing would be to print the UTF-16 bytes as they are, but that will probably break on your terminal (terminals almost universally expect UTF-8).

And say you run

ag "$(cat search-utf16.txt)" utf-16.txt

Well, search-utf16.txt is UTF-16, but that should be fine, right? Wrong: UTF-16 can contain \0, so you can't read it out of your char **argv, whose strings are null-terminated.
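
The truncation can be illustrated with a small sketch. It uses Python's `ctypes` to mimic how C code reads a null-terminated string; the variable names are illustrative only and nothing here is taken from ag's source.

```python
import ctypes

# "Hello" as UTF-16 little endian: every other byte is NUL.
pattern = "Hello".encode("utf-16-le")

# create_string_buffer copies the bytes and appends a terminating NUL,
# much like a C string handed to a program through argv.
buf = ctypes.create_string_buffer(pattern)

full_bytes = buf.raw    # the whole buffer, including embedded NULs
seen_by_c = buf.value   # what strlen-style code sees: stops at the first NUL

print(full_bytes)
print(seen_by_c)
```

Under these assumptions, code that treats the argument as a C string sees only the first byte of the pattern.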

I would say a reasonable feature for ag would be to see a UTF-16 BOM and then assume the file is UTF-16. But that would require pulling in some Unicode library, so it's probably not worth it.
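
As a rough sketch of the BOM idea, here is what the check could look like in Python, where the standard `codecs` module already provides the BOM constants and decoders (a C implementation would need its own conversion code or a library, as noted above). The names `sniff_bom` and `to_utf8` are hypothetical, not part of ag.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so checking UTF-16 first would misclassify it.
# Caveat: a UTF-16-LE file whose first character is U+0000 would be
# misread as UTF-32-LE by this sketch.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def to_utf8(data: bytes) -> bytes:
    """Transcode BOM-marked data to UTF-8; leave everything else untouched."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        return data
    return data[skip:].decode(encoding).encode("utf-8")


sample_le = codecs.BOM_UTF16_LE + "Hello\n".encode("utf-16-le")
sample_be = codecs.BOM_UTF16_BE + "Hello\n".encode("utf-16-be")
print(sniff_bom(sample_le), to_utf8(sample_be))
```

Files without a BOM are passed through unchanged, so this only covers the "with signature" variants mentioned in the issue.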

tl;dr - Encodings other than UTF-8 are fundamentally anti-unix. Just save your files as UTF-8.

@Nephyrin
Author

I'm not sure that POSIX terminal encoding being UTF-8 discounts the existence of use cases for UTF-16 files, though they're definitely rare. I mean, sure, supporting another Unicode encoding is a slippery slope to a Navajo translation layer, but UTF-16 files are the preferred format for non-English localization files, which happen to be part of our codebase. Not all use cases for grep-like tools are on small git repositories you have full control over :(

It would be nice, at least, if encountering a BOM emitted a warning that the file was skipped instead of silently searching it as UTF-8. When I originally filed this bug, it was because I spent some time confused as to why I couldn't locate a string that I knew was there, before I figured out that it was in a localization file that ag was silently ignoring.

I do agree that it is a pretty low priority request.
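
A minimal sketch of the warning idea, assuming a hypothetical `bom_warning` helper that reads only the first two bytes of each file. The message wording and the temporary file names are invented for illustration; this is not ag's behaviour.

```python
import codecs
import os
import sys
import tempfile
from typing import Optional

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def bom_warning(path: str) -> Optional[str]:
    """Return a warning message if the file starts with a UTF-16 BOM, else None."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    # Note: a UTF-32-LE BOM also starts with FF FE; this sketch does not
    # tell the two apart, which is harmless for a "file skipped" warning.
    if head in UTF16_BOMS:
        return f"warning: {path}: UTF-16 byte order mark found, file skipped"
    return None


with tempfile.TemporaryDirectory() as tmp:
    utf16_path = os.path.join(tmp, "strings_utf16.txt")
    utf8_path = os.path.join(tmp, "strings_utf8.txt")
    with open(utf16_path, "wb") as out:
        out.write(codecs.BOM_UTF16_LE + "Hello\n".encode("utf-16-le"))
    with open(utf8_path, "wb") as out:
        out.write(b"Hello\n")

    utf16_msg = bom_warning(utf16_path)
    utf8_msg = bom_warning(utf8_path)

for message in (utf16_msg, utf8_msg):
    if message is not None:
        print(message, file=sys.stderr)
```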

@JFLarvoire
Contributor

JFLarvoire commented Mar 7, 2017

> Now we need some function, convert_any_encoding_to_utf8()

I don't think this is the right way to do that. The performance hit would be huge.
Instead, it'd be better to do it the other way around: every time a new file encoding is encountered, transcode the search string into that encoding, compile the regexp, and run it against the file. Thereafter, we'd reuse whichever compiled regexp matches the encoding of each subsequent file.
This would be particularly useful for the Windows version of AG, which is likely to encounter 3 different text file encodings: ANSI, UTF-8 and UTF-16. Requiring Windows users to only use UTF-8 is not realistic.
Now UTF-16 is a special case that will cause further problems of its own: to run 16-bit regular expressions, we need the 16-bit version of PCRE. So ag would have to be linked with both the 8-bit and 16-bit versions of PCRE. This will require using prefixes to avoid name collisions.
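
The "transcode the search string, not the file" approach can be sketched for the simplest case, a literal search string. This is only an illustration under stated assumptions: the `UNIT_SIZE` table, `encoded_needle`, and `find_literal` are hypothetical names, `cp1252` stands in for "ANSI", and a real regular expression would need a 16-bit-aware engine as described above.

```python
from functools import lru_cache

# Code unit size in bytes for each encoding this sketch knows about.
UNIT_SIZE = {"utf-8": 1, "cp1252": 1, "utf-16-le": 2, "utf-16-be": 2}


@lru_cache(maxsize=None)
def encoded_needle(needle: str, encoding: str) -> bytes:
    """Transcode the search string once per encoding and cache the result."""
    return needle.encode(encoding)


def find_literal(needle: str, data: bytes, encoding: str) -> int:
    """Return the byte offset of the first aligned match, or -1 if none."""
    pattern = encoded_needle(needle, encoding)
    unit = UNIT_SIZE[encoding]
    pos = data.find(pattern)
    while pos != -1:
        # In UTF-16 a hit at an odd byte offset straddles two code units
        # and is not a real match, so keep looking.
        if pos % unit == 0:
            return pos
        pos = data.find(pattern, pos + 1)
    return -1


utf16_offset = find_literal("Hello", "say Hello".encode("utf-16-le"), "utf-16-le")
utf8_offset = find_literal("Hello", b"say Hello", "utf-8")
ansi_offset = find_literal("café", "un café".encode("cp1252"), "cp1252")
# The bytes 41 00 appear at offset 1 here, but the text contains no "A".
misaligned = find_literal("A", "\u4100\u0100".encode("utf-16-le"), "utf-16-le")
print(utf16_offset, utf8_offset, ansi_offset, misaligned)
```

The file contents are never converted; only the short search string is, which is the performance argument made in this comment.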


3 participants