finding zero bytes in utf-16 encoded files #1207
Comments
I think ripgrep is seeing the BOM and properly decoding the test as utf-16 (hence no null bytes). Try using the
|
@lespea I'm pretty sure that's the case: removing the BOM makes the command find the bytes, because the file is then treated as plain binary. I don't know how to get around it, though. Using |
Interesting issue. The
The reason this is happening is that ripgrep is indeed detecting the BOM and transcoding your UTF-16 to UTF-8, which gets rid of all NUL bytes in this case. ripgrep does not expose any options to override this behavior. Even if you set
I suspect the way to fix this is to allow one to specify
The only work-around available to you at the moment, as far as I know, is stripping the BOM:
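To illustrate the behavior described above, here is a minimal Python sketch (not ripgrep's own code) of why transcoding hides the NUL bytes and why dropping the BOM brings them back. It assumes a UTF-16 LE file containing the text `test`; the helper name `strip_utf16le_bom` is hypothetical.

```python
import codecs

# A UTF-16 LE file containing "test", preceded by the BOM (FF FE).
raw = codecs.BOM_UTF16_LE + "test".encode("utf-16-le")

# What a BOM-aware reader effectively searches: the text transcoded to UTF-8.
transcoded = raw.decode("utf-16").encode("utf-8")


def strip_utf16le_bom(data: bytes) -> bytes:
    """Drop a leading UTF-16 LE BOM so the rest is seen as raw bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):]
    return data


# NUL bytes visible after transcoding versus in the BOM-less raw payload.
print(transcoded.count(b"\x00"))
print(strip_utf16le_bom(raw).count(b"\x00"))
```

Once the BOM is gone, nothing signals that the data is UTF-16, so the remaining bytes, including every 00, are searched as-is.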
|
@BurntSushi Thanks for explaining. I understand my case is a little bit out of the normal usage of ripgrep - I'm processing text files that may be corrupted (UTF8, UTF16 ... or a mix of them, don't ask). Anyway I think |
Yes, I agree. ripgrep should be able to handle this use case. There should be a way to override transcoding so that you can treat even completely valid UTF-16 as arbitrary bytes. |
@BurntSushi Do you consider it a good first task? Or would it require some major rewrite? I'd like to contribute. |
@LesnyRumcajs Ah yes, great idea! This is probably a decent first task, although there is a fair bit of plumbing. The high level idea is to add support for completely disabling transcoding support. This change will require changes to three different crates. (Such is the cost for splitting ripgrep's functionality out so that others can use it.)
You'll need to add support for a new
At this point, you'll get some compiler errors because the callers of the
Finally, add a new test covering this feature to ripgrep's integration tests for new features. If you need help with this let me know, but I think just following some of the pattern of other examples should be good. Writing all that out makes it seem like a fair bit of work, but I think it's doable! |
@BurntSushi Woah, thanks a lot for the hints - you saved me several hours of grinding my teeth and potential PR rejects! I'll get to it. |
This makes it possible to use the transcoder to pass through its bytes unconditionally without any transcoding. This is the same as not using it at all, but makes consumer code organization a bit simpler if this is linked back to a runtime configuration option. This addresses part of the work toward completing BurntSushi/ripgrep#1207
This brings in a new API for disabling BOM sniffing. This is part of the work toward completing #1207
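As a rough illustration of the behavior these changes aim for, the Python sketch below models a reader with a switch that turns BOM sniffing off so that the bytes pass through untouched. It is not the Rust API of any of the crates involved: `read_for_search` and its `bom_sniffing` parameter are hypothetical names chosen for this example, and only UTF-16 BOMs are handled.

```python
import codecs


def read_for_search(data: bytes, bom_sniffing: bool = True) -> bytes:
    """Return the bytes a searcher would see (hypothetical helper).

    With bom_sniffing enabled, a leading UTF-16 BOM triggers transcoding
    to UTF-8. With it disabled, the input is passed through unchanged.
    """
    if bom_sniffing:
        for bom, encoding in (
            (codecs.BOM_UTF16_LE, "utf-16-le"),
            (codecs.BOM_UTF16_BE, "utf-16-be"),
        ):
            if data.startswith(bom):
                # Decode the payload after the BOM; invalid sequences
                # become U+FFFD instead of raising an error.
                text = data[len(bom):].decode(encoding, errors="replace")
                return text.encode("utf-8")
    return data


sample = codecs.BOM_UTF16_LE + "test".encode("utf-16-le")
print(b"\x00" in read_for_search(sample))
print(b"\x00" in read_for_search(sample, bom_sniffing=False))
```

Keeping the switch inside the reader, rather than having callers skip the reader entirely, mirrors the point made above: the consumer's code path stays the same and only a runtime option changes.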
What version of ripgrep are you using?
ripgrep 0.10.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
How did you install ripgrep?
cargo install ripgrep
What operating system are you using ripgrep on?
Fedora 29
Describe your question, feature request, or bug.
I'm struggling to find files that contain 00 bytes. I created a UTF-16 LE file with the text
test
(hexdump)
Given that there are 00 bytes inside, I'm issuing the command
rg -cuuu '(?-u:\x00)'
but get no results at all. It works when searching for
t
, like
From my understanding, the
-uuu
flag along with some UTF escaping should do the trick. It works fine for non-zero bytes. It also works for binary files (I tried a file consisting of a single 00 byte). Am I missing something?
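For reference, the byte layout of such a file can be reproduced with a short Python sketch. It assumes the file holds exactly the text `test` encoded as UTF-16 LE with a leading BOM and no trailing newline, which may differ from the file used in the report.

```python
import codecs

# Assumed contents: BOM (FF FE) followed by "test" in UTF-16 LE.
data = codecs.BOM_UTF16_LE + "test".encode("utf-16-le")

# Space-separated hex, similar to what a hexdump tool would show.
hex_view = data.hex(" ")
print(hex_view)  # ff fe 74 00 65 00 73 00 74 00
```

Each ASCII character occupies two bytes in UTF-16 LE, with 00 as the second byte, which is where the 00 bytes in the file come from.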