Assumes files with non-utf-8 encodings are binary #22

Open
benogle opened this issue Jul 9, 2015 · 3 comments

Comments

@benogle

benogle commented Jul 9, 2015

It detects these files as binary:

https://github.com/benogle/encodings/blob/master/big5.txt
https://github.com/benogle/encodings/blob/master/bom_utf-16le.txt

And likely many others in https://github.com/benogle/encodings

@gjtorikian
Owner

Damn, this looks like a fun problem. I wish you hadn't opened this. 😣

The problem can be traced to this line. UTF-16/32 BE and LE were each easy enough to add. The Big5, GB, and KR files have some crazy byte detection, and use the 0x00 code point, so they're being unfairly flagged as binary.
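
To illustrate why the BOM-marked UTF-16/32 cases are the easy ones, here is a minimal Python sketch of a check that looks for a byte order mark before applying a NUL-byte test. It is not this library's code: `sniff_bom` and `looks_binary` are hypothetical names, and treating any NUL byte as a sign of binary content is an assumed, deliberately naive rule.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so order matters.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def looks_binary(head: bytes) -> bool:
    """Assumed naive rule: a NUL byte means binary unless a BOM says otherwise."""
    if sniff_bom(head) is not None:
        return False
    return b"\x00" in head


with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
without_bom = "hi".encode("utf-16-le")
print(looks_binary(with_bom), looks_binary(without_bom))
```

A BOM only helps for files that carry one. Encodings such as Big5 have no marker to sniff, so they would need a different kind of check.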

@gjtorikian removed their assignment Jul 30, 2015
@alexkozy

alexkozy commented Feb 9, 2018

https://github.com/ashtuchkin/iconv-lite/blob/master/encodings/tables/big5-added.json
Here is one more file that is detected as binary. The first 32 bytes of this file contain 10.15625% suspicious bytes.
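
The percentage above points to a ratio-style heuristic: count "suspicious" bytes in a leading window and flag the file once they pass a threshold. The Python sketch below shows how such a check could behave. Everything in it is an assumption for illustration, not taken from the library: the definition of "suspicious" (control bytes and any byte at or above 0x80), the 512-byte window, and the 10% threshold.

```python
# Tab, line feed and carriage return are ordinary in text files.
ALLOWED_CONTROL = frozenset({0x09, 0x0A, 0x0D})


def is_suspicious(byte: int) -> bool:
    """Assumed rule: high bytes and most control bytes count as suspicious."""
    if byte >= 0x80:
        return True
    return (byte < 0x20 and byte not in ALLOWED_CONTROL) or byte == 0x7F


def suspicious_ratio(data: bytes, window: int = 512) -> float:
    """Fraction of suspicious bytes in the first `window` bytes."""
    head = data[:window]
    if not head:
        return 0.0
    return sum(is_suspicious(b) for b in head) / len(head)


def flagged_as_binary(data: bytes, threshold: float = 0.10, window: int = 512) -> bool:
    """Hypothetical decision: flag when the ratio exceeds the threshold."""
    return suspicious_ratio(data, window) > threshold


big5_sample = "中文".encode("big5")  # legacy multi-byte encoding, no BOM
print(flagged_as_binary(big5_sample), flagged_as_binary(b"plain ASCII text\n"))
```

Under these assumptions, any text made mostly of multi-byte characters in a legacy encoding crosses the threshold, which matches the kind of false positive reported in this issue.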

@SamVerschueren

Also, a file containing an emoji is flagged as binary:

This is a text file 🌈
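
For reference, 🌈 (U+1F308) encodes to the four UTF-8 bytes F0 9F 8C 88, so a short line like the one above carries a high share of non-ASCII bytes. The Python sketch below measures that share and shows one possible guard: accept the buffer as text if it decodes as strict UTF-8 and holds no NUL byte. `looks_like_utf8_text` is a hypothetical helper, and whether the false positive here really comes from 4-byte sequences is an assumption.

```python
def looks_like_utf8_text(data: bytes) -> bool:
    """Hypothetical guard: strict UTF-8 decode succeeds and no NUL byte is present."""
    if b"\x00" in data:
        return False  # NUL is valid UTF-8 but rare in real text files
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


sample = "This is a text file 🌈".encode("utf-8")
high_bytes = sum(1 for b in sample if b >= 0x80)

# Share of non-ASCII bytes in the sample line.
print(len(sample), high_bytes, high_bytes / len(sample))
print(looks_like_utf8_text(sample))
```

One caveat with this approach: if only the first N bytes of a file are inspected, the window can cut a multi-byte sequence in half, so a real check would need to tolerate an incomplete sequence at the end of the buffer.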
