❇️ Small improvement over Unicode detection #93

Merged
merged 1 commit into from
Sep 13, 2021

@Ousret Ousret commented Sep 13, 2021

Tiny improvement that allows small Unicode payloads to be detected more accurately.

from charset_normalizer import from_bytes

payload = "😀".encode("utf_8")
results = from_bytes(payload)

print(results.best().encoding)  # prints "utf_8"

chardet has trouble with these cases. Any Unicode payload containing emoticons makes its detection fail. Their prober probably needs to be retrained with more up-to-date statistics.

>>> import chardet
>>> chardet.detect("😀".encode("utf_8"))
{'encoding': None, 'confidence': 0.0, 'language': None}
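
To reproduce the comparison on your own machine, a small helper along these lines may be useful. It is only a sketch: `detect_with` is a hypothetical name, it assumes both packages are installed and expose a top-level `detect(bytes)` function, and the answers will depend on the installed versions.

```python
import importlib


def detect_with(module_name, payload):
    """Run `<module_name>.detect(payload)`, or return None if the package is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return module.detect(payload)


payload = "😀".encode("utf_8")

# Both libraries return a dict with 'encoding', 'language' and 'confidence' keys.
for name in ("chardet", "charset_normalizer"):
    print(name, detect_with(name, payload))
```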

That is a tiny improvement, but a noticeable one.

@Ousret Ousret added the enhancement and detection labels Sep 13, 2021
@Ousret Ousret commented Sep 13, 2021

This will land in 2.0.5

Prior to that, charset_normalizer answered incorrectly for a payload containing a single emoticon:

{'encoding': 'cp1125', 'language': 'Russian', 'confidence': 1.0}
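
To illustrate why a legacy code page can look plausible for such a short payload, the standard-library sketch below decodes the same four bytes both ways. It only shows what the bytes look like under each codec; it makes no claim about how charset_normalizer scores candidates internally.

```python
import unicodedata

payload = "😀".encode("utf_8")  # four bytes: f0 9f 98 80

# Strict UTF-8 decoding round-trips to the original emoji.
as_utf8 = payload.decode("utf_8")

# cp1125 is a single-byte code page, so each byte maps to one character.
as_cp1125 = payload.decode("cp1125")

print(repr(as_utf8), [unicodedata.name(ch) for ch in as_utf8])
print(repr(as_cp1125), [unicodedata.name(ch) for ch in as_cp1125])
```

Under cp1125 the four bytes decode to four Cyrillic letters, which is consistent with the old "Russian" guess. With so few bytes there is very little evidence to tell the two readings apart.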

@Ousret Ousret merged commit 4109ccd into master Sep 13, 2021
@Ousret Ousret deleted the improve-detection branch September 13, 2021 21:48
@Ousret Ousret mentioned this pull request Sep 14, 2021
Labels
detection (Related to the charset detection mechanism, chaos/mess/coherence), enhancement (New feature or request)