❇️ Small improvement over Unicode detection #93

Ousret · 2021-09-13T21:35:33Z

Tiny improvement that allows small Unicode payloads to be detected more accurately.

from charset_normalizer import from_bytes

# A tiny payload: a single emoji encoded as UTF-8
payload = "😀".encode("utf_8")

results = from_bytes(payload)

print(results.best().encoding)  # prints "utf_8"

chardet has trouble with those cases: any Unicode payload containing emoticons makes the detection fail. Its probers probably need to be retrained with more up-to-date statistics.

>>> import chardet
>>> chardet.detect("😀".encode("utf_8"))
{'encoding': None, 'confidence': 0.0, 'language': None}

It is a tiny improvement, but a noticeable one.

Ousret · 2021-09-13T21:38:13Z

This will land in 2.0.5

Prior to that, charset_normalizer answered incorrectly when given a single emoticon:

{'encoding': 'cp1125', 'language': 'Russian', 'confidence': 1.0}
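To see why a single-byte code page is a tempting wrong answer for such a short payload, here is a minimal sketch using only Python's standard codecs. It does not reproduce charset_normalizer's internal logic; it only shows that the four UTF-8 bytes of the emoji also decode without error as cp1125, each byte mapping to a Cyrillic letter. The variable names are illustrative.

```python
# The UTF-8 encoding of a single emoji is four bytes, all >= 0x80.
payload = "😀".encode("utf_8")

# Decoded as UTF-8, the four bytes form one code point.
as_utf8 = payload.decode("utf_8")

# Decoded as cp1125 (a single-byte Cyrillic code page), the same bytes
# are also valid: every byte maps to its own character, so no decoding
# error is raised that could rule this candidate out.
as_cp1125 = payload.decode("cp1125")

print(len(payload), "bytes")
print("utf_8  ->", repr(as_utf8), "characters:", len(as_utf8))
print("cp1125 ->", repr(as_cp1125), "characters:", len(as_cp1125))
```

Because both decodings succeed, a detector cannot rely on decoding errors alone and has to weigh how plausible each result looks, which is hard with only four bytes of evidence.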

❇️ Small improvement over Unicode detection

58a5b84

Ousret added the enhancement (New feature or request) and detection (Related to the charset detection mechanism, chaos/mess/coherence) labels Sep 13, 2021

Ousret merged commit 4109ccd into master Sep 13, 2021

Ousret deleted the improve-detection branch September 13, 2021 21:48

Ousret mentioned this pull request Sep 14, 2021

Version 2.0.5 #98

Merged

Ousret mentioned this pull request Oct 4, 2021

Fix encoding error with non-prettified encoded responses httpie/cli#1168

Merged
