[BUG] Incorrect encoding detected in 3.3.1 #371
Comments
jefferyto added the bug and help wanted labels on Oct 26, 2023
I can reproduce this, and I am working on a fix.
Ousret added the detection label and removed the help wanted label on Oct 31, 2023
A solution was found for the first one. The second one is a little more problematic, but it no longer returns cp1257.
Ousret added a commit that referenced this issue on Oct 31, 2023:
) (#378) and added noise (md) probe that identify malformed arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)
I'm updating the charset-normalizer package in OpenWrt (with Python 3.11.6) and tried the example in https://charset-normalizer.readthedocs.io/en/latest/user/handling_result.html#handling-result:
In 3.3.0 this would print cp1251, but in 3.3.1 this prints cp1257 (str(result) returns 'Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå.').
I also tried the French phrase from https://charset-normalizer.readthedocs.io/en/latest/index.html#introduction, and from_bytes(my_byte_str).best() also has the encoding cp1257.
I have compiled the package for arm, aarch64 and x86_64 and I get the same results.
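For readers who want to see why the two results differ, here is a minimal sketch that uses only Python's standard codecs, not charset-normalizer itself. It starts from the mis-decoded string quoted in the report, re-encodes it as cp1257 to get the underlying bytes back, and decodes those bytes as cp1251. The variable names are illustrative. The point is that both single-byte code pages decode the same bytes without raising an error, so a detector has to pick between them heuristically.

```python
# The mis-decoded text quoted in the report (the cp1257 reading from 3.3.1).
garbled = "Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå."

# Re-encode with the wrongly detected code page to recover the raw bytes.
raw_bytes = garbled.encode("cp1257")

# Decode the same bytes with the code page that 3.3.0 reported.
recovered = raw_bytes.decode("cp1251")

# Both decodes succeed; only one of them is readable Cyrillic text.
print(recovered)
print(raw_bytes.decode("cp1257") == garbled)
```

If the printed text is readable Bulgarian, that suggests the bytes were cp1251 all along and that the regression lies in how candidate encodings are ranked, not in the decoding step.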