[BUG] Windows-1252 encoding is not detected in turkish text #407

Closed
milahu opened this issue Dec 30, 2023 · 3 comments
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence

milahu commented Dec 30, 2023

charset_normalizer returns None

$ chardetect star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: Windows-1252 with confidence 0.73

$ file -i star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: application/x-subrip; charset=iso-8859-1

$ python -c "import charset_normalizer; print(charset_normalizer.from_path('star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt').best())"
None

who is right? chardetect is right! the expected encoding is Windows-1252

decoding as iso-8859-1 produces an ugly <U+0085> when piped to less (0085 as utf16 hex bytes),
or c2 85 as utf8 hex bytes

unicode-explorer.com/c/0085

U+0085: The "Next Line" (NEL) control character was used in the 1970s for controlling printers and displays (e.g. VT100). Moves to the first position of the next line.

--- star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.iso-8859-1
+++ star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.Windows-1252
@@ -2242,7 +2242,7 @@
 
 505
 00:43:04,098 --> 00:43:05,428
-Adil davranmaktan bahsetmiþken<U+0085>
+Adil davranmaktan bahsetmiþken…
 
 506
 00:43:06,771 --> 00:43:09,777
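
The one-byte difference in the diff above can be reproduced with Python's built-in codecs. This is a minimal sketch using only the standard library; the names `raw` and `decoded` are illustrative and do not come from the report.

```python
# Decode the single byte 0x85 with the codecs discussed in this issue.
raw = b"\x85"

decoded = {}
for codec in ("iso-8859-1", "cp1252", "cp1254"):
    text = raw.decode(codec)
    decoded[codec] = text
    # Show the resulting code point and its UTF-8 encoding in hex.
    print(f"{codec:>10}: U+{ord(text):04X}  utf-8={text.encode('utf-8').hex()}")

# iso-8859-1 maps 0x85 to the C1 control U+0085 (UTF-8: c285),
# while cp1252 and cp1254 both map it to U+2026, the ellipsis (UTF-8: e280a6).
```

Since cp1252 and cp1254 agree on this byte, it cannot by itself tell those two code pages apart; it only rules out a clean iso-8859-1 reading.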

input file

star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt

milahu added the labels bug (Something isn't working) and help wanted (Extra attention is needed) on Dec 30, 2023
Ousret added the label detection (Related to the charset detection mechanism, chaos/mess/coherence) and removed the labels bug and help wanted on Jan 2, 2024
Ousret (member) commented Jan 2, 2024

OK, noted. Will try to improve this case for the next minor.

milahu (author) commented Jan 2, 2024

Adil davranmaktan bahsetmiþken…

it really is just that one byte that breaks charset_normalizer

$ printf '\x85' | iconv -f cp1254 -t utf8
…

when i remove that byte, the encoding cp1254 is found
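
To see how the whole subtitle line reads under each candidate code page, the sketch below rebuilds it as raw bytes and decodes it three ways. The byte string is an assumption reconstructed from the text shown in this issue (0xFE for the character displayed as "þ", 0x85 for the trailing "…"); it is not read from the attached file.

```python
# The subtitle line from the report, rebuilt as raw bytes (assumed, see above).
line = b"Adil davranmaktan bahsetmi\xfeken\x85"

as_cp1252 = line.decode("cp1252")
as_cp1254 = line.decode("cp1254")
as_latin1 = line.decode("iso-8859-1")

print(as_cp1252)            # Adil davranmaktan bahsetmiþken…
print(as_cp1254)            # Adil davranmaktan bahsetmişken…
print(repr(as_latin1[-1]))  # '\x85'
```

Only the cp1254 reading turns 0xFE into "ş", which appears to be the intended Turkish spelling, and both Windows code pages render 0x85 as an ellipsis rather than a control character.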

Ousret (member) commented Sep 25, 2024

We fixed that case. It will be available in the next release.

Ousret closed this as completed on Sep 25, 2024
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector from searching for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything other than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedback. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))