Home
This wiki is meant to give you a short overview of how things work under the hood. It is not a detailed description but rather an approximation and simplification, and it does not aim to be 100% accurate.
When a text file is opened as UTF-8, every byte sequence that is not valid UTF-8 is turned into a slightly weird-looking question mark (�), the Unicode replacement character U+FFFD. That's what we take advantage of. If we read a file as UTF-8 and find any of those question marks, we conclude that it is not UTF-8. On the other hand, if we don't find any strange-looking question marks, we treat the file as UTF-8 encoded and we have found the encoding.
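The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual code; the function name `looks_like_utf8` is made up for this example.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "weird-looking question mark"


def looks_like_utf8(data: bytes) -> bool:
    """Return True if decoding as UTF-8 produces no replacement characters."""
    # errors="replace" substitutes U+FFFD for every invalid byte sequence
    # instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", errors="replace")
    return REPLACEMENT_CHAR not in text


print(looks_like_utf8("héllo".encode("utf-8")))       # valid UTF-8 bytes
print(looks_like_utf8("héllo".encode("iso-8859-1")))  # 0xE9 alone is invalid UTF-8
```

One caveat of this simplification: a file that is valid UTF-8 but genuinely contains U+FFFD would be rejected by this sketch, and a pure ASCII file is accepted because ASCII is a subset of UTF-8.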
If we find out that the encoding is not UTF-8, we first try to determine the language and then use that information to assign the appropriate encoding. Once we know the language and that the file is not UTF-8, we can simply look at an encoding table and read off the encoding.
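A minimal sketch of such a table lookup might look like this. The table contents and the names `ILLUSTRATIVE_ENCODING_TABLE` and `encoding_for` are assumptions made for this example; they list common legacy encodings for a few languages and are not the table this project uses.

```python
from typing import Optional

# Hypothetical table: common legacy encodings per language (illustration only).
ILLUSTRATIVE_ENCODING_TABLE = {
    "english": "windows-1252",
    "french": "windows-1252",
    "polish": "windows-1250",
    "russian": "windows-1251",
    "greek": "windows-1253",
}


def encoding_for(language: str, is_utf8: bool) -> Optional[str]:
    """Pick an encoding from the detected language, unless UTF-8 was already found."""
    if is_utf8:
        return "utf-8"
    # None means the language is not in the table.
    return ILLUSTRATIVE_ENCODING_TABLE.get(language)
```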
To detect the language, files are read either as UTF-8 or as ISO-8859-1, depending on whether UTF-8 was detected in the previous step.
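As a rough sketch, the choice between the two encodings could be written as follows. The function name `decode_for_language_detection` is hypothetical. ISO-8859-1 is a convenient fallback because it maps every possible byte to a character, so decoding never fails.

```python
def decode_for_language_detection(data: bytes, is_utf8: bool) -> str:
    """Decode the raw bytes so the text can be scanned for marker words."""
    encoding = "utf-8" if is_utf8 else "iso-8859-1"
    return data.decode(encoding, errors="replace")
```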
Files are scanned for specific words that are unique to one language. Each language has one to three of those words. When a file contains the word "the", for example, it's a strong indication that the language is English. If we find "the" 150 times and are unable to find words that indicate other languages, we can be pretty sure that the file is written in English.
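The counting step could be sketched like this. The word list only contains the two examples mentioned on this page, and `MARKER_WORDS` and `count_matches` are names invented for the illustration; the real word lists and matching rules may differ.

```python
import re
from typing import Dict

# Hypothetical marker words, taken from the examples on this page.
MARKER_WORDS = {
    "english": ["the"],
    "french": ["c'est"],
}


def count_matches(text: str) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each language's marker words."""
    lowered = text.lower()
    counts = {}
    for language, words in MARKER_WORDS.items():
        total = 0
        for word in words:
            # Lookarounds keep "the" from matching inside "other" or "then".
            pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
            total += len(re.findall(pattern, lowered))
        counts[language] = total
    return counts


print(count_matches("The cat and the dog; c'est la vie."))
```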
The words for each language are carefully chosen to make sure that they appear with a similar frequency. If "the" appears on average 150 times in an English text with 30000 characters, we need to make sure that "c'est", which is a strong indication for French, appears about 150 times as well in a typical French text with 30000 characters.
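One way to compare candidate words across sample texts of different lengths is to scale each count to the same 30000-character reference. The helper below is an assumption about how such a comparison could be done, not a description of how the word lists were actually built.

```python
def matches_per_30000_chars(match_count: int, text_length: int) -> float:
    """Scale a raw match count to a 30000-character text."""
    if text_length == 0:
        return 0.0
    return match_count * 30000 / text_length


# 50 matches in a 10000-character sample correspond to 150 per 30000 characters.
print(matches_per_30000_chars(50, 10000))
```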
After counting the matches for each language, we assume that the language with the most matches is the language that the file is written in. What we still need to determine is how likely it is that our assumption is true. That's where the confidence score comes in.
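Picking the language with the most matches can be sketched as below. The function name `most_likely_language` is hypothetical, and the sketch deliberately stops before the confidence score, since its calculation is not described on this page.

```python
from typing import Dict, Optional


def most_likely_language(counts: Dict[str, int]) -> Optional[str]:
    """Return the language with the highest match count, or None if nothing matched."""
    if not counts or max(counts.values()) == 0:
        return None
    # On a tie, max() returns the first language in the dict's insertion order.
    return max(counts, key=counts.get)


print(most_likely_language({"english": 150, "french": 3}))
```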