Home
This wiki is meant to give you a short overview of how things work under the hood. It is not a detailed description but rather an approximation and simplification, with no aim of being 100% accurate.
When a text file is read as UTF-8, every byte sequence that is not valid UTF-8 is replaced with a slightly weird-looking question mark (�), the so-called replacement character. That's what we take advantage of. If we read a file as UTF-8 and find any of those question marks, we know for sure that it is not UTF-8. On the other hand, if we don't find any strange-looking question marks, we conclude that the file must be encoded with UTF-8 and we have found the encoding.
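As a minimal sketch of this check in Python (not the library's actual code): `errors="replace"` turns every invalid byte sequence into exactly that replacement character, so we only need to look for it afterwards.

```python
def looks_like_utf8(path: str) -> bool:
    """Return True if the file decodes as UTF-8 without any replacement characters."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # "\ufffd" is the replacement character rendered as � above.
    return "\ufffd" not in text
```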
If the encoding turns out not to be UTF-8, we first try to determine the language and then use that information to assign the appropriate encoding. Because once we know the language and that the file is not UTF-8, we can simply look up the encoding in a table and read it off.
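Such a lookup can be as simple as a dictionary. The languages and encodings below are illustrative assumptions, not the library's actual table:

```python
# Hypothetical language-to-encoding table; the real mapping may differ.
LANGUAGE_TO_ENCODING = {
    "russian": "windows-1251",
    "greek": "windows-1253",
    "hebrew": "windows-1255",
}
```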
To detect the language, files are read either as UTF-8 or as ISO-8859-1, depending on whether UTF-8 was detected in the previous step.
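In Python terms, that choice might look like the sketch below (again, not the library's code). ISO-8859-1 maps every possible byte to some character, so this read never fails even on non-UTF-8 data:

```python
def read_for_language_detection(path: str, is_utf8: bool) -> str:
    """Read the file with UTF-8 if detected, otherwise fall back to ISO-8859-1."""
    encoding = "utf-8" if is_utf8 else "iso-8859-1"
    with open(path, encoding=encoding) as f:
        return f.read()
```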
Files are scanned for specific words that are unique to a single language. Each language has one to three of those words. When a file contains the word "the", for example, that is a strong indication that the language is English. If we find "the" 150 times and no words that indicate other languages, we can be pretty sure that the file is written in English.
The words for each language are carefully chosen to make sure that they appear with a similar frequency. If "the" appears on average 150 times in an English text with 30000 characters, we need to make sure that "c'est", which is a strong indication for French, also appears about 150 times in a typical French text with 30000 characters.
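Counting those matches could look like the following sketch. The marker words here are made up for illustration; the library's real word lists are chosen per language as described above:

```python
import re

# Illustrative marker words only; each real language has one to three of them.
MARKER_WORDS = {
    "english": ["the"],
    "french": ["c'est"],
}

def count_matches(text: str) -> dict[str, int]:
    """Count how often each language's marker words occur in the text."""
    return {
        language: sum(
            len(re.findall(rf"\b{re.escape(word)}\b", text, re.IGNORECASE))
            for word in words
        )
        for language, words in MARKER_WORDS.items()
    }
```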
After counting the matches for each language we assume that the language with the most matches must be the language that the file is written in. What we still need to determine is the likelihood for our assumption to be true. That's where the confidence score comes in.
Determining the final confidence score involves two steps: first we calculate the language ratio, then we adjust it based on how frequent the matches are.
To calculate the language ratio we compare the two languages that have the most matches. If we find 150 English matches, 12 Chinese matches, and 2 Japanese matches, we only compare English and Chinese. Of those two languages, English takes up about 93% and Chinese 7%. Hence the language ratio will be 0.93.
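A minimal sketch of that ratio, once more not the library's actual code:

```python
def language_ratio(counts: dict[str, int]) -> float:
    """Share of the strongest language among the two strongest languages."""
    top_two = sorted(counts.values(), reverse=True)[:2]
    best = top_two[0]
    second = top_two[1] if len(top_two) > 1 else 0
    total = best + second
    return best / total if total else 0.0

# 150 / (150 + 12) is roughly 0.93, as in the example above.
print(language_ratio({"english": 150, "chinese": 12, "japanese": 2}))
```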
The language ratio, however, is not our final score. It is merely a starting point. To prevent inaccuracy, we need to take the frequency of our matches into account.
Let’s look at an example to get a better understanding of why frequency is so important. Suppose our text file contains 30000 characters and we’re unable to find any matches except for a single English one. According to the language ratio alone, we can be 100% certain that the text must be English. However, if we consider the frequency, we notice that 1 match in a text of 30000 characters is far fewer than expected: typically we would see around 150 matches in a text of that size. So we have no choice but to decrease our final confidence score significantly.
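One way to fold that in is sketched below. The expected density of 150 matches per 30000 characters is taken from the example above, and the linear scaling is an assumption for illustration, not the library's exact formula:

```python
EXPECTED_MATCHES_PER_CHAR = 150 / 30_000  # assumed density from the example above

def confidence(match_count: int, text_length: int, ratio: float) -> float:
    """Scale the language ratio down when matches are rarer than expected."""
    expected = EXPECTED_MATCHES_PER_CHAR * text_length
    frequency_factor = min(1.0, match_count / expected) if expected else 0.0
    return ratio * frequency_factor

# A single match in 30000 characters drags a perfect ratio of 1.0 down to ~0.007.
print(confidence(match_count=1, text_length=30_000, ratio=1.0))
```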
But why were there so few matches in the first place? There are several possible explanations. For one, somebody might have messed with the encoding so that the text became indecipherable, and the letter combination "the" just happened to appear once in the garbled result. Another explanation for a very low frequency is that the original text file is written in a language, or stored in an encoding, that is not yet supported by our library. That’s why frequency is so important when determining the confidence score.