How it works

This wiki page gives you a short overview of how things work under the hood. It is not a detailed description but rather an approximation and simplification, with no claim of being 100% accurate.

Encoding Detection

Checking for Unicode:

Byte Order Mark

The first thing we try is using the byte order mark (BOM) to detect the encoding: some Unicode encodings begin with a fixed byte sequence that identifies them. If this approach yields no result, we continue with the next technique.
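Below is a minimal sketch of what BOM detection can look like in Node.js, assuming the file's raw bytes are already available in a Buffer. It illustrates the idea and is not the library's exact implementation:

```js
// Check the first bytes of the file for a known byte order mark.
function detectBom(buffer) {
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    return "UTF-8";
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    return "UTF-16LE";
  }
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
    return "UTF-16BE";
  }
  return null; // no BOM found, fall through to the next technique
}
```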

UTF-8

When Node.js reads a text file as UTF-8, every byte sequence that is not valid UTF-8 is replaced with the replacement character (�, U+FFFD), a slightly weird-looking question mark. You can read more about it on nodejs.org. That's what we take advantage of: if we read a file as UTF-8 and find any of those question marks, we know for sure that it is not UTF-8. On the other hand, if we don't find any, we assume that the file is encoded in UTF-8 and we have found the encoding.
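Here is a minimal sketch of that check, again assuming the file's bytes are in a Buffer:

```js
// Decode the bytes as UTF-8; Node.js inserts U+FFFD (�) wherever the
// input is not valid UTF-8.
function isUtf8(buffer) {
  const text = buffer.toString("utf-8");
  return !text.includes("\ufffd"); // no replacement character => treat as UTF-8
}
```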

Determining other Encodings:

If we find that the encoding is none of the above, we first try to determine the language and then assign the appropriate encoding, because once we know the language we can simply look at an encoding table and read off the encoding.
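Such a table could look like the following. The language-to-encoding mappings here are purely illustrative; the library's actual table may differ:

```js
// Hypothetical mapping from detected language to its typical legacy encoding.
const encodingByLanguage = {
  english: "ISO-8859-1",
  french: "ISO-8859-1",
  russian: "windows-1251",
  japanese: "Shift_JIS",
  chinese: "GB18030",
  korean: "EUC-KR",
};
```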

Language Detection

To detect the language, files are read either as Unicode or as ISO-8859-1, which is called latin1 in Node.js.
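Reading a file as latin1 uses the standard fs API (the file name here is just a placeholder):

```js
const fs = require("fs");

// Read the raw bytes and interpret them as ISO-8859-1 / latin1.
const text = fs.readFileSync("example.txt", { encoding: "latin1" });
```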

Files are scanned for specific words that are unique to one language. Each language has one to three of those words. When a file contains the word "the", for example, that's a strong indication that the language is English. If we can find "the" 150 times and we're unable to find words that indicate other languages, we can be pretty sure that the file is written in English.
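A minimal sketch of this counting step might look as follows. The marker words here are made up for illustration; the library's actual word lists are chosen and calibrated differently:

```js
// Made-up marker words for a few languages.
const markers = {
  english: ["the"],
  french: ["c'est"],
  german: ["und"],
};

// Count how often each language's marker words occur in the text.
function countMatches(text) {
  const counts = {};
  for (const [language, words] of Object.entries(markers)) {
    counts[language] = 0;
    for (const word of words) {
      counts[language] += text.split(word).length - 1;
    }
  }
  return counts; // e.g. { english: 150, french: 0, german: 2 }
}
```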

The words for each language are carefully chosen to make sure that they appear with a similar frequency. If "the" appears on average 150 times in an English text with 30000 characters, we need to make sure that "c'est", which is a strong indication of French, appears about 150 times as well in a typical French text with 30000 characters.

After counting the matches for each language, we assume that the language with the most matches is the language the file is written in. What we still need to determine is how likely our assumption is to be true. That's where the confidence score comes in.

Confidence Score

Unicode

If Unicode was detected with either the byte order mark technique or the UTF-8 technique, the confidence score for that encoding will be 1, which means 100%.

Language Ratio

To calculate the language ratio we compare the two languages that have the most matches. If we find 150 English matches, 12 Chinese matches, and 2 Japanese matches, we only compare English and Chinese. Of those two languages, English accounts for about 93% of the matches (150 out of 162) and Chinese for about 7%. Hence the language ratio will be 0.93.
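A sketch of that calculation, building on the match counts from the earlier example (again an illustration rather than the library's exact code):

```js
// Pick the two languages with the most matches and compute the ratio.
function languageRatio(counts) {
  const sorted = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  const [language, first] = sorted[0];
  const second = sorted.length > 1 ? sorted[1][1] : 0;
  if (first === 0) return { language: null, ratio: 0 }; // no matches at all
  return { language, ratio: first / (first + second) }; // e.g. 150 / 162 ≈ 0.93
}
```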

The language ratio, however, is not our final score. It is merely a starting point. To prevent inaccuracy, we also need to take the frequency of our matches into account.

Frequency Adjustments

Let's look at an example to get a better understanding of why frequency is so important. Suppose our text file contains 30000 characters and we're unable to find any matches except for one English match. According to the language ratio alone, we could be 100% certain that the text is English. However, if we consider the frequency, we notice that 1 match in a text of 30000 characters is very little; typically we would expect around 150 matches in a text of that size. So we have no choice but to decrease our final confidence score significantly.
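One hypothetical way to express this adjustment is to scale the language ratio by how close the observed match count is to the expected one, here assuming roughly 150 matches per 30000 characters; the library's actual penalty may be computed differently:

```js
// Expected matches per character, based on ~150 matches per 30000 characters.
const EXPECTED_MATCHES_PER_CHAR = 150 / 30000;

function adjustForFrequency(ratio, matchCount, textLength) {
  const expected = textLength * EXPECTED_MATCHES_PER_CHAR;
  // Scale down when we saw far fewer matches than expected; cap the
  // factor at 1 so an abundance of matches never inflates the score.
  const factor = Math.min(matchCount / expected, 1);
  return ratio * factor;
}

// 1 match in 30000 characters: the score drops from 1.0 to ~0.007.
adjustForFrequency(1.0, 1, 30000);
```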

But why were there so few matches in the first place? There are several possible explanations. For one, somebody might have messed with the encoding so that the text became indecipherable, and the letter combination "the" just happened to appear once in the whole text, which is why the language ratio points to English. Another explanation for a very low frequency might be that the original file is written in a language or encoding that the library does not yet support. That's why frequency is so important when determining the confidence score.

Note: If the encoding is not Unicode, the confidence scores for the encoding and for the language will be the same.