The Demojibakefier is intended to normalise the character encoding of a given string to a preferred character encoding when the string's byte sequences don't match what the preferred character encoding expects. It's useful in cases where a block of data might be composed of several different unspecified or unknown encodings.
When a byte sequence doesn't conform to the expectations of a particular character encoding, and an attempt is made to render that byte sequence into readable characters using that particular character encoding, it can sometimes result in the appearance of generic replacement characters and "mojibake" (文字化け).
Wikipedia excerpt:
Mojibake means "character transformation" in Japanese. The word is composed of 文字 (moji, IPA: [mod͡ʑi]), "character" and 化け (bake, IPA: [bäke̞], pronounced "bah-keh"), "transform".
Related trivia: The word "emoji" has similar etymology. 😜😀
"Demojibakefier" is a play on the word "mojibake", so named because ideally, it should eliminate, or at least reduce the occurrence of replacement characters, mojibake, etc.
Let's start with some sample code to reproduce a potential use-case (for the purpose of the sample code, please assume that it uses UTF-8 encoding).
<?php
/** Japanese placeholder text generated by <https://lipsum.sugutsukaeru.jp/index.cgi>. */
$TextJA = 'またはに、引用商業と引用するればい一般が特にししことも、
対応ただ、一切としては判断者の著作による法上の問題は満たすことで、
被紹介者も、可能の引用が満たすて事典を執筆しりてくださいたます。
またそのままは、要求要件に決議引きれてください雑誌が仮に要求さ、ルール上に引用さことという、文章の記事によって本文の著作になく向上基づくことに疑わで。
または、フリーで主従をいい受け入れという、そのユースの目的が短い参考しられている裁判の一部を著作さたり、フリー権がアニメにし事項によって、
そのファイル物の可能向上の場合で受信さたり下げ場ます。';
/**
* Output some basic HTML to define the character encoding we're using,
* language, etc.
*/
echo '<!doctype html><html lang="ja-JP"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
/**
* When we echo the placeholder text in its original state, we should see it
* correctly. We'll also add some linebreaks for better readability.
*/
echo $TextJA . '<br /><br />';
/**
* We'll convert our placeholder text from UTF-8 to SHIFT-JIS, and then echo
* it, to intentionally produce output with mixed encodings (something that we
* generally should never, ever do, but we're doing it to provide an example
* for what the Demojibakefier class can do).
*/
echo $TextJA_SHIFTJIS = iconv('UTF-8', 'SHIFT-JIS', $TextJA);
/** Output closing HTML tags. */
echo '</body></html>';
When executing the above sample code via a browser request, it should produce something like this (the latter part being completely unintelligible):
またはに、引用商業と引用するればい一般が特にししことも、 対応ただ、一切としては判断者の著作による法上の問題は満たすことで、 被紹介者も、可能の引用が満たすて事典を執筆しりてくださいたます。 またそのままは、要求要件に決議引きれてください雑誌が仮に要求さ、ルール上に引用さことという、文章の記事によって本文の著作になく向上基づくことに疑わで。 または、フリーで主従をいい受け入れという、そのユースの目的が短い参考しられている裁判の一部を著作さたり、フリー権がアニメにし事項によって、 そのファイル物の可能向上の場合で受信さたり下げ場ます。
�܂��͂ɁA���p���Ƃƈ��p��������ʂ����ɂ������Ƃ��A �Ή������A��Ƃ��Ă͔��f�҂̒���ɂ��@��̖��͖��������ƂŁA ��Љ�҂��A��\�̈��p���������Ď��T�����M����Ă����������܂��B �܂����̂܂܂́A�v���v���Ɍ��c������Ă��������G�������ɗv�����A���[����Ɉ��p�����ƂƂ����A���͂̋L���ɂ���Ė{���̒���ɂȂ������Â����Ƃɋ^��ŁB �܂��́A�t���[�Ŏ�]����������Ƃ����A���̃��[�X�̖ړI���Z���Q�l�����Ă���ٔ��̈ꕔ�삳����A�t���[�����A�j���ɂ������ɂ���āA ���̃t�@�C�����̉�\����̏ꍇ�Ŏ�M�����艺����܂��B
In the case of our sample code, we already know that the latter uses SHIFT-JIS (because we're the ones that converted from UTF-8 to SHIFT-JIS), meaning that we could easily just use iconv() to convert it back to UTF-8 again, without the need for complex classes, external dependencies, etc. But what about cases where we don't know which character encoding is being used? In some cases, we may be unable to reliably predict which character encodings some data could contain, due to how that data is sourced or for a variety of other reasons. That's where the Demojibakefier can help.
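For the known-encoding case mentioned above, the round trip needs nothing beyond iconv(). The following is a minimal sketch, assuming that the iconv extension is loaded and that it accepts the same SHIFT-JIS name used by the sample code. The short sample string and the variable names are our own, chosen for illustration.

```php
<?php
// Short UTF-8 sample (a stand-in for the longer placeholder text).
$Original = '文字化け';

// Convert to SHIFT-JIS, in the same way as the sample code does.
$ShiftJIS = iconv('UTF-8', 'SHIFT-JIS', $Original);

// Because the source encoding is known, converting back is a single call.
$Restored = iconv('SHIFT-JIS', 'UTF-8', $ShiftJIS);

// Report whether the round trip reproduced the original bytes.
var_dump($Restored === $Original);
```

This only works because the source encoding is supplied explicitly. Without that knowledge, the second call has nothing to go on.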
Let's try the same thing again, but this time, we'll pretend that we don't know which character encoding we've converted the placeholder text to. We'll pretend that the only thing we know, is that everything should be using UTF-8. We'll use the Demojibakefier to try to automatically convert it back to UTF-8, without the need for us to specify which character encoding we're converting from.
<?php
/** All the same stuff as before (up until where we closed our HTML tags). */
$TextJA = 'またはに、引用商業と引用するればい一般が特にししことも、
対応ただ、一切としては判断者の著作による法上の問題は満たすことで、
被紹介者も、可能の引用が満たすて事典を執筆しりてくださいたます。
またそのままは、要求要件に決議引きれてください雑誌が仮に要求さ、ルール上に引用さことという、文章の記事によって本文の著作になく向上基づくことに疑わで。
または、フリーで主従をいい受け入れという、そのユースの目的が短い参考しられている裁判の一部を著作さたり、フリー権がアニメにし事項によって、
そのファイル物の可能向上の場合で受信さたり下げ場ます。';
echo '<!doctype html><html lang="ja-JP"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
echo $TextJA . '<br /><br />';
echo $TextJA_SHIFTJIS = iconv('UTF-8', 'SHIFT-JIS', $TextJA);
/** Now we'll create a Demojibakefier instance. */
$Demojibakefier = new \Maikuolan\Common\Demojibakefier();
/**
* Now we'll run our SHIFT-JIS text (remember that we're pretending that we
* don't know which encoding it uses) through Demojibakefier's guard method.
* Echo a few linebreaks for better readability and the output of guard.
*/
echo '<br /><br />' . $Demojibakefier->guard($TextJA_SHIFTJIS);
/** And finally, our closing HTML tags. */
echo '</body></html>';
This time, it should produce something like this (note that the output of guard is the same as our original UTF-8 text):
またはに、引用商業と引用するればい一般が特にししことも、 対応ただ、一切としては判断者の著作による法上の問題は満たすことで、 被紹介者も、可能の引用が満たすて事典を執筆しりてくださいたます。 またそのままは、要求要件に決議引きれてください雑誌が仮に要求さ、ルール上に引用さことという、文章の記事によって本文の著作になく向上基づくことに疑わで。 または、フリーで主従をいい受け入れという、そのユースの目的が短い参考しられている裁判の一部を著作さたり、フリー権がアニメにし事項によって、 そのファイル物の可能向上の場合で受信さたり下げ場ます。
�܂��͂ɁA���p���Ƃƈ��p��������ʂ����ɂ������Ƃ��A �Ή������A��Ƃ��Ă͔��f�҂̒���ɂ��@��̖��͖��������ƂŁA ��Љ�҂��A��\�̈��p���������Ď��T�����M����Ă����������܂��B �܂����̂܂܂́A�v���v���Ɍ��c������Ă��������G�������ɗv�����A���[����Ɉ��p�����ƂƂ����A���͂̋L���ɂ���Ė{���̒���ɂȂ������Â����Ƃɋ^��ŁB �܂��́A�t���[�Ŏ�]����������Ƃ����A���̃��[�X�̖ړI���Z���Q�l�����Ă���ٔ��̈ꕔ�삳����A�t���[�����A�j���ɂ������ɂ���āA ���̃t�@�C�����̉�\����̏ꍇ�Ŏ�M�����艺����܂��B
またはに、引用商業と引用するればい一般が特にししことも、 対応ただ、一切としては判断者の著作による法上の問題は満たすことで、 被紹介者も、可能の引用が満たすて事典を執筆しりてくださいたます。 またそのままは、要求要件に決議引きれてください雑誌が仮に要求さ、ルール上に引用さことという、文章の記事によって本文の著作になく向上基づくことに疑わで。 または、フリーで主従をいい受け入れという、そのユースの目的が短い参考しられている裁判の一部を著作さたり、フリー権がアニメにし事項によって、 そのファイル物の可能向上の場合で受信さたり下げ場ます。
- Demojibakefier's constructor.
- supported method.
- checkConformity method.
- weigh method.
- dropVariants method.
- shannonEntropy method.
- normalise method.
- guard method.
- Last member.
- Len member.
- Segments member.
To use the Demojibakefier, you'll firstly need to instantiate it.
public function __construct(string $NormaliseTo = '');
Demojibakefier's constructor optionally accepts one parameter: The character encoding that it should use whenever trying to normalise data. When omitted, UTF-8 will be used.
After you've instantiated the Demojibakefier, you can start demojibakefying data by using the instance's normalise or guard methods.
The supported method returns an array of all the character encoding types known to and supported by the Demojibakefier.
public function supported(): array;
Character encoding types currently known and supported by the Demojibakefier:
- UTF-8
- UTF-16BE
- UTF-16LE
- ISO-8859-1
- CP1252
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- ISO-8859-10
- ISO-8859-11
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
- CP1250
- CP1251
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- GB18030
- GB2312
- BIG5
- SHIFT-JIS
- JOHAB
- UCS-2
- UTF-32BE
- UTF-32LE
- UCS-4
- CP437
- CP737
- CP775
- CP850
- CP852
- CP855
- CP857
- CP860
- CP861
- CP862
- CP863
- CP864
- CP865
- CP866
- CP869
- CP874
- KOI8-RU
- KOI8-R
- KOI8-U
- KOI8-F
- KOI8-T
- CP037
- CP500
- CP858
- CP875
- CP1026
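If you'd like to check for a particular encoding at runtime rather than consulting the list above, something like the following sketch might do. The helper name isSupported is hypothetical, and the sketch assumes that supported() returns a flat array of encoding names written as they appear in the list above. It also assumes PHP 7.2 or later (for the object type declaration).

```php
<?php
/**
 * Hypothetical helper: checks whether $Encoding appears in the array
 * returned by the instance's supported() method, ignoring letter case.
 * Assumes that supported() returns a flat array of encoding names.
 */
function isSupported(object $Demojibakefier, string $Encoding): bool
{
    $Known = array_map('strtoupper', $Demojibakefier->supported());
    return in_array(strtoupper($Encoding), $Known, true);
}

// Usage, assuming the class has already been autoloaded:
// $Demojibakefier = new \Maikuolan\Common\Demojibakefier();
// var_dump(isSupported($Demojibakefier, 'shift-jis'));
```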
Note that the reliability of the Demojibakefier's ability to normalise strings, and of using it to convert a string between two particular character encoding types, can vary significantly depending on the character encoding types in question and the length and nature of the string, among other factors.
Note also that the Demojibakefier isn't a linguistic translator. It isn't designed to test the intelligibility of strings beyond the conformity of their byte sequences to the various character encoding types supported by the class, or beyond the few rudimentary heuristics that it implements (such as the comparative likelihood of particular byte sequences occurring within the kinds of texts that typically use particular character encoding types).
This means that an entirely unintelligible string could be regarded as already conformant, and therefore not be normalised, as long as its byte sequence conforms to that expected by the instance's target character encoding. Conversely, the Demojibakefier could theoretically produce an entirely unintelligible string when the provided string doesn't already conform to the instance's target character encoding, but conforms to one or more of the other character encoding types supported by the class, passes all heuristics, and still reads unintelligibly in the character encoding types that it conforms to.
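To illustrate why conformity alone can't establish intelligibility, here is a small sketch that is independent of the Demojibakefier itself. It assumes that the iconv and PCRE extensions are available, and the variable names are our own. The same two bytes conform to two different encodings but read differently in each, and nothing at the byte level says which reading was intended.

```php
<?php
// The two bytes C3 A9 are well-formed UTF-8 (together they encode "é").
$Bytes = "\xC3\xA9";

// preg_match() with the "u" modifier returns 1 here, and would return
// false if the subject weren't well-formed UTF-8.
$IsValidUTF8 = preg_match('//u', $Bytes) === 1;

// Every byte is also a defined ISO-8859-1 character, so the same bytes
// read as "Ã©" when treated as ISO-8859-1 and converted to UTF-8.
$ReadAsLatin1 = iconv('ISO-8859-1', 'UTF-8', $Bytes);

var_dump($IsValidUTF8);
echo $ReadAsLatin1, "\n";
```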
The checkConformity method checks for byte sequences that shouldn't normally appear in a specified character encoding (the second parameter) as a means of roughly guessing whether the string (the first parameter) likely conforms to the specified character encoding. The second parameter is optional, defaulting to the instance's default character encoding when omitted (the character encoding provided to the constructor at instantiation, or UTF-8 when none was provided). Returns true when the string conforms (per specs), or false otherwise.
public function checkConformity(string $String, string $Encoding = ''): bool;
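As a rough illustration of what a conformity check involves, the sketch below tests a string against UTF-8 only. It is not the class's actual implementation, which handles many encodings. The function name looksLikeUtf8 is hypothetical, and the sketch relies on PCRE's "u" modifier rejecting malformed UTF-8.

```php
<?php
/**
 * Hypothetical helper: returns true when $String is well-formed UTF-8.
 * preg_match() with the "u" modifier returns false (not 0) when the
 * subject contains malformed UTF-8, so only a result of 1 means success.
 */
function looksLikeUtf8(string $String): bool
{
    return preg_match('//u', $String) === 1;
}

// "文字化け" as UTF-8 (assuming that this file itself is saved as UTF-8).
$Utf8 = '文字化け';

// The same text as SHIFT-JIS bytes, written out explicitly.
$ShiftJIS = "\x95\xB6\x8E\x9A\x89\xBB\x82\xAF";

var_dump(looksLikeUtf8($Utf8));
var_dump(looksLikeUtf8($ShiftJIS));
```

The SHIFT-JIS bytes fail because the first byte, 0x95, can only be a continuation byte in UTF-8 and so can't begin a character.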
The weigh method attempts to apply weighting to potential character encoding candidates based on the frequency/occurrence of specific byte sequences, and the lack thereof, within a string. The method is private, so it can't be called directly by the implementation.
private function weigh(string $String, array &$Arr);
The dropVariants method drops candidates belonging to encodings that are outdated subsets or variants of other encodings with valid candidates. The method is private, so it can't be called directly by the implementation.
private function dropVariants(array &$Arr);
The shannonEntropy method calculates the Shannon entropy of a string (the sole accepted parameter). This method isn't used by any current versions of the Demojibakefier, but its use is planned for a future version.
public function shannonEntropy(string $String);
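For readers unfamiliar with Shannon entropy, the following stand-in shows one common way to compute it. It is not the class's own code: the function name shannonEntropySketch is hypothetical, and working byte by byte and reporting the result in bits per byte are assumptions, since neither the unit nor the return type is stated above.

```php
<?php
/**
 * Hypothetical stand-in: Shannon entropy of a string, measured over its
 * bytes and expressed in bits per byte. Returns 0.0 for an empty string.
 */
function shannonEntropySketch(string $String): float
{
    $Length = strlen($String);
    if ($Length === 0) {
        return 0.0;
    }
    $Entropy = 0.0;
    // count_chars() in mode 1 lists only the byte values that occur,
    // each with the number of times it occurs.
    foreach (count_chars($String, 1) as $Count) {
        $Probability = $Count / $Length;
        $Entropy -= $Probability * log($Probability, 2);
    }
    return $Entropy;
}

// A string of one repeated byte carries no information per byte, while
// four equally frequent bytes carry two bits per byte.
var_dump(shannonEntropySketch('aaaa'));
var_dump(shannonEntropySketch('abcd'));
```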
The normalise method attempts to normalise a string (the sole accepted parameter), returning the normalised string. The string is returned verbatim when it can't be normalised reliably and confidently, when its byte sequence already conforms to the target character encoding, or when it's empty.
public function normalise(string $String): string;
When normalise is called, it immediately resets the Last member and immediately populates the Len member. The Last member is then populated as soon as the Demojibakefier decides which character encoding it thinks the provided string uses (assuming it's able to come to a decision).
The Demojibakefier relies heavily upon PHP's iconv() functionality in order to work as intended. If iconv() isn't available, the Demojibakefier won't be able to work properly, and attempting to call the normalise method in such a situation will cause fatal errors. The guard method provides a means of avoiding that problem. It checks firstly whether iconv() is available, and secondly whether the byte sequence of the provided string (the sole accepted parameter) already conforms to the instance's target character encoding (if it already conforms, the Demojibakefier doesn't need to do anything with the string anyway). If iconv() isn't available, or if the provided string already conforms, the string is immediately returned verbatim. Otherwise (i.e., only if iconv() is available and the string doesn't already conform), guard calls normalise to attempt to normalise the provided string, and returns whatever normalise returns. Calling the guard method is therefore slightly safer than calling the normalise method directly.
public function guard(string $String): string;
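The decision flow described for guard can be sketched as follows. This is an illustration under stated assumptions rather than the class's actual code: the function name guardSketch is hypothetical, and the $Conforms and $Normalise callables are stand-ins for the instance's own conformity check and normalise method.

```php
<?php
/**
 * Hypothetical sketch of the decision flow described for guard.
 * $Conforms stands in for the instance's conformity check, and
 * $Normalise stands in for the instance's normalise method.
 */
function guardSketch(string $String, callable $Conforms, callable $Normalise): string
{
    // Without iconv(), or when there's nothing to fix, return the string verbatim.
    if (!function_exists('iconv') || $Conforms($String)) {
        return $String;
    }
    // Otherwise, pass the string on and return whatever comes back.
    return $Normalise($String);
}

// Trivial stand-ins, purely to show the two paths: the first call treats
// the string as already conformant, and the second treats it as needing work.
$AlwaysConforms = function (string $String): bool {
    return true;
};
$NeverConforms = function (string $String): bool {
    return false;
};
echo guardSketch('abc', $AlwaysConforms, 'strtoupper'), "\n";
echo guardSketch('abc', $NeverConforms, 'strtoupper'), "\n";
```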
The Last member is a string populated by the normalise method, and can be used after calling normalise or guard, to determine the most recent character encoding that the Demojibakefier converted a string from.
public $Last = '';
Example usage:
<?php
/** Instantiate the Demojibakefier. */
$Demojibakefier = new \Maikuolan\Common\Demojibakefier();
/** Iterate through an arbitrary array of elements containing presumably wrongly encoded data. */
foreach ($Array as $Element) {
    /** Provide each element to guard. */
    $TestString = $Demojibakefier->guard($Element);
    /** Check whether the result of guard is different to the original string, and whether Last has been populated. */
    if ($TestString !== $Element && $Demojibakefier->Last) {
        // Element was normalised using the encoding specified by Last (do here as you would accordingly).
    } else {
        // Nothing was normalised (do here as you would accordingly).
    }
}
CIDRAM and phpMussel do something similar on the front-end logs page, to inform users when log entry fields have been transformed by the Demojibakefier.
The Len member is an integer populated by the normalise method, representing the total length of the provided string (it uses strlen() internally to do this).
public $Len = -1;
The Segments member defines the maximum number of segments that a string being normalised can be split into, with each segment being normalised separately.
public $Segments = 65536;
Last Updated: 30 September 2020 (2020.09.30).