GitHub - vthriller/utf-p: Experimental character encoding inspired by UTF-8.

vthriller / utf-p Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Experimental character encoding inspired by UTF-8.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
texts		texts
.gitignore		.gitignore
README		README
Setup.hs		Setup.hs
utf-p.cabal		utf-p.cabal
utfp.hs		utfp.hs

Repository files navigation

Experimental character encoding inspired by UTF-8.
Designed to be more compact for monolingual texts (think Wikipedia articles in national languages etc.) than UTF-8 itself.
Like UTF-8, it is backward-compatible with ASCII.


A quick comparison
------------------

file                     wc -m    wc -c     wc -c     ∆        Δ (%)   wc -c       wc -c        ∆      Δ (%)
                                  (utf-8)   (utf-p)                    (utf-8.xz)  (utf-p.xz)
------------------------------------------------------------------------------------------------------------
UTF-8-demo.txt             7627     14058    10483   -  3575   -25.43    5412        5132      -  280  -5.17
...utf8-index.html        52554     59219    55823   -  3396   - 5.73   23452       23176      -  276  -1.18
wikipedia-longest/ar:*   402840    591774   479566   -112208   -18.96  132508      128704      - 3804  -2.87
wikipedia-longest/el:*   327083    553302   403060   -150242   -27.15   46276       42380      - 3896  -8.42
wikipedia-longest/en:*   802361    802392   802401   +     9   + 0.00   55992       55992           0   0.00
wikipedia-longest/es:*   905967    920849   935063   + 14214   + 1.54  181392      182040      +  648  +0.36
wikipedia-longest/he:*   257226    261711   259072   -  2639   - 1.01   65608       65536      -   72  -0.11
wikipedia-longest/ja:*   301708    780918   605887   -175031   -22.41  168744      167344      - 1400  -0.83
wikipedia-longest/ru:*   423182    695691   522478   -173213   -24.90  160320      154192      - 6128  -3.82
wikipedia-longest/th:*   530370   1474454   569850   -904604   -61.35  183152      165828      -17324  -9.46
wikipedia-longest/vi:*   415971    487281   504498   + 17217   + 3.53  128600      129252      +  652  +0.51
wikipedia-longest/zh:*   190996    506612   480025   - 26587   - 5.25  184260      191344      + 7084  +3.84


Quick start
-----------

 $ cabal sandbox init
 $ cabal install --only-dependencies
 $ cabal build

Now you can convert from UTF-P to UTF-8:

 $ xzcat texts/UTF-8-demo.txt.utfp.xz | ./dist/build/utf-p/utf-p -d

…and vice versa:

 $ curl http://news.baidu.com/ | iconv -f gb2312 -t utf-8//IGNORE | ./dist/build/utf-p/utf-p


Note on prefix-free common symbols
----------------------------------

An attempt was made to use '00xx xxxx' to always represent characters from the first half of ASCII, because it contains commonly (and frequently) used symbols such as whitespace, digits and puntuation marks.
However, this is not a good idea, and it is so for a couple of reasons.

 * It actually makes prefix switches more frequent, because only a set of just 2⁶ = 64 symbols can be represented without changing a prefix bits, with only '01xx xxxx' bytes.
   Here's a prefix usage statistics for some non-latin languages:

    $ st6() { cut -d/ -f1 | tr -d \\n | python -c 'import sys; from collections import *; print(Counter(ord(i)>>6 for i in sys.stdin.read()))'; }
    $ st7() { cut -d/ -f1 | tr -d \\n | python -c 'import sys; from collections import *; print(Counter(ord(i)>>7 for i in sys.stdin.read()))'; }

    $ aspell -l el dump master | st6
    Counter({14: 2804010, 15: 1538518})
    $ aspell -l el dump master | st7
    Counter({7: 4342528})

    $ aspell -l ru dump master | st6
    Counter({16: 1745940, 17: 895764})
    $ aspell -l ru dump master | st7
    Counter({8: 2641704})

   Stats may vary greatly for different latin-based languages as well:

    $ aspell -l es dump master | st6
    Counter({1: 408450, 3: 10399})
    $ aspell -l es dump master | st7
    Counter({0: 408450, 1: 10399})

    $ aspell -l vi dump master | st6
    Counter({1: 16622, 123: 2157, 3: 1959, 122: 1504, 4: 426, 6: 391, 5: 45})
    $ aspell -l vi dump master | st7
    Counter({0: 16622, 61: 3661, 1: 1959, 2: 471, 3: 391})

   The amount of prefixes needed to represent typical japanese text is left as an exercise for the reader.

 * Reduction of prefix-setter appearance makes garbled texts less readable. For instance, for reading process started somewhere at the middle of text, the actual content can be recovered much quicker because of prefix switches that normally occur around whitespace and punctuation characters.