Skip to content

Latest commit

 

History

History
34 lines (30 loc) · 2.61 KB

encodingSchemes.md

File metadata and controls

34 lines (30 loc) · 2.61 KB

Character Encoding

  • ASCII
    • It defines 128 characters, which map to the numbers 0–127.
  • UNICODE
    • It defines (less than) 2^21 characters, which map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).
    • Unicode is a superset of ASCII, and the numbers 0–128 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
    • The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16.
    • Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8; ways to encode Unicode strings as bytes.

  • UTF-8 uses the following rules:
    • If the code point is <128, it’s represented by the corresponding byte value.
    • If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
    • Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
  • UTF-8 has several convenient properties:
    • It can handle any Unicode code point.
    • A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
    • A string of ASCII text is also valid UTF-8 text.
    • UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
    • If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
  • There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.
  • The Difference Between Unicode and UTF-8
    • Unicode is a character set. UTF-8 is encoding.
    • Unicode is a list of characters with unique decimal numbers (code points). A = 65, B = 66, C = 67, ....
      • This list of decimal numbers represent the string "hello": 104 101 108 108 111
    • Encoding is how these numbers are translated into binary numbers to be stored in a computer:
      • UTF-8 encoding will store "hello" like this (binary): 01101000 01100101 01101100 01101100 01101111
    • Encoding translates numbers into binary. Character sets translates characters to numbers.

Sources