UTF-8 Validator


The UTF-8 validator reads chunks of bytes of arbitrary length and outputs chunks containing only complete UTF-8 sequences. Sequences that span a chunk boundary are joined. Invalid bytes and sequences are replaced with the replacement character � (U+FFFD).
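To illustrate what joining sequences across chunk boundaries involves, here is a minimal, self-contained sketch that counts how many bytes at the end of a chunk belong to a sequence that has started but not finished. It is not taken from the validator's source; the names `utf8_expected_length` and `utf8_incomplete_tail` are hypothetical, and the check is purely structural (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total length of a sequence as announced by its lead byte, or 0 if the
 * byte cannot start a sequence (continuation byte or invalid lead byte). */
static size_t utf8_expected_length(uint8_t lead)
{
    if (lead < 0x80) return 1;                   /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;                                    /* 0x80..0xC1, 0xF5..0xFF */
}

/* Number of trailing bytes of buf that start a sequence which needs more
 * bytes from the next chunk; 0 if the chunk ends on a sequence boundary. */
static size_t utf8_incomplete_tail(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* A sequence is at most 4 bytes, so an incomplete one is at most 3. */
    size_t max_back = len < 3 ? len : 3;

    for (size_t back = 1; back <= max_back; back++) {
        uint8_t b = buf[len - back];
        if ((b & 0xC0) == 0x80) {
            continue;                /* continuation byte: keep looking back */
        }
        size_t need = utf8_expected_length(b);
        return (need > back) ? back : 0;
    }
    return 0;  /* only continuation bytes seen: nothing to carry over */
}
```

A caller could hold back the reported number of bytes and prepend them to the next chunk. For the chunk `'a', 0xE2, 0x82` the function returns 2, because the three-byte sequence for `€` (`0xE2 0x82 0xAC`) is still missing its last byte.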

The validator applies the checks suggested by Markus G. Kuhn (http://www.cl.cam.ac.uk/~mgk25/), using his test file http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.

The following are considered invalid:

  • Invalid initial bytes and detached continuation bytes
  • Incomplete sequences
  • Overlong encodings of a code point
  • Low and high surrogates
  • Code points in the "internal use area"
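The categories above can be sketched as a classifier for a single sequence. This is an illustrative sketch, not the validator's actual implementation: `seq_status` and `classify_sequence` are hypothetical names. The "internal use area" check is left out because the exact code points it covers are not stated here, and the sketch adds a check for code points above U+10FFFF, which UTF-8 does not permit either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SEQ_VALID,
    SEQ_INVALID_LEAD,  /* invalid initial byte or detached continuation byte */
    SEQ_INCOMPLETE,    /* input ends early or a continuation byte is missing */
    SEQ_OVERLONG,      /* code point encoded with more bytes than necessary */
    SEQ_SURROGATE,     /* U+D800..U+DFFF */
    SEQ_OUT_OF_RANGE   /* above U+10FFFF (not in the list above) */
} seq_status;

/* Classify the sequence that starts at s[0]; len is the number of bytes
 * available, which may be more or fewer than the sequence needs. */
static seq_status classify_sequence(const uint8_t *s, size_t len)
{
    assert(s != NULL && len > 0);

    uint8_t lead = s[0];
    size_t need;    /* total bytes announced by the lead byte */
    uint32_t cp;    /* code point being assembled */
    uint32_t min;   /* smallest code point that needs this many bytes */

    if (lead < 0x80) {
        return SEQ_VALID;                        /* single ASCII byte */
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4; cp = lead & 0x07; min = 0x10000;
    } else {
        return SEQ_INVALID_LEAD;                 /* 10xxxxxx or 11111xxx */
    }

    for (size_t i = 1; i < need; i++) {
        if (i >= len || (s[i] & 0xC0) != 0x80) {
            return SEQ_INCOMPLETE;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min) return SEQ_OVERLONG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SEQ_SURROGATE;
    if (cp > 0x10FFFF) return SEQ_OUT_OF_RANGE;
    return SEQ_VALID;
}
```

For example, `0xC0 0x80` (a two-byte encoding of U+0000) is reported as overlong, and `0xED 0xA0 0x80` (U+D800) as a surrogate.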

Example

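// readFile() and handleChunk() stand for application-defined helpers.
size_t dataSize;
uint8_t* data = readFile("UTF-8-test.txt", &dataSize);

uint8_t buffer[4096];
utf8_validator validator = {0};  // zero-initialized validator state

uint8_t* dataPtr = data;
size_t inSize = dataSize;
size_t outSize;

// Keep calling the validator until all input has been consumed.
while (inSize) {
    outSize = sizeof(buffer);  // capacity going in, bytes written coming out
    utf8_validate(&validator, &dataPtr, &inSize, buffer, &outSize);

    if (outSize) {
        handleChunk(buffer, outSize);
    }
}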

The end of the stream is signalled by passing an empty chunk. This lets the validator detect a truncated final sequence.

outSize = sizeof(buffer);
utf8_validate(&validator, NULL, NULL, buffer, &outSize);

if (outSize) {
    handleChunk(buffer, outSize);
}

The output buffer should be as large as possible. The absolute minimum size is 72 bytes, but a buffer that small is very inefficient.