Decoding examples

Tom Honermann edited this page Jul 2, 2017 · 2 revisions

Basic decoding

Text_view iterators produce characters as their element type; a character is a class object with an associated character set and a code point value. In the following example, note that \u00F8 (LATIN SMALL LETTER O WITH STROKE) is encoded in UTF-8 as two code units (\xC3\xB8), but iterator-based enumeration sees only the single code point.

using CT = utf8_encoding::character_type;
auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
auto it = tv.begin();
assert(*it++ == CT{0x004A}); // 'J'
assert(*it++ == CT{0x00F8}); // 'ø'
assert(*it++ == CT{0x0065}); // 'e'
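
To see why two code units collapse into one code point, the following standalone sketch decodes UTF-8 by hand. It is not how Text_view is implemented: decode_utf8_sketch is a hypothetical helper that uses only the standard library, assumes well-formed input, and performs no error checking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Text_view: decodes well-formed UTF-8
// into code points. Malformed input is not detected here.
std::vector<char32_t> decode_utf8_sketch(const std::string &bytes) {
  std::vector<char32_t> code_points;
  std::size_t i = 0;
  while (i < bytes.size()) {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t length = 1;  // Number of code units in this sequence.
    char32_t cp = lead;      // Payload bits taken from the leading code unit.
    if (lead >= 0xF0)      { length = 4; cp = lead & 0x07; }
    else if (lead >= 0xE0) { length = 3; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { length = 2; cp = lead & 0x1F; }
    // Each continuation code unit contributes its low six bits.
    for (std::size_t k = 1; k < length && i + k < bytes.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
    }
    code_points.push_back(cp);
    i += length;
  }
  return code_points;
}
```

For the input "J\xC3\xB8" "erg" (the literal is split so that 'e' is not read as part of the hexadecimal escape), the sketch yields five code points, the second of which is U+00F8.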

The iterators and ranges that Text_view provides are compatible with the non-modifying sequence algorithms in the standard C++ <algorithm> header. This allows standard algorithms to be used to search encoded text.

it = std::find(tv.begin(), tv.end(), CT{0x00F8});
assert(it != tv.end());

Text_view iterators also give access to the underlying code unit sequence of the character they refer to.

auto base_it = it.base_range().begin();
assert(*base_it++ == '\xC3');
assert(*base_it++ == '\xB8');
assert(base_it == it.base_range().end());

Text_view ranges can be used in range-based for statements.

for (const auto &ch : tv) {
  ...
}

Error handling with exceptions

By default, exceptions are thrown when errors occur during decoding operations.

auto tv = make_text_view<utf8_encoding>("\xc2"); // Invalid UTF-8 code unit sequence.
auto it = tv.begin();
try {
  auto c = *it;  // Throws 'text_decode_error'.
} catch (text_decode_error &tde) {
  // Exception caught.
}

Error handling without exceptions

Text_view iterators allow checking for error conditions before exceptions are thrown.

auto tv = make_text_view<utf8_encoding>("\xc2"); // Invalid UTF-8 code unit sequence.
auto it = tv.begin();
assert(it.error_occurred());
decode_status ds = it.get_error();
assert(ds == decode_status::invalid_code_unit_sequence);
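
The idea of reporting a decoding failure as a status value instead of an exception can be sketched without the library. The names below (sketch_status, sketch_result, decode_first_sketch) are hypothetical and do not reproduce Text_view's interface; the sketch only handles one- and two-code-unit sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical status type for this sketch; Text_view's own decode_status
// enumeration is not reproduced here.
enum class sketch_status { ok, invalid_code_unit_sequence };

struct sketch_result {
  sketch_status status;
  char32_t code_point;  // Meaningful only when status is ok.
};

// Decodes the one- or two-code-unit UTF-8 sequence at the start of `bytes`
// and reports failure through the return value, never through an exception.
// Longer sequences are rejected to keep the sketch short.
sketch_result decode_first_sketch(const std::string &bytes) {
  const sketch_result failure{sketch_status::invalid_code_unit_sequence, 0};
  if (bytes.empty()) return failure;
  const unsigned char lead = static_cast<unsigned char>(bytes[0]);
  if (lead < 0x80) return {sketch_status::ok, static_cast<char32_t>(lead)};
  // Only 0xC2..0xDF introduce a valid two-code-unit sequence.
  if (lead < 0xC2 || lead > 0xDF) return failure;
  if (bytes.size() < 2) return failure;  // Truncated, as with "\xc2".
  const unsigned char trail = static_cast<unsigned char>(bytes[1]);
  if ((trail & 0xC0) != 0x80) return failure;  // Not a continuation unit.
  const char32_t cp =
      (static_cast<char32_t>(lead & 0x1F) << 6) | (trail & 0x3F);
  return {sketch_status::ok, cp};
}
```

With this sketch, decode_first_sketch("\xC2") returns a result whose status is invalid_code_unit_sequence, and the caller decides what to do next without any try block.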

Error handling with substitutions

Text_view's error policies allow creating views and iterators that substitute a character-set-specific substitution character whenever an error is encountered in a code unit sequence. For example:

using CT = utf8_encoding::character_type;
auto tv = make_text_view<utf8_encoding, text_permissive_error_policy>("\xc2"); // Invalid UTF-8 code unit sequence.
auto it = tv.begin();
assert(*it == CT{0xFFFD}); // U+FFFD REPLACEMENT CHARACTER, the Unicode substitution character.
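
How an error policy can switch between throwing and substituting is sketched below as a template parameter. This is an illustration of the general technique only: strict_policy_sketch, permissive_policy_sketch, and decode_first_with_policy are hypothetical names, and the sketch makes no claim about how Text_view implements its policies.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical policy types for this sketch; they only imitate the idea
// behind Text_view's error policies.
struct strict_policy_sketch {
  static char32_t on_error() {
    throw std::runtime_error("invalid code unit sequence");
  }
};
struct permissive_policy_sketch {
  static char32_t on_error() { return 0xFFFD; }  // U+FFFD REPLACEMENT CHARACTER
};

// Decodes the first one- or two-code-unit UTF-8 sequence in `bytes`.
// What happens on malformed input is delegated to ErrorPolicy.
template <typename ErrorPolicy>
char32_t decode_first_with_policy(const std::string &bytes) {
  if (bytes.empty()) return ErrorPolicy::on_error();
  const unsigned char lead = static_cast<unsigned char>(bytes[0]);
  if (lead < 0x80) return lead;
  const bool two_unit_lead = lead >= 0xC2 && lead <= 0xDF;
  const bool has_trail =
      bytes.size() >= 2 &&
      (static_cast<unsigned char>(bytes[1]) & 0xC0) == 0x80;
  if (!two_unit_lead || !has_trail) return ErrorPolicy::on_error();
  return (static_cast<char32_t>(lead & 0x1F) << 6) |
         (static_cast<unsigned char>(bytes[1]) & 0x3F);
}
```

Instantiated with permissive_policy_sketch, the truncated input "\xC2" decodes to U+FFFD; instantiated with strict_policy_sketch, the same input throws. Choosing the policy at compile time keeps the decoding loop free of run-time checks for which behavior is wanted.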