Commit

Merge pull request #15 from paulhuggett/explicit-conversion
Move explicit conversion docs to their own page.
paulhuggett authored Jan 12, 2024
2 parents f821631 + 6670221 commit c0c4318
Showing 3 changed files with 58 additions and 141 deletions.
46 changes: 4 additions & 42 deletions README.md
@@ -20,11 +20,11 @@ C++ 17 [deprecated the standard library's `<codecvt>` header file](https://www.o

## Usage

There are broadly three ways to use the icubaby library:
There are three ways to use the icubaby library depending on your needs:

1. [C++ 20 Range Adaptor](#c-20-range-adaptor)
2. An iterator interface
3. Manually driving the conversion
3. [Converting one code-unit at a time](#converting-one-code-unit-at-a-time)

### C++ 20 Range Adaptor

@@ -54,7 +54,7 @@ it = t.end_cp (it);

The `icubaby::iterator<>` class offers a familiar output iterator for using a transcoder. Each code unit from the input encoding is written to the iterator, which in turn writes the transcoded output to a second iterator. This enables us to use standard algorithms such as [`std::copy`](https://en.cppreference.com/w/cpp/algorithm/copy) with the library.

### Manually Driving the Conversion
### Converting One Code-Unit at a Time

Let’s try converting a single Unicode emoji character 😀 (U+1F600 GRINNING FACE) expressed as four UTF-8 code units (0xF0, 0x9F, 0x98, 0x80) to UTF-16 (where it is the surrogate pair 0xD83D, 0xDE00).

@@ -68,45 +68,7 @@ for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
it = t.end_cp (it);
~~~

The `out` vector will contain the two UTF-16 code units 0xD83D and 0xDE00.

#### Dissecting this code

1. Define where and how the output should be written:

~~~cpp
std::vector<char16_t> out;
auto it = std::back_inserter (out);
~~~

For the purposes of this example, we write the encoded output to a `std::vector<char16_t>`. Use the container of your choice!

2. Create the transcoder instance:

~~~cpp
icubaby::t8_16 t;
~~~

[`transcoder<>`](#transcoder) is a template class which requires two arguments to define the input and output encoding. You may use `char8_t` (in C++ 20, or [`icubaby::char8`](#char8) in C++ 17 and later) for UTF-8, `char16_t` for UTF-16, and `char32_t` for UTF-32. For example, `icubaby::transcoder<char16_t, char32_t>` will convert from UTF-16 to UTF-32; `icubaby::transcoder<char8_t, char16_t>` will convert from UTF-8 to UTF-16.

There is a collection of [nine typedefs](#helper-types) to make this a little more compact. Each is named `icubaby::tI_O`, where I and O are 8, 16, or 32. For example, `icubaby::t16_32` is equivalent to `icubaby::transcoder<char16_t, char32_t>` and `icubaby::t8_16` means `icubaby::transcoder<char8_t, char16_t>`.

3. Pass each code unit and the output iterator to the transcoder.

~~~cpp
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (cu, it);
}
~~~

4. Tell the transcoder that we’ve reached the end of the input. This ensures that the sequence didn’t end part way through a code point.

~~~cpp
it = t.end_cp (it);
~~~

It’s only necessary to make a single call to `end_cp()` once *all* of the input has been fed to the transcoder.

The `out` vector will contain the two UTF-16 code units 0xD83D and 0xDE00. See the [explicit conversion documentation](https://paulhuggett.github.io/icubaby/explicit-conversion.html) for more details.

## API

52 changes: 52 additions & 0 deletions docs/explicit-conversion.md
@@ -0,0 +1,52 @@
# Explicit Conversion

Let’s try converting a single Unicode emoji character 😀 (U+1F600 GRINNING FACE) expressed as four UTF-8 code units (0xF0, 0x9F, 0x98, 0x80) to UTF-16 (where it is the surrogate pair 0xD83D, 0xDE00).

~~~cpp
std::vector<char16_t> out;
auto it = std::back_inserter (out);
icubaby::t8_16 t;
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (cu, it);
}
it = t.end_cp (it);
~~~

The `out` vector will contain the two UTF-16 code units 0xD83D and 0xDE00.
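
The surrogate pair quoted here can be checked by hand. The following is a minimal standalone sketch of the underlying arithmetic, not icubaby's implementation: the helper names `decode_utf8_4` and `encode_utf16_pair` are hypothetical, and the code handles only a well-formed four-byte sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustration only: these helpers are hypothetical and are NOT part of icubaby.

// Decodes one well-formed 4-byte UTF-8 sequence into a code point.
inline char32_t decode_utf8_4 (std::uint8_t b0, std::uint8_t b1, std::uint8_t b2, std::uint8_t b3) {
  assert ((b0 & 0xF8) == 0xF0);  // lead byte of a 4-byte sequence: 11110xxx
  assert ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80 && (b3 & 0xC0) == 0x80);  // 10xxxxxx
  return (static_cast<char32_t> (b0 & 0x07) << 18) |  // 3 payload bits
         (static_cast<char32_t> (b1 & 0x3F) << 12) |  // 6 payload bits each
         (static_cast<char32_t> (b2 & 0x3F) << 6) |
         static_cast<char32_t> (b3 & 0x3F);
}

// Encodes a supplementary-plane code point (U+10000..U+10FFFF) as a UTF-16 surrogate pair.
inline std::vector<char16_t> encode_utf16_pair (char32_t cp) {
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  char32_t const v = cp - 0x10000;  // 20 significant bits remain
  return {static_cast<char16_t> (0xD800 + (v >> 10)),     // high surrogate: top 10 bits
          static_cast<char16_t> (0xDC00 + (v & 0x3FF))};  // low surrogate: bottom 10 bits
}
```

Calling `decode_utf8_4 (0xF0, 0x9F, 0x98, 0x80)` yields U+1F600, and `encode_utf16_pair (0x1F600)` yields 0xD83D followed by 0xDE00, matching the values in the example.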

## Dissecting this code

1. Define where and how the output should be written:

~~~cpp
std::vector<char16_t> out;
auto it = std::back_inserter (out);
~~~

For the purposes of this example, we write the encoded output to a `std::vector<char16_t>`. Use the container of your choice!

2. Create the transcoder instance:

~~~cpp
icubaby::t8_16 t;
~~~

[`transcoder<>`](#transcoder) is a template class which requires two arguments to define the input and output encoding. You may use `char8_t` (in C++ 20, or [`icubaby::char8`](#char8) in C++ 17 and later) for UTF-8, `char16_t` for UTF-16, and `char32_t` for UTF-32. For example, `icubaby::transcoder<char16_t, char32_t>` will convert from UTF-16 to UTF-32; `icubaby::transcoder<char8_t, char16_t>` will convert from UTF-8 to UTF-16.

There is a collection of [nine typedefs](#helper-types) to make this a little more compact. Each is named `icubaby::tI_O`, where I and O are 8, 16, or 32. For example, `icubaby::t16_32` is equivalent to `icubaby::transcoder<char16_t, char32_t>` and `icubaby::t8_16` means `icubaby::transcoder<char8_t, char16_t>`.

3. Pass each code unit and the output iterator to the transcoder.

~~~cpp
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (cu, it);
}
~~~

4. Tell the transcoder that we’ve reached the end of the input. This ensures that the sequence didn’t end part way through a code point.

~~~cpp
it = t.end_cp (it);
~~~

It’s only necessary to make a single call to `end_cp()` once *all* of the input has been fed to the transcoder.
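
To make the role of the final `end_cp()` call more concrete, here is a toy sketch of the same calling pattern: feed one code unit at a time, then signal the end of input. The class `toy_utf8_to_utf32` is hypothetical and is not icubaby's transcoder. It converts UTF-8 to UTF-32 with only minimal validation, and its choice to emit U+FFFD and clear a `well_formed()` flag on truncated input is an assumption made for illustration.

```cpp
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical toy type for illustration; it is NOT icubaby's transcoder and
// performs only minimal validation (no overlong-form or surrogate checks).
constexpr char32_t replacement_char = 0xFFFD;

class toy_utf8_to_utf32 {
public:
  // Consumes one UTF-8 code unit. A code point is written to `out` only once
  // its final code unit has arrived, so state is carried between calls.
  template <typename OutputIt>
  OutputIt operator() (std::uint8_t cu, OutputIt out) {
    if (pending_ == 0) {
      if (cu < 0x80) {
        *out++ = static_cast<char32_t> (cu);  // ASCII: complete immediately
      } else if ((cu & 0xE0) == 0xC0) {
        cp_ = static_cast<char32_t> (cu & 0x1F); pending_ = 1;
      } else if ((cu & 0xF0) == 0xE0) {
        cp_ = static_cast<char32_t> (cu & 0x0F); pending_ = 2;
      } else if ((cu & 0xF8) == 0xF0) {
        cp_ = static_cast<char32_t> (cu & 0x07); pending_ = 3;
      } else {
        well_formed_ = false;  // stray continuation or invalid lead byte
        *out++ = replacement_char;
      }
      return out;
    }
    if ((cu & 0xC0) != 0x80) {
      // Expected a continuation byte. This toy simply drops the offending unit.
      pending_ = 0;
      well_formed_ = false;
      *out++ = replacement_char;
      return out;
    }
    cp_ = (cp_ << 6) | static_cast<char32_t> (cu & 0x3F);
    if (--pending_ == 0) {
      *out++ = cp_;
    }
    return out;
  }

  // Signals the end of input: a partially received code point is an error.
  template <typename OutputIt>
  OutputIt end_cp (OutputIt out) {
    if (pending_ != 0) {
      pending_ = 0;
      well_formed_ = false;
      *out++ = replacement_char;
    }
    return out;
  }

  bool well_formed () const { return well_formed_; }

private:
  char32_t cp_ = 0;        // code point being assembled
  unsigned pending_ = 0;   // continuation bytes still expected
  bool well_formed_ = true;
};
```

Feeding this toy all four code units and then calling `end_cp()` produces the single code point U+1F600. Feeding it only 0xF0 and 0x9F leaves a code point half-assembled, and it is the `end_cp()` call that detects the truncation.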
101 changes: 2 additions & 99 deletions docs/index.md
@@ -13,105 +13,8 @@ C++ 17 [deprecated the standard library's `<codecvt>` header file](https://www.o
There are broadly three ways to use the icubaby library:

1. [C++ 20 Range Adaptor](cxx20-range-adaptor.md)
2. An iterator interface
3. Manually driving the conversion

### The Iterator Interface

~~~cpp
auto const in = std::vector{char8_t{0xF0}, char8_t{0x9F}, char8_t{0x98}, char8_t{0x80}};
std::vector<char16_t> out;
icubaby::t8_16 t;
auto it = icubaby::iterator{&t, std::back_inserter (out)};
for (auto cu: in) {
  *(it++) = cu;
}
it = t.end_cp (it);
~~~

The `icubaby::iterator<>` class offers a familiar output iterator for using a transcoder. Each code unit from the input encoding is written to the iterator, which in turn writes the transcoded output to a second iterator. This enables us to use standard algorithms such as [`std::copy`](https://en.cppreference.com/w/cpp/algorithm/copy) with the library.

### Manually Driving the Conversion

Let’s try converting a single Unicode emoji character 😀 (U+1F600 GRINNING FACE) expressed as four UTF-8 code units (0xF0, 0x9F, 0x98, 0x80) to UTF-16 (where it is the surrogate pair 0xD83D, 0xDE00).

~~~cpp
std::vector<char16_t> out;
auto it = std::back_inserter (out);
icubaby::t8_16 t;
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (cu, it);
}
it = t.end_cp (it);
~~~

The `out` vector will contain the two UTF-16 code units 0xD83D and 0xDE00.

#### Dissecting this code

1. Define where and how the output should be written:

~~~cpp
std::vector<char16_t> out;
auto it = std::back_inserter (out);
~~~

For the purposes of this example, we write the encoded output to a `std::vector<char16_t>`. Use the container of your choice!

2. Create the transcoder instance:

~~~cpp
icubaby::t8_16 t;
~~~

[`transcoder<>`](#transcoder) is a template class which requires two arguments to define the input and output encoding. You may use `char8_t` (in C++ 20, or [`icubaby::char8`](#char8) in C++ 17 and later) for UTF-8, `char16_t` for UTF-16, and `char32_t` for UTF-32. For example, `icubaby::transcoder<char16_t, char32_t>` will convert from UTF-16 to UTF-32; `icubaby::transcoder<char8_t, char16_t>` will convert from UTF-8 to UTF-16.

There is a collection of [nine typedefs](#helper-types) to make this a little more compact. Each is named `icubaby::tI_O`, where I and O are 8, 16, or 32. For example, `icubaby::t16_32` is equivalent to `icubaby::transcoder<char16_t, char32_t>` and `icubaby::t8_16` means `icubaby::transcoder<char8_t, char16_t>`.

3. Pass each code unit and the output iterator to the transcoder.

~~~cpp
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (cu, it);
}
~~~

4. Tell the transcoder that we’ve reached the end of the input. This ensures that the sequence didn’t end part way through a code point.

~~~cpp
it = t.end_cp (it);
~~~

It’s only necessary to make a single call to `end_cp()` once *all* of the input has been fed to the transcoder.

### An alternative: using icubaby::iterator

The `icubaby::iterator<>` class is an output iterator to which code units in the source encoding can be assigned. This produces equivalent code units in the output encoding, which are written to a second output iterator. This makes it straightforward to use standard library algorithms such as [`std::copy()`](https://en.cppreference.com/w/cpp/algorithm/copy) or [`std::ranges::copy()`](https://en.cppreference.com/w/cpp/algorithm/ranges/copy) with the library.

For example:

~~~cpp
std::vector<char16_t> out;
icubaby::t8_16 t;
auto it = icubaby::iterator{&t, std::back_inserter (out)};
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  *(it++) = cu;
}
it = t.end_cp (it);
~~~

This code creates an instance of `icubaby::iterator<>` named `it` which holds two values: a pointer to the transcoder `t` and an output iterator (`std::back_insert_iterator` in this case). Assigning a series of code units from the input encoding to `it` results in the `out` vector being filled with equivalent code units in the output encoding.

The above code snippet loops over the input one code unit at a time. With the input held in an `in` array, we can use `std::ranges::copy()` to achieve the same effect:

~~~cpp
std::array<char8_t, 4> const in {0xF0, 0x9F, 0x98, 0x80};
std::vector<char16_t> out;
icubaby::t8_16 t;
auto it = std::ranges::copy (in, icubaby::iterator{&t, std::back_inserter (out)}).out;
it = t.end_cp (it);
~~~
2. [An iterator interface](iterator-interface.md). Enables use of iterator-based algorithms.
3. [Explicit Conversion](explicit-conversion.md). This drives the conversion one code-unit at a time.

## API

