Add UTF-8 support #19

pcbeard · 2023-01-20T11:06:33Z

This is my attempt to add support for UTF-8 input. This generalizes the reading and writing of characters and strings, using the Character type instead of representing characters as UInt8.

I've tested it primarily on Arch Linux, both over SSH and directly in a KDE terminal application. All the unit tests pass, and terminal display and editing seem to work. With more work, it should be possible to simplify the buffer indexing, since it now uses an internal buffer of [Character] as the intermediate representation. This would allow using Int for character indexing, instead of having to use buffer.startIndex and .endIndex. However, I've left the indexing unchanged for now, since startIndex and endIndex would still be needed when working with array slices.
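To illustrate the indexing trade-off, here is a minimal sketch comparing String.Index, Int indexing into [Character], and ArraySlice indexing. The sample text and variable names are made up for this example and are not taken from the PR.

```swift
let text = "naïve 😀"

// String requires String.Index; an integer offset must be converted first.
let stringIndex = text.index(text.startIndex, offsetBy: 2)
let fromString = text[stringIndex]

// [Character] can be indexed directly with Int.
let buffer = Array(text)
let fromArray = buffer[2]

// An ArraySlice keeps the indices of its base array, so it does not start at 0.
// This is why startIndex/endIndex remain useful when slices are involved.
let tail = buffer[2...]
let sliceStart = tail.startIndex   // 2, not 0
let firstOfTail = tail[tail.startIndex]
```

Both lookups return the same Character, but code that assumes a slice begins at index 0 would read the wrong element or trap.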

This will simplify editing by allowing the use of Int indexes instead of having to use String.Index.

- Rename EditState.currentBuffer to .text, since it's a String, which is generated from .buffer.
- Redefine how .cursorPosition is computed, considering non-ASCII characters to occupy two positions (works with Linux and Windows terminals command line editing). This was experimentally determined.
- Change return type of LineNoise.readCharacter() to Character?. This allows more complex characters to be input, such as emoji, which may occupy up to 4 UTF-8 bytes per character.
- Implement a simple UTF-8 decoding state machine by inspecting the first input byte. If its value is < 128, then it's an ASCII character, which can be immediately returned. Otherwise the number of high order bits set in the byte is used to compute the number of expected bytes needed to assemble the full character. These are buffered in a Data object, and when complete, returned as a Character.
- Change LineNoise.output(character:) to correctly convert the character into the proper number of bytes to pass to write, using its .utf8 property.
- Change LineNoise.output(text:) to correctly convert a String to the proper number of bytes using its .utf8 property.
- Change LineNoise.getCursorXPosition(inputFile:outputFile:) to use [Character] for its local character buffer.
- Change LineNoise.handleCharacter(_:editState:) to use Character as its first parameter type, and use .asciiValue to compare with the various control characters in the ControlCharacters enum. This works because .asciiValue returns nil if the character isn't ASCII, instead of silently truncating to an ASCII character.
- Change LineNoise.insertCharacter(_:editState:) to use Character as its first parameter type.
- Change unit tests to use .buffer and .text properties consistently. In some cases, it makes sense to write strings to .text, and in other cases array values to .buffer.
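The decoding step described above can be sketched roughly as follows. This is not the code from the PR: expectedUTF8Length(leadByte:) and decodeCharacter(nextByte:) are hypothetical names, and the nextByte closure stands in for whatever reads one byte from the terminal. The sketch decodes one Unicode scalar at a time and does not try to combine multi-scalar grapheme clusters.

```swift
import Foundation

/// Hypothetical helper: length of a UTF-8 sequence, derived from its first byte.
func expectedUTF8Length(leadByte: UInt8) -> Int? {
    // Inverting the byte turns the run of leading 1 bits into leading 0 bits,
    // which leadingZeroBitCount can then count.
    switch (~leadByte).leadingZeroBitCount {
    case 0: return 1      // 0xxxxxxx: ASCII
    case 2: return 2      // 110xxxxx
    case 3: return 3      // 1110xxxx
    case 4: return 4      // 11110xxx
    default: return nil   // 10xxxxxx continuation byte or invalid lead byte
    }
}

/// Hypothetical stand-in for readCharacter(): pulls bytes from `nextByte`
/// until one complete UTF-8 sequence has been assembled.
func decodeCharacter(nextByte: () -> UInt8?) -> Character? {
    guard let lead = nextByte(),
          let length = expectedUTF8Length(leadByte: lead) else { return nil }

    // ASCII can be returned immediately.
    if length == 1 { return Character(UnicodeScalar(lead)) }

    // Buffer the lead byte plus the expected continuation bytes.
    var bytes = Data([lead])
    while bytes.count < length {
        guard let byte = nextByte(), byte & 0xC0 == 0x80 else { return nil }
        bytes.append(byte)
    }

    // Let Foundation validate the sequence (rejects overlong forms and so on).
    guard let string = String(data: bytes, encoding: .utf8),
          string.count == 1 else { return nil }
    return string.first
}
```

For example, feeding it the bytes of "a€😀" from an iterator, as in `var input = Array("a€😀".utf8).makeIterator()` followed by repeated calls to `decodeCharacter(nextByte: { input.next() })`, yields the three characters in turn and then nil once the input is exhausted.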

pcbeard added 3 commits January 20, 2023 02:11

No need to cast to Int, nor to force-unwrap

4607688

Helper extensions to convert between [Character] and String

a70db4f
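For readers without the diff at hand, conversion helpers of the kind the second commit mentions might look like the sketch below. The property names asString and asCharacters are hypothetical and chosen only for illustration; the names used in the actual commit may differ.

```swift
extension Array where Element == Character {
    /// Hypothetical name: joins the characters back into a String.
    var asString: String { String(self) }
}

extension String {
    /// Hypothetical name: splits the string into an array of Characters.
    var asCharacters: [Character] { Array(self) }
}

// Round trip between the two representations.
let characters = "héllo 😀".asCharacters
let roundTripped = characters.asString
```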
