Add UTF-8 support #19

Status: Open. Wants to merge 3 commits into base: master.

@pcbeard pcbeard commented Jan 20, 2023

This is my attempt to add support for UTF-8 input. This generalizes the reading and writing of characters and strings, using the Character type instead of representing characters as UInt8.

I've tested it primarily on Arch Linux, both over SSH and directly in a KDE terminal application. All the unit tests pass, and terminal display and editing seem to work. With more work, the buffer indexing could be simplified, since the internal buffer is now a [Character] that serves as the intermediate representation. That would allow plain Int character indexes instead of buffer.startIndex and .endIndex. However, I've left the indexing unchanged for now, since startIndex and endIndex would still be needed if array slices are used.

Storing the buffer as [Character] will simplify editing by allowing the
use of Int indexes instead of having to use String.Index.
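
To illustrate the indexing difference described above, here is a minimal sketch that inserts a character at the same offset in a String and in a [Character]. The variable names are hypothetical and this is not code from the pull request.

```swift
// Hypothetical comparison; not code from the pull request.
var text = "h\u{E9}llo"  // "héllo"

// With String, an integer offset must first be converted to a String.Index.
let stringIndex = text.index(text.startIndex, offsetBy: 2)
text.insert("X", at: stringIndex)

// With [Character], the integer offset is already a valid index.
var buffer: [Character] = Array("h\u{E9}llo")
buffer.insert("X", at: 2)

// A String for display can be regenerated from the character array.
let rendered = String(buffer)
```

Both paths end with the same text; the array version simply skips the index conversion.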

Rename EditState.currentBuffer to .text, since it's a String, which is
generated from .buffer.

Redefine how .cursorPosition is computed, treating each non-ASCII
character as occupying two positions (this works with command-line
editing in Linux and Windows terminals). The two-position width was
determined experimentally.
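
The two-position rule above could be sketched roughly as follows. The function name and parameters are assumptions made for illustration; the actual .cursorPosition implementation in the pull request may be structured differently.

```swift
// Hypothetical helper: counts each ASCII character as one position and
// every other character as two, following the rule in the commit message.
func cursorColumn(for buffer: [Character], upTo location: Int) -> Int {
    return buffer.prefix(location).reduce(0) { width, character in
        width + (character.isASCII ? 1 : 2)
    }
}
```

For example, with a buffer of "a", "ñ", "b" and a location of 3, this sketch yields 4. The rule is a heuristic taken from the description, not a general terminal-width calculation.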

Change return type of LineNoise.readCharacter() to Character?. This
allows more complex characters to be input, such as emoji, which may
occupy up to 4 UTF-8 bytes per character. Implement a simple UTF-8
decoding state machine by inspecting the first input byte. If its
value is < 128, then it's an ASCII character, which can be immediately
returned. Otherwise, the number of leading high-order bits set in the byte
is used to compute the number of expected bytes needed to assemble
the full character. These are buffered in a Data object, and when
complete, returned as a Character.
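
A minimal sketch of the decoding approach described above, assuming bytes arrive one at a time. The names expectedByteCount(forLeadingByte:) and UTF8Assembler are hypothetical and are not taken from the pull request.

```swift
import Foundation

// Hypothetical helper: derives the total sequence length from the number of
// leading 1 bits in the first byte. Returns nil for bytes that cannot start
// a sequence (continuation bytes, or more than four leading 1 bits).
func expectedByteCount(forLeadingByte byte: UInt8) -> Int? {
    let leadingOnes = (~byte).leadingZeroBitCount
    switch leadingOnes {
    case 0: return 1                   // 0xxxxxxx: ASCII
    case 2, 3, 4: return leadingOnes   // 110xxxxx, 1110xxxx, 11110xxx
    default: return nil
    }
}

// Hypothetical assembler: buffers bytes in a Data value until the sequence
// is complete. Returns nil both when more bytes are needed and when the
// input is not valid UTF-8.
struct UTF8Assembler {
    private var pending = Data()
    private var expected = 0

    mutating func feed(_ byte: UInt8) -> Character? {
        if pending.isEmpty {
            guard let count = expectedByteCount(forLeadingByte: byte) else { return nil }
            if count == 1 { return Character(UnicodeScalar(byte)) }
            expected = count
        }
        pending.append(byte)
        guard pending.count == expected else { return nil }
        defer { pending.removeAll() }
        guard let string = String(data: pending, encoding: .utf8),
              string.count == 1 else { return nil }
        return string.first
    }
}
```

Note that this sketch assembles one encoded code point at a time, so emoji built from several code points (for example with skin-tone modifiers) would be returned as separate Character values.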

Change LineNoise.output(character:) to correctly convert the character
into the proper number of bytes to pass to write, using its .utf8
property.

Change LineNoise.output(text:) to correctly convert a String to the
proper number of bytes using its .utf8 property.
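
As a sketch of the output changes above, the helpers below show how the .utf8 view determines how many bytes are passed to write. The function names are hypothetical; the real output(character:) and output(text:) methods may differ.

```swift
import Foundation

// Hypothetical helper: the UTF-8 bytes that would be written for one character.
func utf8Bytes(of character: Character) -> [UInt8] {
    return Array(character.utf8)
}

// Hypothetical helper: writes every UTF-8 byte of a string to a file
// descriptor and reports whether all of them were written.
func writeUTF8(_ text: String, to fileDescriptor: Int32) -> Bool {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBytes { buffer in
        write(fileDescriptor, buffer.baseAddress, buffer.count) == buffer.count
    }
}
```

Under this sketch an ASCII character produces one byte, "é" produces two, and a single-code-point emoji produces four, so the byte count handed to write no longer equals the character count.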

Change LineNoise.getCursorXPosition(inputFile:outputFile:) to use
[Character] for its local character buffer.

Change LineNoise.handleCharacter(_:editState:) to use Character as its
first parameter type, and use .asciiValue to compare with the various
control characters in the ControlCharacters enum. This works because
.asciiValue returns nil if the character isn't ASCII, instead of
silently truncating to an ASCII character.
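
The .asciiValue comparison described above might look like the following sketch. ControlCode is a hypothetical stand-in for the project's ControlCharacters enum, whose actual cases and raw values are not shown here.

```swift
// Hypothetical subset of control codes; the project's ControlCharacters
// enum may define different cases.
enum ControlCode: UInt8 {
    case ctrlA = 1
    case ctrlC = 3
    case enter = 13
    case backspace = 127
}

// Returns a control code only when the character is ASCII and its value
// matches a known case; non-ASCII characters yield nil.
func controlCode(for character: Character) -> ControlCode? {
    guard let value = character.asciiValue else { return nil }
    return ControlCode(rawValue: value)
}
```

The nil result matters because truncation could misfire: U+0103 ("ă") truncated to its low 8 bits is 3, which would be mistaken for Ctrl-C, whereas .asciiValue returns nil for it.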

Change LineNoise.insertCharacter(_:editState:) to use Character as its
first parameter type.

Change unit tests to use .buffer and .text properties consistently. In
some cases, it makes sense to write strings to .text, and in other
cases array values to .buffer.