A C# Prefix Tree implementation with RTL language support and Unicode Normalization. The prefix tree is constructed via a builder and exposed to developers as an immutable data structure. The tree exposes only search semantics for runtime consumption.
Internally, the tree contains a node for each letter (UTF-32 codepoint) and termination conditions on words. For a small tree, this represents as:
Provides an implementation of a Prefix Tree API.
public List<string> FindMatches(string prefix)
This API takes a standard string and returns all strings that match the prefix.
In the tree pictured above, we would see the following results:
Search | Returned Words |
---|---|
A | A, AT, AM, ALL, ALT |
AL | ALL, ALT |
C | CAR |
-
Immutable. The PrefixTree, once constructed, is immutable. This allows consumption of the tree to be done without concerns around concurrency or async patterns. All mutating operations happen at object construction, allowing for the immutable pattern.
-
Unicode. The Prefix Tree is Unicode aware. All input strings are Unicode Form C normalized, and all comparisons are done with fully normalized strings. This is tested, both with and without normalization, using English, Russian and Arabic. Tests explicitly include Emoji's, UTF-16 Combining characters, and UTF-32 codepoints.
-
Right-to-Left tested (RTL). The code is both Left-To-Right and Right-To-Left tested. I've never written RTL code before, and was looking to see what was needed. Plumbing that all the way through was very interesting! To confirm code, normalization and comparisons are working properly the Arabic word file is the top 1000 frequent Arabic words and the RTL nature of the tests appears to be working.
Emoji's can be treated as words in many scenarios. A complex emoji may have mulitple UTF-32 Codepoints and is subject to composistion rules. For use in this Prefix Tree, Emoji's are treated as words, and may be encoded into the Prefix Tree and searched by prefix. A sample enoding of 3 Emoji's sharing compound prefixes is seen here:
The Unit Tests validate Emoji support through theory testing. An understanding of UTF-32, UTF-16 Surrogate Pairs, and grapheme's may be helpful to understand these tests. The mechanism by which Emoji's are composed is fascinating. The applicability of the Prefix Tree on emoji selection seems clear, although a purpose build Emoji version that understands the Codepoint atributes (selectors, sequences, modifiers, qualified & fully-qualified & non-qualifiied combinations) would be better.
- This code is inefficient. There's no caching, and memory use is unoptimized, and the obvious algorithmic changes around word frequency and common letter combinations is not examined. Nevertheless, I was able to load in the entire English Scrabble dictionary (it's in the tests) with no trouble.
- There are far more efficient representations of words than the "1 character per node" solution here (the ZIP / huffman coding algorithm comes immediately to mind). Given the public API surface and the immutable nature of the runtime, different implementations could be swapped out without impact.
The majority of Unit Tests are XUnit Theory Tests. These allow the same Assemble/Action/Assert code to be used with multiple data set. An example is here, where you can see the data being tested is encapsulated in the Unit Test attribute.
Here we see two different tests run using the same code logic. In the first test, a search pattern of "aal" is run, and expected to return an exact result set. In the 2nd example a pattern of "zymologi" is run and expected to return 5 results.
This approach lets me write data oriented tests to validate logic, which increases the quality of the tests.
This project works in Github Codespaces. To use Codespaces, be sure to install the .Net Core Test Explorer into the VS Code container.
Once the Test Exporer is installed, the tests can be viewed, run, and debugged all in the browser.