This is an interview code sample by Kinsey Roberts to meet the criteria set forth in challenge.md.
The problem sets forth the criteria for friendship in this social network of words. Two words are friends if they have a Levenshtein (edit) distance of one - that is, they differ by exactly one insertion, deletion, or substitution.
Once the social network has been built, the task is to find the size of the extended network of a given word in the network. There is no k-hop limit on the extended network, so "friends of friends of [...] of friends of friends" are included. This is simply the total number of nodes in the connected graph component containing the word.
My solution is optimized for the case of a single lookup, since that is what the problem required. If multiple words were to be looked up in the same dictionary, some additional optimizations could be leveraged. Instead of completing a new breadth-first search (BFS) from the query node each time, the connected components could be memoized. Then, when a query word comes in, first check to see if it's in any of the memoized connected components. If not, complete the BFS and memoize the resulting connected component.
The code compiles for both .NET Framework 4.0 and .NET Core 2.0. The latter should allow for cross-platform execution on Linux or macOS systems with .NET Core installed.
- If running Windows with .NET Framework 4.0 or greater installed:
cd wordbook.net.4.0\bin\Release
wordbook.exe
- If running Linux, or macOS with .NET Core 2.0 installed:
cd wordbook.net.core.2.0
dotnet run
Command line arguments can be used to customize the dictionary used, the word searched for, and whether to display the words in the extended network:
Note, the '/?' switch does not work with the dotnet run
command; it gets intercepted by .NET Core.
WORDBOOK [/?] [/V] [/D=X] [word]
[word] Specifies word to search for, default is LISTY
/V Verbose mode, lists words in extended network
/E Show elapsed times for subtasks
/D=X Specifies dictionary to search, default is D
D or F - dictionary.txt
H or 2 - half_dictionary.txt
Q or 4 - quarter_dictionary.txt
E or 8 - eight_dictionary.txt
S or T - very_small_test_dictionary.txt
/? Displays this help
My approach is two-step. First, scan the dictionary to build the graph of direct friendships. Then, to find the size of the social network for a given word, traverse the graph and count the extended friend network.
My initial approach involved pairwise comparison of all strings of appropriate length in order to build the direct friendship data. I worked through my initial approach on paper in pseudocode to make sure I had a good idea of how the algorithms worked.
I suspected from the beginning that using prefix tries would yield a (non-asymptotic) performance gain. I worked out a trie-based approach on paper using the very small test dictionary. My hunch was correct: switching to prefix tries allowed me to eliminate multiple strings at a time by ignoring subtrees with edit distance greater than one. Working the problem out on paper using the very small test dictionary also gave me the data I would need to test my code.
Three data structures are populated during the first pass:
string[] Words
- Contains all the words from the selected dictionary
- The zero-indexed line number in the file/array is used as the integer ID throughout the algorithms
- Prevents having to store multiple copies of the strings in memory
- Since dictionaries are alphabetical, this array will be used for binary search
- Only requires one copy of dictionary to be in memory, rather than creating a
Dictionary<string, int>
hashtable to lookup ID from word- The only lookup is to find the line number of the query word (e.g. "LISTY"), so even if the input weren't alphabetized, it would likely be worth the linear search through the file to avoid the memory overhead of the lookup hashtable
- Only requires one copy of dictionary to be in memory, rather than creating a
Dictionary<int, SingleEditDistanceTrie> TriesByLength
SingleEditDistanceTrie
is a modified prefix trieLength
must be set at creation and all strings must match the length- Provides specialized traversals to find a given words' peers (with Levenshtein distance = 1)
DictionarySet<int, int> Friendships
DictionarySet<K, V>
is a wrapper around aDictionary<K, HashSet<V>>
- Creates the inner HashSets as needed
- Handles null (and sanity) checks for an easier-to-read syntax
- Read dictionary file into
Words
- Iterate lines
- Identify direct friends by computing edit distances
- Search for single substitutions in
TriesByLength[line.Length]
- Search for single insertions in
TriesByLength[line.Length + 1]
- Search for single deletions in
TriesByLength[line.Length - 1]
- Search for single substitutions in
- When a direct friendship is found (two strings such that edit distance = 1), add to adjacency list
Friendships
(two entries, since friendships are reciprocal) - Add word to the appropriate trie in
TriesByLength
- Identify direct friends by computing edit distances
The second pass completes a breadth-first search in Friendships
beginning from the query word
- Using binary search, lookup the word ID (line number) of the word being searched for in
Words
- Add this word ID to a queue
- While the queue has items:
- Dequeue an item
- Add it to the extended network
- For each of its single-edit peers:
- If peer is not already in the extended network:
- Enqueue the peer to be processed
- If peer is not already in the extended network: