Vocabulary
Gathers and stores information about word usage for each user. Primary tokens are lemmas or any other language-agnostic canonical tokens. Where possible, a usage count is stored along with each token. This way, a unique linguistic profile can be collected for every author.
Several dictionaries are used, each grouping related information together:
Dictionary | Lookup key | Description |
---|---|---|
Dictionary | Lemma | Hash table of all lemmas found. Includes the usage count and a string list of the authors who have used the lemma. |
Wordbook | Author | Hash table of all authors found. Includes a string list of the lemmas each author has used, along with usage counts. |
Entities of this data model are described in a Google Protobuf file. Java classes are generated using Protobuf's `protoc` command. Because of circular references, nested model objects have been replaced with string values of the lookup keys from the companion table.
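To illustrate how string keys break the circular reference between the two tables, here is a minimal plain-Java sketch. It is not the generated Protobuf code: the class names `VocabularySketch`, `LemmaEntry`, and `AuthorEntry`, the `record` method, and the choice of a lemma-to-count map inside the Wordbook entry are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the generated classes; names are illustrative only. */
class VocabularySketch {

    /** Dictionary entry, looked up by lemma. Authors are stored as string keys, not objects. */
    static final class LemmaEntry {
        long usageCount;
        final List<String> authors = new ArrayList<>();
    }

    /** Wordbook entry, looked up by author. Lemmas are stored as string keys with a count. */
    static final class AuthorEntry {
        final Map<String, Long> lemmaCounts = new HashMap<>();
    }

    final Map<String, LemmaEntry> dictionary = new HashMap<>();
    final Map<String, AuthorEntry> wordbook = new HashMap<>();

    /** Records one use of a lemma by an author in both tables. */
    void record(String author, String lemma) {
        LemmaEntry lemmaEntry = dictionary.computeIfAbsent(lemma, k -> new LemmaEntry());
        lemmaEntry.usageCount++;
        if (!lemmaEntry.authors.contains(author)) {
            lemmaEntry.authors.add(author);
        }

        AuthorEntry authorEntry = wordbook.computeIfAbsent(author, k -> new AuthorEntry());
        authorEntry.lemmaCounts.merge(lemma, 1L, Long::sum);
    }
}
```

Because each entry only holds the key of its counterpart, neither type needs to embed the other, so the object graph contains no cycles.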
For quick lookup, data is stored in hash tables. Each lookup key is chosen to make the most common lookups as fast as possible. To keep storage size down, data is normalized as much as possible.
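The trade-off of normalization is that some questions need a lookup in both tables. The sketch below shows one such query, assuming the two tables are available as plain `java.util.Map` instances; the class `CompanionLookup`, the method `countsByAuthor`, and the map shapes are hypothetical and not taken from the project.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; the map shapes are assumptions, not the project's API. */
class CompanionLookup {

    /**
     * For one lemma, returns how often each author who used it did so.
     * The author keys come from the lemma-keyed table; the per-author
     * counts come from the author-keyed companion table.
     */
    static Map<String, Long> countsByAuthor(
            String lemma,
            Map<String, List<String>> authorsByLemma,
            Map<String, Map<String, Long>> lemmaCountsByAuthor) {
        Map<String, Long> result = new HashMap<>();
        for (String author : authorsByLemma.getOrDefault(lemma, List.of())) {
            Map<String, Long> counts = lemmaCountsByAuthor.getOrDefault(author, Map.of());
            result.put(author, counts.getOrDefault(lemma, 0L));
        }
        return result;
    }
}
```

Each step is a single hash lookup, so the cost of the query grows with the number of authors who used the lemma rather than with the size of either table.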