Exploring Project Gutenberg with natural language processing
This project digs through the Project Gutenberg 2010 DVD, assembling the information needed to make recommendations based on the texts. The information includes:
- Metadata about the texts in the corpus, including title, author(s), subject(s), etext numbers, URLs within the Project Gutenberg repository, etc. Another part of the metadata covers the multiple formats in which Project Gutenberg texts are stored: the MIME type and location of each file for each text are collected.
- Part-of-speech data about each English text in the corpus that is provided in text/plain format (chosen for initial ease of development). This data is collected by parsing the texts with the Stanford Part-Of-Speech Tagger and counting the uses of each part of speech in the Penn Treebank tag set, extended to include numbers and other punctuation. (Basically, whatever the tagger comes up with using the `english-left3words-distsim.tagger` model.) The part-of-speech counts for each text are divided by the total number of words in the text to produce a vector of POS usage. Style recommendations are based on the Euclidean distance between two such vectors.
- Most-common-noun data about each English text in the corpus. While tagging, the nouns are lemmatized (reduced to a base form), accumulated, and counted, and the 200 most common are retained for each text. This set of nouns is assumed to describe the topics contained in the text. Topic recommendations are based on the overlap between two texts' sets of common nouns, using the Jaccard distance metric in the web services. (This approach may change at some point; it consumes a great deal of memory and is slower than other suggested approaches.)
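The two distance metrics above can be sketched as follows. This is a minimal illustration under the description given here, not the project's actual code; the class name, vector values, and noun sets are made-up examples.

```java
import java.util.HashSet;
import java.util.Set;

public class DistanceSketch {
    // Euclidean distance between two normalized POS-frequency vectors
    // (each entry is uses-of-tag divided by total words in the text).
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Jaccard distance between two sets of most-common nouns:
    // 1 - |intersection| / |union|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : 1.0 - (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical POS-usage vectors for two texts (e.g. NN, VB, JJ shares).
        double[] text1 = {0.5, 0.3, 0.2};
        double[] text2 = {0.4, 0.4, 0.2};
        System.out.printf("style distance: %.4f%n", euclidean(text1, text2));

        // Hypothetical common-noun sets for two texts.
        Set<String> nouns1 = Set.of("whale", "ship", "sea");
        Set<String> nouns2 = Set.of("ship", "sea", "island");
        System.out.printf("topic distance: %.4f%n", jaccard(nouns1, nouns2));
    }
}
```

A smaller style or topic distance means the two texts are more similar, so recommendations are the nearest neighbors under these metrics.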
If you are looking for a starting point in the code, the best is probably the TagTodoList.java program, used to produce the raw datasets.
The Stanford POS Tagger is used because it seems to be more accurate than the other likely candidate, the Apache OpenNLP project. However, OpenNLP seems to be faster and to use less memory, and it is unclear whether the accuracy difference is important. As well, the Stanford POS Tagger likes to blow up on large texts.
In order to handle the last problem, each text is broken into smaller chunks on paragraph boundaries (or, at least, empty lines) and the results combined. Currently, this enables processing of the entire Project Gutenberg 2010 DVD in something like eight hours on my laptop.
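The chunking step can be sketched as below: split the text on empty lines and hand each chunk to the tagger separately, then combine the counts. This is a minimal illustration of the idea, not the project's actual code; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphChunker {
    // Split a text into chunks on blank lines (one or more consecutive
    // newlines with only whitespace between them), so each chunk can be
    // tagged on its own and the per-chunk POS counts summed afterward.
    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        for (String chunk : text.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph,\nwrapped.\n\n\nThird.";
        for (String chunk : chunks(text)) {
            System.out.println("---\n" + chunk);
        }
    }
}
```

Since POS counts are additive, summing them over chunks gives the same totals as tagging the whole text at once, while keeping each tagger invocation small.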
There are a number of scripts in the root directory, either for running the Java programs correctly or for intermediate processing of the data.
- `run-html-metadata`: given a directory of HTML Project Gutenberg metadata files (such as is on the DVD), create metadata and formats files.
- `run-tag-todolist`: process a to-do list, collecting the part-of-speech and noun data.
- `show-data`: using ashurbanipal's text-handling code, show the contents of a text file. (The Project Gutenberg header and footer should be stripped off.)
- `show-text`: using command-line zip utilities, show the contents of a text file.
- `style-lookup`: compute a recommendation list based on the part-of-speech data.
- `topic-lookup`: using the noun data, compute a recommendation list.
- `combined-lookup`: combine the results of both `style-lookup` and `topic-lookup` to produce an overall recommendation.
- `clean-data`: create a copy of the data set containing only complete information.
- `wordstore-to-bitset`: convert the noun (word store) data to a bitset-style data file for use by the ashurbanipal.web project.
- `pick-content-type.awk`: given a raw to-do list file, pick out the "best" entries as far as content type goes.
- `join-tabs`: call the `join` command-line utility with the magic necessary to operate on tab-separated files.
- ashurbanipal.web.ui: JavaScript client UI to the ashurbanipal.web interfaces.
- ashurbanipal.web: Java Servlet-based interface to Ashurbanipal data. This is obsolete in favor of:
- rust_ashurbanipal_web: Rust-based HTTP interface to Ashurbanipal data.
Tommy M. McGuire wrote this.
GNU GPLv2 or later.