Text corpus analysis in R. Heavy lifting is done by the Corpus C library.
This is an R text processing package that currently does very little, but it does enough to be useful. The package exports seven main functions:
- `read_ndjson()` for reading data in newline-delimited JSON format;
- `text_split()` for segmenting text into sentences or blocks of tokens;
- `text_count()` for counting sentences or tokens;
- `token_filter()` for specifying the process by which a text gets transformed into a token sequence (normalization, stemming, stop word removal, etc.);
- `tokens()` for segmenting text into tokens, each of which is an instance of a particular type;
- `term_counts()` for tabulating term occurrence frequencies (types or n-grams);
- `term_matrix()` for computing a term frequency matrix (also known as a "document-by-term matrix").
The package also provides three new data types:
- `corpus_json` for storing JSON-encoded data;
- `corpus_text` for storing text;
- `corpus_token_filter` for specifying token pre-processing decisions.
That's it. There's no "key words in context," no part-of-speech tagging, no topic models, and no word vectors. Some of these features are planned for future releases.
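As a rough sketch of how these pieces fit together (the file name `reviews.json` is hypothetical; the calls mirror the ones used in the benchmarks below):

library("corpus")

# read a newline-delimited JSON file with a "text" field (hypothetical file name)
data <- read_ndjson("reviews.json", mmap = TRUE)

# segment the text into sentences, or into tokens
sents <- text_split(data, "sentences")
toks <- tokens(data)

# describe the pre-processing in a token filter, then tabulate term counts
f <- token_filter(stemmer = "english", drop_punct = TRUE,
                  drop_number = TRUE, drop = stopwords("english"))
stats <- term_counts(data, f)

# document-by-term matrix with the same pre-processing
x <- term_matrix(data, f)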
Corpus is available on CRAN. To install the latest released version, run the following command in R:
install.packages("corpus")
To install the latest test version, run
install.packages("https://github.com/patperry/r-corpus/raw/master/dist/corpus_0.6.1.tar.gz", repos = NULL)
Note that the package uses a git submodule, so `install_git` and `install_github` won't work. See the section on Building from source below if you want to install the development version.
Here are some performance comparisons for basic operations. We compare against jsonlite version 1.3, stringi version 1.1.3, and quanteda version 0.9.9-50.
The first benchmark reads in a 286 MB data file, `yelp-review.json`. The raw data comes from the first round of the Yelp Dataset Challenge; it is stored in newline-delimited JSON format.
# Using the corpus library:
system.time({
data <- read_ndjson("yelp-review.json", mmap=TRUE)
})
## user system elapsed
## 3.356 0.119 3.484
pryr::mem_used()
## 170 MB
# Using the jsonlite library:
system.time({
data <- jsonlite::stream_in(file("yelp-review.json"), verbose=FALSE)
})
## user system elapsed
## 36.788 0.588 37.522
pryr::mem_used()
## 335 MB
We are about 10 times faster than jsonlite, and we use about 50% of the RAM. How are we reading a 286 MB file in only 170 MB? We memory-map the text data, letting the operating system move data from the file to RAM whenever necessary. We store the addresses of the text strings, but we do not load text values into RAM until they are needed.
(If we specify `mmap = FALSE`, the default, then we read the entire file into memory; in this case, `read_ndjson` is about 5 times faster than jsonlite.)
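For reference, the in-memory variant is the same call with the default setting:

# reads the entire file into RAM (mmap = FALSE is the default)
data <- read_ndjson("yelp-review.json", mmap = FALSE)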
The next benchmark segments each text into sentences, using the boundaries defined by Unicode Standard Annex #29, Section 5 with some special tailoring to handle abbreviations correctly and to not treat carriage returns and line feeds as paragraph separators.
# Using the corpus library:
system.time({
sents <- text_split(data, "sentences") # gets text from data$text
})
## user system elapsed
## 2.963 0.056 3.024
rm("data")
pryr::mem_used()
## 196 MB
# Using the stringi library (no special tailoring to the UAX 29 rules):
system.time({
sents <- stringi::stri_split_boundaries(data$text, type="sentence")
})
## user system elapsed
## 8.010 0.147 8.191
rm("data")
pryr::mem_used()
## 536 MB
This is not an apples-to-apples comparison, because corpus returns a single data frame with all of the sentences while stringi returns a list of character vectors. This difference in output formats gives corpus a speed advantage. A second difference is that in the corpus benchmark, the input text object is an array of memory-mapped values returned by the call to `read_ndjson`; in the stringi benchmark, the input text object is an in-memory array returned by jsonlite. The stringi package can read the values directly from RAM instead of from the hard drive. Overall, corpus is about 2.7 times faster than stringi.
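If we only need the number of sentences in each text rather than the sentences themselves, the `text_count()` function listed above is the natural tool. A minimal sketch, assuming it takes the same units argument as `text_split()`:

# count sentences per text (the "sentences" unit is an assumption, mirroring text_split)
nsent <- text_count(data, "sentences")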
The next benchmark performs a series of operations to transform each text into a sequence of normalized tokens. Formally, the value of a token is called a "type". First, we segment each text into word tokens using the boundaries defined by Unicode Standard Annex #29, Section 4, and we discard white-space-only tokens. Next, we normalize the text into Unicode NFKC normal form, and we case fold the text (for most scripts, replacing uppercase with lowercase). Then, we remove default ignorable characters like zero-width spaces. We also "quote fold," replacing Unicode quotes (single, double, apostrophe) with ASCII single quote (').
system.time({
toks <- tokens(data)
})
## user system elapsed
## 7.408 0.195 7.612
rm("data")
pryr::mem_used()
## 451 MB
With stringi, it is more efficient to do the word segmentation after the normalization. Further, the stringi package handles punctuation and white space differently, so it gives slightly different results.
system.time({
ntext <- stringi::stri_trans_nfkc_casefold(data$text)
toks <- stringi::stri_extract_all_words(ntext)
})
## user system elapsed
## 13.565 0.272 13.864
rm("data", "ntext")
pryr::mem_used()
## 400 MB
Both corpus and stringi produce the same type of output: lists of character arrays. As in the previous benchmark, corpus takes as input a memory-mapped text object, while stringi takes as input an in-memory character vector. We are 1.8 times faster than stringi. We use slightly more RAM, but only because we do not throw away punctuation-only tokens.
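If the punctuation tokens are unwanted, a `token_filter` with `drop_punct = TRUE` is the natural remedy; a sketch, assuming `tokens()` accepts a filter the way `term_counts()` and `term_matrix()` do:

# drop punctuation-only tokens (passing a filter to tokens() is an assumption)
f <- token_filter(drop_punct = TRUE)
toks <- tokens(data, f)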
We are now ready to compute some text statistics. In the next benchmark, we
tokenize the texts and compute the 5 most common terms. While tokenizing, we
normalize the text as in the previous benchmark, stem the words, drop
punctuation and numbers, and drop common function words ("stop" words). We
specify this tokenization behavior in a `token_filter` object that we pass to the `term_counts()` function.
system.time({
f <- token_filter(stemmer = "english", drop_punct = TRUE,
drop_number = TRUE, drop = stopwords("english"))
stats <- term_counts(data, f)
print(head(stats, n = 5))
})
## term count
## 1 place 238732
## 2 good 218029
## 3 food 192648
## 4 like 185951
## 5 get 170033
## user system elapsed
## 6.782 0.053 6.854
We compare against the quanteda package. This isn't a fair comparison, because quanteda doesn't have a way of getting the term counts directly. Instead, we compute a term matrix (a "document feature matrix" or "dfm" in quanteda lingo) and then use quanteda's `topfeatures()` function on this matrix.
library("quanteda")
system.time({
x <- quanteda::dfm(data$text, stem = TRUE, remove_punct = TRUE,
remove_numbers = TRUE,
remove = quanteda::stopwords("english"))
print(topfeatures(x, n = 5))
})
## place good food like get
## 238732 218029 192648 185951 170033
## user system elapsed
## 52.325 3.258 55.628
We get similar results, but there are some differences between the corpus and the quanteda tokenization rules (as of quanteda version 0.9.9-50):
- Quanteda has special rules for handling URLs, email addresses, and Twitter handles, but corpus does not.
- Corpus interprets tokens starting with digits like `5pm`, `70s`, and `110th` as numbers, but quanteda does not (see the sketch after this list).
- Quanteda removes stop words before stemming; corpus removes them after stemming and by default does not stem the words in the `drop` list. This leads to differences in the output. The word "others", for example, is not in the stop word list, but it stems to "other", a stop word. Corpus will drop all instances of "other" and "others". Quanteda will drop all instances of "other" but not of "others" (it replaces the latter with non-dropped "other" tokens).
- Quanteda has some other behavior that I do not understand.
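To make the number-handling difference concrete, here is a hedged sketch (it assumes `term_counts()` and `quanteda::dfm()` accept a plain character vector; output omitted):

txt <- "Open until 5pm on 110th Street; very 70s decor"
f <- token_filter(drop_number = TRUE)
term_counts(txt, f)                        # corpus treats "5pm", "110th", and "70s" as numbers and drops them
quanteda::dfm(txt, remove_numbers = TRUE)  # quanteda keeps them as word tokens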
Despite these differences, we agree with quanteda on the most frequent terms, and we are about 8 times faster.
For the final benchmark, we compute a term frequency matrix (also known as a "document-by-term" matrix). We perform the same pre-processing steps as before, but we only retain terms occurring at least five times in the corpus.
system.time({
# compute frequencies for terms appearing at least 5 times
f <- token_filter(stemmer = "english", drop_punct = TRUE,
drop_number = TRUE, drop = stopwords("english"))
stats <- term_counts(data, f, min = 5)
# compute the frequency matrix for the selected terms
terms <- stats$term
x <- term_matrix(data, f, select = terms)
})
## user system elapsed
## 16.464 1.067 17.648
Note that we first set the properties of the token filter, `f`, and then we build the term matrix. The token filter records all of the pre-processing decisions. If we need to, we can call `term_matrix` on a new text vector using the same filter and `select` argument to get a term matrix with the same columns as `x`.
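For example (the second file name is hypothetical; the pattern of reusing `f` and `terms` is the point):

# a hypothetical second batch of reviews
new_data <- read_ndjson("yelp-review-part2.json", mmap = TRUE)

# same filter and same selected terms, so the columns match those of x
x_new <- term_matrix(new_data, f, select = terms)
stopifnot(identical(colnames(x_new), colnames(x)))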
The quanteda package does not have an object analogous to a token filter. We construct the full term matrix, then drop columns for low-frequency terms:
system.time({
x <- quanteda::dfm(data$text, stem = TRUE, remove_punct = TRUE,
remove_numbers = TRUE,
remove = quanteda::stopwords("english"))
x <- quanteda::dfm_trim(x, min_count = 5)
})
## user system elapsed
## 62.047 4.319 66.576
Corpus is about 3.5 times faster than quanteda. If we really care about efficiency, we can widen the gap by performing the following sequence of operations instead:
system.time({
# specify the same pre-processing as before
f <- token_filter(stemmer = "english", drop_punct = TRUE,
drop_number = TRUE, drop = stopwords("english"))
# compute the full term matrix, then drop terms appearing fewer than 5 times
x <- term_matrix(data, f)
x <- x[, Matrix::colSums(x) >= 5, drop = FALSE]
})
## user system elapsed
## 9.967 1.183 11.203
The latter approach is almost 6 times faster than quanteda, but it's more error-prone and I do not recommend it. In the first approach, the pre-processing and feature selection decisions are made before we build the term matrix. (Also, as a side benefit, with the first approach the columns of `x` are ordered in descending order of overall term count.)
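A quick way to check that ordering on the matrix from the first approach (a sketch, assuming `x` is still the matrix built with `select = terms`):

# column sums should be non-increasing if columns are ordered by overall term count
counts <- Matrix::colSums(x)
all(diff(counts) <= 0)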
To install the latest development version of the package, run the following sequence of commands in R:
local({
dir <- tempfile()
cmd <- paste("git clone --recursive",
shQuote("https://github.com/patperry/r-corpus.git"),
shQuote(dir))
system(cmd)
devtools::install(dir)
# optional: run the tests
# must be in C locale for consistent string sorting
collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
devtools::test(dir)
Sys.setlocale("LC_COLLATE", collate) # restore the original locale
# optional: remove the temporary files
unlink(dir, recursive = TRUE)
})
Note that the package uses a git submodule, so you cannot use `devtools::install_github` to install it.
To obtain the source code, clone the repository and the submodules:
git clone --recursive https://github.com/patperry/r-corpus.git
The `--recursive` flag makes sure that the corpus library also gets cloned. If you forget the `--recursive` flag, you can clone the submodule manually with the following commands:
cd r-corpus
git submodule update --init
There are no other dependencies.
Corpus is released under the Apache License, Version 2.0.