Skip to content
/ biglda Public

Tools for fast LDA topic modelling for big corpora

Notifications You must be signed in to change notification settings

PolMine/biglda

Repository files navigation

Introducing the biglda package

License: GPL v3 R build status codecov

Package for fast LDA topic modelling for big corpora.

Topic modelling is an established and useful unsupervised learning technique for detecting topics in large collections of text. Among the many variants of topic modelling and innovations that are been tried, the classic ‘Latent Dirichlet Allocation’ (LDA) algorithm remains to be a powerful tool that yields good results. Yet computing LDA topic models requires significant system resources, particularly when corpora grow large. The biglda package addresses three specific issues that may be outright obstacles to using LDA topic models productively and in a state-of-the-art fashion.

The package addresses issues of computing time and RAM limitations at the different relevant stages when working with topic models:

  • Data preparation and import: Issues of performance and memory efficiency start to arise when preparing input data for a training algorithm.

  • Fitting topic models: Computing time for a single topic model may be extensive. It is good practice to evaluate a set of topic models for hyperparameter optimization. So you need a fast implementation of LDA topic modelling to be able to fit a set of topic models within reasonable time.

  • To cost of interfacing: R is very productive for developing your analysis, but the best implementations of LDA topic modelling are written in Python, Java and C. Transferring the data between programming languages is a potential bottleneck and can be quite slow.

  • Evaluating topic models: Issues performance and memory efficiency are relevant when computing indicators to assess topic models may be considerable. Computations require mathematical operations with really large matrices. Memory may be exhausted easily.

More plastically spoken: As you work with increasingly large data, fitting and evaluating topic models can bring hours and days you wait for a result to emerge, just to see that the process crashes because memory has been exhausted.

The biglda package addresses performance and memory efficiency issues for R users at all stages. The ParallelTopicModel class from the Machine Learning for Language Toolkit ‘mallet’ offers a fast implementation of an LDA topic model that yields good results. The purpose of the ‘biglda’ package is to offer a seamless and fast interface to the Java classes of ‘mallet’ so that the multicore implementation of the LDA algorithm can be used. The as_LDA() function can be used to map the mallet model on the LDA_Gibbs class from the widely used topicanalysis package.

Installation

‘biglda’ package

The biglda package is a GitHub-only package at this time. Install the stable version as follows.

remotes::install_github("PolMine/biglda")

Install the development version of the package as follows.

remotes::install_github("PolMine/biglda", ref = "dev")

Installing Mallet

The focus of the biglda package is to offer a seamless interface to Mallet. The biglda::mallet_install() function installs Mallet (v202108) in a directory within the package. The big disadvantage of this installation mechanism is that the Mallet installation is overwritten whenever you install a new version of biglda, and Mallet needs to be installed anew. This is why installing Mallet in the storage location used by your system for add-on software (such as “/opt” on Linux/macOS) is recommended.

We strongly recommend to install the latest version of Mallet (v202108 or higher). Among others, it includes a new $printDenseDocumentTopics() method of the ParallelTopicModel used by biglda::save_document_topics(), improving the efficiency of moving data from Java/Mallet to R significantly.

Note that v202108 is a “serialization-breaking release”. Instance files and binary models prepared using previous versions cannot be processed with this version and may have to be rebuilt.

On Linux and macOS machines, you may use the following lines of code for installing Mallet. Note that admin privileges (“sudo”) may be required.

mkdir /opt/mallet
cd /opt/mallet
wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
tar xzfv Mallet-202108-bin.tar.gz
rm Mallet-202108-bin.tar.gz

Using Mallet with the interface offered by biglda requires a working installation of the rJava package. Unfortunately, it happens indeed that installing rJava causes headaches. A solution that works very often is to reconfigure the R-Java intervace running the following command on the command line.

R CMD javareconf

It goes beyond this introduction to list all potential solutions for rJava problems The prospect that Mallet is a very good and efficient tool may serve as a motivation.

Installing Gensim

The equivalent to rJava as an interface to Java is the reticulate package as an interface for running Python commands from R. The reticulate package needs to be installed and loaded.

install.packages("reticulate")
library(reticulate)

The reticulate package typically runs Python within a Conda environment. The easiest way is to use install_miniconda() and to install Gensim as follows.

reticulate::install_miniconda()
reticulate::conda_install(packages = "gensim")

Using biglda

Loading biglda

Topic modelling with Mallet

Note that it is not possible to use the R packages “biglda” and “mallet” in parallel. If “mallet” is loaded, it will put its Java Archive on the classpath of the Java Virtual Machine (JVM), making the latest version of the ParallelTopicModel class inaccessible and causing errors that may be difficult to understand. Therefore, a warning will be issued when “biglda” may detect that the JAR included in the “mallet” package is in the classpath.

options(java.parameters = "-Xmx8g")
Sys.setenv(MALLET_DIR = "/opt/mallet/Mallet-202108")
library(biglda)
## Mallet version: v202108

## JVM memory allocated: 7.1 Gb

When using Mallet for topic modelling, the first step is to prepare a so-called “instance list” with information on the input documents. In this example, we use the AssociatedPress dataset included in the topicmodels package.

data("AssociatedPress", package = "topicmodels")
instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)

We then instantiate a BigTopicModel class object with hyperparameters. This class is a super class of the ParallelTopicModel class, the efficient worker of Mallet topic modelling. The superclass adds methods to this class that greatly speed up transferring data at the R-Java interface.

BTM <- BigTopicModel(n_topics = 100L, alpha_sum = 5.1, beta = 0.1)

This result (object BTM) is a Java object of class “jobjRef”. Methods of this (Java) class are accessible from R, and are used to configure the topic modelling engine. The first step is to add the instance list.

BTM$addInstances(instance_list)

We then set the number of iterations to 1000 and just use one thread core. Note that using multiple cores speeds up topic modelling - see performance assessment below.

BTM$setNumIterations(100L)
BTM$setNumThreads(1L)

Finally, we control the verbosity of the engine. By default, Mallet issues a status message every 10 iterations and reports the top words for topics every 100 iterations. For the purpose of rendering this document, we turn these progress reports off.

BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent

We now fit the topic model and report the time that has elapsed for fitting the model.

started <- Sys.time()
BTM$estimate()
Sys.time() - started
## Time difference of 4.485914 secs

The package includes optimized functionality for evaluating the topic model. Metrics are computed as follows.

lda <- as_LDA(BTM, verbose = FALSE)
N <- BTM$getDocLengthCounts()
data.frame(
  arun2010 = BigArun2010(beta = B(lda), gamma = G(lda), doclengths = N),
  cao2009 = BigCao2009(X = B(lda)),
  deveaud2014 = BigDeveaud2014(beta = B(lda))
)
##   arun2010    cao2009 deveaud2014
## 1 3630.858 0.06176298    1.480718

Gensim

To use Gensim for topic modelling, first load the reticulate package and import gensim into the Python session.

library(reticulate)
gensim <- reticulate::import("gensim")

The bottleneck for using Gensim for topic modelling from R is the interface between the R and the Python session. We assume that a DocumentTermMatrix has been prepared. The biglda package offers the functions dtm_as_bow() and dtm_as_dictionary() to prepare the input data structures required by Gensim topic modelling.

bow <- dtm_as_bow(AssociatedPress)
dict <- dtm_as_dictionary(AssociatedPress)

We use the multicore implementation of LDA topic modelling and use all cores but two.

threads <- parallel::detectCores() - 2L

This is the genuine Python part - running Gensim.

gensim_model <- gensim$models$ldamulticore$LdaModel(
# gensim_model <- gensim$models$ldamulticore$LdaMulticore(
  corpus = py$corpus,
  id2word = py$dictionary,
  num_topics = 100L,
  iterations = 100L,
  per_word_topics = FALSE
#  workers = as.integer(threads) # required to be integer
)

Use the as_LDA() method to get back data from Pyhton/Gensim to R.

lda <- as_LDA(gensim_model, dtm = AssociatedPress)
## ℹ Insantiate basic LDA_Gensim S4 class✔ Insantiate basic LDA_Gensim S4 class [125ms]
## ℹ assign dictionary (slot 'terms')✔ assign dictionary (slot 'terms') [17ms]
## ℹ assign word-topic distribution matrix (slot 'beta')✔ assign word-topic distribution matrix (slot 'beta') [16ms]
## ℹ assign topic distribution for each document (slot 'gamma')✔ assign topic distribution for each document (slot 'gamma') [12s]

Performance

To convey that biglda is fast, we fit topic models with Mallet with different numbers of cores and compare computing time with topic modelling with topicmodels::LDA(). The corpus used is AssociatedPress as represented by a DocumentTermMatrix included in the topicmodel package. Here, we just run 100 iterations for a limited set of k topics. In “real life”, you would have more iterations (typically 1000-2000), but this setup is sufficiently informative on performance.

So we start with the general settings.

k <- 100L
iterations <- 100L
n_cores_max <- parallel::detectCores() - 2L # use all cores but two
data("AssociatedPress", package = "topicmodels")
report <- list()

We then fit topic models with Mallet using a increasing number of cores.

library(biglda)

instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)

for (cores in 1L:n_cores_max){
  mallet_started <- Sys.time()

  BTM <- BigTopicModel(n_topics = k, alpha_sum = 5.1, beta = 0.1)
  BTM$addInstances(instance_list)
  BTM$setNumThreads(cores)
  BTM$setNumIterations(iterations)
  BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
  BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
  
  BTM$estimate()
  
  run <- sprintf("mallet_%d", cores)
  report[[as.character(cores)]] <- data.frame(
    tool = run,
    time = as.numeric(difftime(Sys.time(), mallet_started, units = "mins"))
  )
}

And we run the classic implementation of LDA topic topic modelling in the topicmodels package.

library(topicmodels)

topicmodels_started <- Sys.time()

lda_model <- LDA(
  AssociatedPress,
  k = k,
  method = "Gibbs",
  control = list(iter = iterations)
)

report[["topicmodels"]] <- data.frame(
  tool = "topicmodels",
  time = difftime(Sys.time(), topicmodels_started, units = "mins")
)

The following chart reports the elapsed time for LDA topic modelling.

These are the takeaways from the benchmarks:

  • Topic modelling with Mallet is fast and even more so with several cores. So for large data, biglda can make the difference, whether it takes a week or a day for fit the topic model.

  • Note that we did not an evaluation of stm::stm() (structural topic model) here, which has become a widely used state-of-the-art algorithm. The stm package offers very rich analytical possibilities and the ability to include document metadata goes significantly beyond classic LDA topic modelling. But stm::stm() is significantly slower than topicmodels::LDA(). The stm package does not address big data scenarios very well, this is the specialization of the biglda package.

  • The great reputation of Gensim notwithstanding, benchmarks for Gensim are significantly weaker than what we see for mallet and topicmodels. The reticulate interface may be the bottleneck: We do not yet know.

Metrics

The package implements the metrics for topic models of the ldatuning package (Griffiths2004 still missing) using RcppArmadillo. Based on the following code used for benchmarking, the following chart conveys that the biglda funtions BigArun2010(), BigCao2009() and BigDeveaud2014() are significantly faster than their counterparts in the ldatuning package.

library(ldatuning)

metrics_timing <- data.frame(
  pkg = c(rep("ldatuning", times = 3), rep("biglda", times = 3)),
  metric = c(rep(c("Arun2010", "CaoJuan2009", "Deveaud2009"), times = 2)),
  time = c(
    system.time(Arun2010(models = list(lda_model), dtm = AssociatedPress))[3],
    system.time(CaoJuan2009(models = list(lda_model)))[3],
    system.time(Deveaud2014(models = list(lda_model)))[3],
    system.time(BigArun2010(
      beta = B(lda_model),
      gamma = G(lda_model),
      doclengths = BTM$getDocLengthCounts()))[3],
    system.time(BigCao2009(B(lda_model)))[3],
    system.time(BigDeveaud2014(B(lda_model)))[3]
  )
)

ggplot(metrics_timing, aes(fill = pkg, y = time, x = metric)) + 
    geom_bar(position = "dodge", stat = "identity")

Note that apart from speed, the RcppArmadillo/C++ implementation is much more memory efficient. Computations of metrics that may fail with ldatuning functionality because memory is exhausted will often work fast and successfully with the biglda package.

R Markdown templates for topic modelling

The examples here and in the package documentation including vignettes need to be minimal working examples to keep computing time low when building the package. In real-life scenarios, we recommend to use scripts for topic modelling on large datasets and to execute the script from the command line. R Markdown are an ideal alternative to plain R scripts that can be run with a shell command such as ``.

  • Mallet
  • Gensim
  • Gensim2

Related work

The R package “mallet” has been around for a while and is the traditional interface to the “mallet” Java package for R users. However, the a RTopicModel class is used as an interface to the ParallelTopicModel class, hiding the method to define the number of cores to be used from the R used, thus limiting the potential to speed up the computation of a topic model using multiple threads. What is more, functionality for full access to the computed model is hidden, inhibiting the extraction of the information the mallet LDA topic model that is required to map the topic model on the LDA_Gibbs topic model.

Discussion

The biglda package is designed for heavy-duty work with large data sets. If you work with smaller data, other approaches and implementations that are established and that may be easier to use and to install (the rJava Blues!)
do a very good job indeed. But we are not just happy to also have biglda. Jobs with big data may just take too long or simply fail because memory is exhausted. We think that biglda pushes the frontier of being able of productively using topic modelling in the R realm.

(Alternative: Command line use of Mallet. Quarto.)

Useful sources

David Minmo, “The Details: Training and Validating Big Models on Big Data”: http://journalofdigitalhumanities.org/2-1/the-details-by-david-mimno/

About

Tools for fast LDA topic modelling for big corpora

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published