sentometrics
The R package sentometrics is a recently created toolbox for textual sentiment computation, aggregation of the sentiment into time series, and (sparse) regression-based prediction. It was developed during Google Summer of Code 2017 and released on CRAN in November 2017. It is unique in covering sentiment analysis, aggregation, and prediction based on the resulting sentiment time series in one integrated framework. A Google Summer of Code 2019 project would make it possible to (i) make the current code base more robust, and (ii) add a set of interesting extensions for more diverse sentiment analysis and validation.
One of the core extensions targeted is to further improve the speed and breadth of the textual sentiment computation. Computing the sentiment of many documents is a cumbersome task. Most text mining packages offer a way to calculate sentiment, but two problems persist: it is often not straightforward (multiple manipulations are required), and none of the implementations is optimized for speed in a lower-level programming language. The current implementation of the easy-to-use compute_sentiment() function in sentometrics is Rcpp-based, but needs to account for more complexity without losing too much of its execution speed. There is a considerable degree of flexibility in how to compute sentiment that is hardly accounted for in any R package. The current sentiment calculation in sentometrics takes a lexicon-based approach and considers several options for document-level aggregation. To go further, it needs better integration of linguistic complexities (such as valence shifters, n-grams, and sentences) into the lexicon-based approach, and a way to apply machine learning sentiment classifiers.
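To fix ideas, the current lexicon-based calculation can be run in a few lines (a minimal sketch using the built-in usnews corpus and word lists, with function names as in the CRAN version described in the vignette):

```r
library("sentometrics")

data("usnews", package = "sentometrics")                # built-in corpus of U.S. news articles
data("list_lexicons", package = "sentometrics")         # built-in example lexicons
data("list_valence_shifters", package = "sentometrics")

# combine two English lexicons with an English valence shifters list
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                           list_valence_shifters[["en"]])

# document-level sentiment scores, normalized by document length
sentiment <- compute_sentiment(usnews[["texts"]], lexicons, how = "proportional")
head(sentiment)
```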
The modelling component of the package must also be expanded. Sparse regression is useful in many setups, but not in all. An easy interface needs to be added for simple linear and logistic regression, and for a couple of well-known machine learning algorithms.
The current vignette (Ardia, Bluteau, Borms, and Boudt, 2019, "The R Package Sentometrics to Compute, Aggregate and Predict with Textual Sentiment") is available on SSRN: http://dx.doi.org/10.2139/ssrn.3067734.
The quanteda package is used as the text mining backend in sentometrics. It offers many great tools, but a clear-cut approach to textual sentiment analysis is not one of them.
The sentimentr package has to date implemented the most complex textual sentiment calculation in R, accounting for several linguistic intricacies. Its downside, however, is that it becomes slow for the number of documents sentometrics aims to address (tens to hundreds of thousands). The meanr package, on the other hand, provides a bare-bones implementation in C of a lexicon-based sentiment calculation, but offers no flexibility. The sentometrics package wants to combine the best of both worlds: complexity and speed; a rough timing comparison is sketched below.
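The trade-off can be made concrete along these lines (a sketch only; it assumes the sentimentr, meanr, and microbenchmark packages are installed, and uses their sentiment_by() and score() entry points; timings are indicative):

```r
library("sentometrics")
library("microbenchmark")

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")

texts <- rep(usnews[["texts"]][1:500], 20)  # replicate to roughly 10,000 documents
lexicons <- sento_lexicons(list_lexicons["LM_en"])

# time the three packages on the same input
microbenchmark(
  sentometrics = compute_sentiment(texts, lexicons, how = "counts"),
  sentimentr   = sentimentr::sentiment_by(texts),
  meanr        = meanr::score(texts),
  times = 3
)
```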
The task of the student is to improve the R package sentometrics by doing at least the following:
- Extend the corpus/texts input argument of the compute_sentiment() function to a number of other existing corpus formats in R, for example a compute_sentiment.SimpleCorpus() method for the SimpleCorpus object from the tm package (see the dispatch sketch after this list). This could partly be simplified by making the to_sentocorpus() function a method for quanteda, tm, and other corpus types (for example, wrapping around quanteda's conversion functions).
- Allow for an 'fn' argument to compute sentiment based on any separate function (see the 'fn' sketch after this list). This entails defining the input and output of the function allowed in 'fn', such that it complies with the current sentiment calculation engine(s). A typical use of 'fn' would be to pass a trained (binary) classifier.
- Expand the number of possible weighting schemes for within-document ('howWithin' in the ctr_agg() function) and across-document ('howDocs') sentiment aggregation: term frequency-inverse document frequency (tf-idf), inverse proportional, inverse U-shaped, Almon weights, and others (see the weighting sketch after this list). In general, allow for any weighting function that is properly defined (which is already possible for across-time aggregation, see 'howTime'). For the across-document aggregation, the function could take an interval index to aggregate at irregular frequencies.
- Expand the sentiment analysis from unigrams to n-grams in the lexicons, the valence shifters lists, and the tokenization scheme. This will make it possible to integrate more linguistic features into the lexicon-based approach, à la sentimentr. It should be fully implemented using Rcpp. This could include an extension to explicit sentence-based sentiment calculation.
- Allow for multi-language sentiment analysis using only one function call. This includes integrating language as metadata (typically in a sentocorpus object), directing texts to the lexicons and valence shifters of the matching language for calculation, and formatting the sentiment output such that the subsequent aggregation considers the language used.
- Write a measures_update() function that updates an existing sentomeasures object with a new corpus chunk.
- Write a generate_lexicon() function that can (i) train a lexicon based on annotated data or using a regression approach (see the SentimentAnalysis package), or (ii) expand an existing lexicon in an unsupervised way. The output should be a sentolexicons object immediately useful for sentiment calculation.
- Improve the parallelization setup for the sento_model() function (e.g., by not unnecessarily copying objects across multiple cluster workers).
- Either integrate topic modelling functionality explicitly into the add_features() function, or make it simpler to convert topic model output from, for instance, the stm package into corpus features.
- Write a sentocorpus_summarize() function (or use the summary() generic) that provides easily interpretable and plottable information on the spread of documents over time and over features, average feature statistics, and so on.
- Add relevant econometric tools that integrate textual sentiment time series as variables. This includes: non-sparse linear and logistic models (using the lm() and glm() functions, or faster Rcpp counterparts; possibly adding a 'do.sparse' argument to the ctr_model() control function), principal component analysis (PCA) on sentiment objects or sentiment measures outputs, and prediction based on ensemble machine learning algorithms, such as support vector machines or random forests. At the same time, the sento_model() function needs to be reconsidered to accommodate the modelling alternatives added.
- In addition to the programmatic implementations, the documentation and unit tests of the package need to evolve accordingly. A nice bonus would be to prepare a package tutorial on DataCamp (see https://www.datacamp.com/community/tutorials).
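For the corpus-format extension, one possible design is plain S3 dispatch that extracts the texts from a foreign corpus object and delegates to the existing engine. This is a hypothetical sketch: it assumes compute_sentiment() is turned into an S3 generic as part of the task, and the method below does not exist in the package.

```r
library("sentometrics")

# hypothetical S3 method: accept a tm SimpleCorpus by pulling out its texts
# and delegating to the existing sentiment calculation engine
compute_sentiment.SimpleCorpus <- function(x, lexicons, how = "proportional", ...) {
  texts <- vapply(seq_along(x), function(i) as.character(x[[i]]), character(1))
  compute_sentiment(texts, lexicons, how = how, ...)
}
```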
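The 'fn' interface could look as follows (purely a hypothetical sketch of the proposed argument; neither 'fn' nor the dummy classifier below exist in the package):

```r
# hypothetical user-supplied scoring function: takes a character vector of
# documents and returns one numeric sentiment value per document
my_classifier_fn <- function(texts) {
  # stand-in for a trained binary classifier; a dummy keyword rule here
  ifelse(grepl("good|great|positive", texts, ignore.case = TRUE), 1, -1)
}

# envisioned call once the 'fn' argument exists (not yet implemented):
# sentiment <- compute_sentiment(corpus, fn = my_classifier_fn)
```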
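For the weighting schemes, any properly defined scheme boils down to a vector of non-negative weights over token (or document) positions summing to one. A minimal sketch of an inverse U-shaped scheme (the function name and quadratic profile are illustrative choices):

```r
# illustrative inverse U-shaped scheme: tokens in the middle of a document
# get more weight than tokens at the beginning and the end
inv_u_weights <- function(n) {
  pos <- seq_len(n)
  w <- pos * (n + 1 - pos)  # quadratic profile peaking at the centre
  w / sum(w)                # normalize so the weights sum to one
}

round(inv_u_weights(10), 3)
```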
All implementations should account for proper parallelization and memory management, and be as fast as possible.
The R package sentometrics, as initiated during Google Summer of Code 2017, has the ambition to become the go-to package for textual sentiment calculation, aggregation, and modelling. Altogether, the proposed enhancements would further ensure the R community has access to a user-friendly, fast, and flexible package to gain informative sentiment insights from large collections of texts.
Samuel Borms, PhD Researcher, University of Neuchâtel and Vrije Universiteit Brussel
David Ardia, Assistant Professor of Finance, University of Neuchâtel and HEC Montreal
Keven Bluteau, PhD Researcher, University of Neuchâtel and Vrije Universiteit Brussel
Applicants have to be able to show that they have:
- Familiarity with the sentometrics R package;
- Familiarity with textual sentiment analysis, time series, and machine learning;
- Familiarity with packages quanteda and sentimentr;
- A good working knowledge of programming in R, Rcpp and C++;
- A good working knowledge of devtools for package development;
- Good coding standards (Google's C++ and R style guides).
Students should show their motivation by completing the tests below:
- Easy: Load the sentometrics package, create a corpus from a set of self-collected texts, add several features, construct a few sentiment measures, and plot the results (one possible workflow is sketched after this list).
- Medium: Take the built-in corpus from the sentometrics package and apply a topic model to it using one of the existing text mining packages. Report on the methodology chosen, the parameters you had to set, and the topics obtained.
- Hard: Create a lexicon based on a regression approach of your choice for a continuous target variable, use it to construct sentiment time series, and use these time series to predict the same target variable fully out-of-sample. Compare the performance against using a standard lexicon.
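For the easy test, the workflow could resemble the following (a sketch based on the vignette, using the built-in usnews corpus in place of self-collected texts; the keyword and lexicon choices are illustrative):

```r
library("sentometrics")

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

corpus <- sento_corpus(corpusdf = usnews)  # id, date, texts and feature columns

# add simple keyword-based features on top of the existing ones
corpus <- add_features(corpus,
                       keywords = list(election = "election",
                                       economy  = c("economy", "jobs")))

lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                           list_valence_shifters[["en"]])

# aggregate within documents, across documents, and across time (monthly)
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = "equal_weight", by = "month", lag = 3)
measures <- sento_measures(corpus, lexicons, ctr)

plot(measures, group = "lexicons")  # one sentiment time series per lexicon
```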
Students, please post a link to your test results here.