
Sentometrics: An integrated framework for text based multivariate time series modeling and forecasting

Sam Borms edited this page Nov 4, 2017 · 7 revisions

Background

Forecasting economic and financial variables based on the sentiment expressed in the qualitative information available in a large number of texts is a challenging endeavor. The first step, transforming the words in the texts into a meaningful sentiment qualification, is available in R in several packages. The sentometrics package is about the multivariate time series modeling needed to predict the variable of interest based on the possibly thousands of sentiment values that are potentially relevant. This requires aggregating the sentiment values cross-sectionally and through time in a manner that is optimized yet still transparent.

There is a voluminous literature on methods of sentiment qualification, sentiment aggregation and the use of sentiment to forecast. Researchers often present their contribution by focusing on only one component. In practice, however, text-based forecasting means deciphering the sentiment in the text that is informative about the variable to forecast, and this requires an integrated approach that can handle the large dimensionality of the possible sentiment values that can be constructed.

Ardia, Boudt and Bluteau have built a first version of the code to achieve this. The objective of the Sentometrics GSoC project is to transform this code into a user-friendly package called “Sentometrics”.

The Sentometrics package should provide an integrated framework for textual sentiment-based forecasting. It should combine various types of weighting approaches (across libraries, across words in a text, across time) with sparse regression approaches that map the large number of possible sentiment values into an optimal forecast, based on a variety of objective functions that can be chosen.

Related work

The package provides a unique framework to compute the many types of sentiment measures. The final sentiment measure is an aggregation of the many sentiment measures that exist. The R package sentometrics starts with the end results of the existing packages textmining, tm and tidytext, and is thus complementary to them. Other related packages are RSentiment and sentimentr.

Details of your coding project

Create the R package Sentometrics, which has the following types of functionality.

First, it can produce attributes for the texts in the corpus provided by the user. The attributes consist of:

  1. content in terms of keywords used in the text;
  2. date and time of publication;
  3. structure of the text; and
  4. the sentiment attached to each word in the text based on various word lists.
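
As a rough illustration of the fourth attribute, the sketch below looks up each word of a text in a word list. It is written in Python purely for illustration and is not part of Sentometrics or of any existing R package. The lexicon `TOY_LEXICON` and the function `word_sentiments` are hypothetical names introduced here; a real application would rely on an established word list.

```python
import re

# Hypothetical toy word list (word -> polarity score); not a real lexicon.
TOY_LEXICON = {
    "gain": 1.0, "strong": 1.0, "growth": 1.0,
    "loss": -1.0, "weak": -1.0, "crisis": -1.0,
}


def word_sentiments(text, lexicon):
    """Return (word, score) pairs for every token of `text` found in `lexicon`."""
    # Lowercase the text and split it into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(token, lexicon[token]) for token in tokens if token in lexicon]


# Example: attach word-level sentiment to one short text.
scores = word_sentiments("Strong growth despite the crisis", TOY_LEXICON)
```

Running the same lookup with several word lists would yield several word-level sentiment attributes for the same text.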

Second, it has functionalities to aggregate:

  1. the words into a sentiment per text;
  2. the sentiment across texts based on the text attributes (sentiment value, topic, ...);
  3. the various sentiment measures obtained using different methodologies for the same text; and
  4. the sentiment at different times (exponential weights, Almon weights, ...).

The aggregation can be set by the user (e.g. by specifying equal or exponential weights) and/or optimized to achieve the highest forecasting performance.
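
The aggregation steps listed above can be sketched as follows, in Python and for illustration only. All function names are hypothetical and do not reflect the package's interface. The sketch assumes simple averages within a text and across the texts of one date, and a normalized exponential weighting scheme through time; Almon weights and optimized weights are not shown.

```python
from collections import defaultdict


def text_sentiment(word_scores):
    """Aggregate word-level scores into one score per text (simple mean)."""
    return sum(word_scores) / len(word_scores) if word_scores else 0.0


def daily_sentiment(scored_texts):
    """Average per-text scores by date; `scored_texts` holds (date, score) pairs."""
    buckets = defaultdict(list)
    for date, score in scored_texts:
        buckets[date].append(score)
    return {date: sum(v) / len(v) for date, v in sorted(buckets.items())}


def equal_weights(n_lags):
    """Equal weights over lags 0..n_lags-1."""
    return [1.0 / n_lags] * n_lags


def exponential_weights(n_lags, alpha):
    """Weights proportional to (1 - alpha)**lag, normalized to sum to one."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    raw = [(1.0 - alpha) ** lag for lag in range(n_lags)]
    total = sum(raw)
    return [w / total for w in raw]


def aggregate_over_time(series, weights):
    """Weighted average of the current and past values (series is oldest first).

    The first len(weights) - 1 dates are dropped because their window is incomplete.
    """
    first = len(weights) - 1
    return [
        sum(w * series[t - lag] for lag, w in enumerate(weights))
        for t in range(first, len(series))
    ]


# Example: two texts on one date, one on another, then a 2-lag smoothing.
daily = daily_sentiment([("2017-01-01", 0.5), ("2017-01-02", 1.0), ("2017-01-02", 0.0)])
smoothed = aggregate_over_time(list(daily.values()), exponential_weights(2, 0.5))
```

Keeping the weights as an explicit list makes the aggregation transparent: the same `aggregate_over_time` call works with user-set weights or with weights chosen by an optimization routine.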

Third, it needs to have functionalities to evaluate the aggregation in terms of forecasting accuracy, and to provide clear output on the estimated model and its validation.

The package needs to be well documented in terms of test cases and a vignette.

Expected impact

The package Sentometrics has the ambition to become the go-to package for text-based forecasting. It is an integrated package that should make this analysis as simple as automated ARIMA forecasting using the Box-Jenkins approach: provide the corpus with time stamps and the variable to forecast, and the result is an optimized text-based forecasting model with predictions and an interpretable estimated forecasting equation.

Mentors

Kris Boudt, David Ardia and Keven Bluteau.

Tests

Applicants have to be able to show that they have:

  • A good working knowledge of programming in R, Rcpp and C++;
  • A good working knowledge of Roxygen for the documentation;
  • A good working knowledge of knitr/LaTeX for the vignette;
  • Familiarity with the construction of R packages;
  • Good coding standards (Google’s C++ and R style guides);
  • Familiarity with text sentiment analysis;
  • Familiarity with the packages textmining, tm and/or tidytext.

Students should show their motivation by following the points below:

  • Easy: Think of an application in which text-based sentiment can be useful to predict or nowcast a variable. Create a function to calculate the positive and negative sentiment of a set of texts with a time dimension using the tm package. Plot the positive and negative sentiment over time and analyze the time series properties (e.g. autocorrelation). Show that the sentiment obtained can be used to predict the variable of interest: consider univariate forecasting models first and then standard AR-X models, with X the sentiment variable. How does the calculation method affect the predictive power?
  • Medium: Take texts from various origins that are expected to have predictive power for the variable of interest. Compute multiple sentiment indicators and integrate them in your forecasting model. Consider more sophisticated ways of extracting sentiment. Depending on the method, you extract different sentiment signals about the variable. Include them in the forecasting model.
  • Hard: Take the case of a large number of potentially useful sentiment indicators obtained by aggregating across an even larger number of texts and use these to forecast the variable of interest. The forecasting model needs to take into account that the number of predictors is potentially larger than the time series dimension. I would stick to the linear setup, but non-linear extensions are to be considered in the package.
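
For the Hard task, one linear approach that copes with more predictors than observations is a penalized regression. The sketch below assumes the lasso, fitted by cyclic coordinate descent, as one possible choice of sparse regression; the proposal does not prescribe this method. It is written in Python for illustration only, the function names are hypothetical, and it assumes that the target and the predictors have been centered so that no intercept is needed.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    X is a list of n rows with p predictors each; p may exceed n.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date after each update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero predictor carries no information
            # Covariance of predictor j with the residual that excludes its own effect.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Example with more predictors (3) than observations (2).
X = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
y = [1.0, -1.0]
beta = lasso_coordinate_descent(X, y, lam=0.1)
```

A larger `lam` sets more coefficients exactly to zero, which keeps the estimated forecasting equation interpretable. In practice the penalty would be chosen by a criterion such as out-of-sample forecasting accuracy.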

Solutions of tests

Students, please post a link to your test results here.
