https://www.yelp.com/dataset_challenge (9th release)
Topic modeling has become widely used in natural language processing to uncover the latent variables that people take into account when they write texts. These variables are of particular interest because they have a social and economic effect on many businesses, as in the case of online reviews. However, recovering them is not an easy task: it amounts to an unsupervised clustering problem that can be designed and interpreted in many different ways. In this project we present new designs for analyzing online Yelp reviews, built on top of three topic models that, to the best of our knowledge, represent the state of the art.
All three models are derived from Latent Dirichlet Allocation (LDA); we consider these variations because they incorporate metadata and Bayesian network structures suited to specific data types. Most current studies focus heavily on theoretical aspects and attempt to beat state-of-the-art results as measured by perplexity and likelihood. Other works have an economic focus, but they do not consider all the current models that can be applied. Thus, using Yelp reviews as input with the goal of enhancing business performance, we analyze the state-of-the-art models, including features that capture the overall background and aspects of a business and its relationship with the client.
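For readers unfamiliar with the perplexity criterion mentioned above, the following is a minimal sketch of the standard definition: the exponential of the negative average log-likelihood per held-out token. The function name `perplexity` and the toy input are illustrative assumptions and are not taken from this project's scripts.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens.

    Hypothetical helper for illustration only; not part of main.py.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average log-likelihood per token under the model.
    avg_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    # Lower perplexity means the model is less "surprised" by the text.
    return math.exp(-avg_log_likelihood)


# Toy example: a model that assigns probability 0.25 to each of four tokens
# behaves like a uniform choice among four words.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

In the toy case the result is approximately 4, matching the intuition that perplexity measures the effective number of equally likely word choices.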
We encourage the reader to investigate further the knowledge bases that might help solve the problem of interpreting topic models.
- Clean the data: PreProcessingV2.py
- Run the models: main.py
- Analysis of results: analyze_results.py
- Results_randReviews_10topics.ipynb
- Results5k topics -databases.ipynb
Center for Data Science / Stern School of Business
New York University
May, 2017
Prof. Foster Provost