mariazm/DS_Seminar_Project_Public

Public Project for the Research Seminar in Data Science

Data

https://www.yelp.com/dataset_challenge (9th release)

Abstract

Topic modeling has become widely used in natural language processing to uncover the latent variables that people take into account when they write texts. These variables are of particular interest in settings such as online reviews, where they have social and economic effects on many businesses. However, this is not an easy task, since topic modeling is an unsupervised clustering problem that can be designed and interpreted in many different ways. In this project we present new designs for analyzing online Yelp reviews, built on top of three topic models that, to the best of our knowledge, represent the state of the art.

We consider these variations, all derived from Latent Dirichlet Allocation (LDA), because they incorporate metadata and Bayesian network structures for specific data types. Most current studies focus heavily on theoretical aspects and attempt to beat state-of-the-art results as measured by perplexity and likelihood. Other works address economic impact, but they do not consider all of the current models that can be applied. Thus, using Yelp reviews as input for improving business performance, we analyze the state-of-the-art models and include features that capture the overall background and aspects of a business and its relationship with the client.
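For readers unfamiliar with how perplexity relates to likelihood, the following is a minimal sketch, not the code used in this project. It assumes point estimates of the document-topic and topic-word distributions are already available; the function name `perplexity` and the toy matrices are hypothetical and chosen only for illustration.

```python
import math

def perplexity(doc_topic, topic_word, docs):
    """Perplexity of documents under a fitted topic model (illustrative only).

    doc_topic[d][k]  : P(topic k | document d)
    topic_word[k][w] : P(word w | topic k)
    docs[d]          : list of word ids in document d
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Marginalize over topics to get P(w | d).
            p_w = sum(doc_topic[d][k] * topic_word[k][w]
                      for k in range(len(topic_word)))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # Perplexity is the exponentiated negative mean log-likelihood per token.
    return math.exp(-log_likelihood / n_tokens)

# Toy data (made up): 2 topics, a vocabulary of 3 words, 2 documents.
topic_word = [[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]]
doc_topic = [[0.9, 0.1],
             [0.2, 0.8]]
docs = [[0, 0, 1], [2, 2, 0]]

print(perplexity(doc_topic, topic_word, docs))
```

Lower perplexity means the model assigns higher probability to the observed words. A model that spreads probability uniformly over the vocabulary has perplexity equal to the vocabulary size, which gives a simple baseline for comparison.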

We encourage the reader to investigate further the knowledge bases that might help solve the problem of interpreting topic models.

Python Scripts

  1. Clean the data: PreProcessingV2.py
  2. Run the models: main.py
  3. Analysis of results: analyze_results.py

Examples in Jupyter notebooks

  1. Results_randReviews_10topics.ipynb
  2. Results5k topics -databases.ipynb

Maria Leonor Zamora Maass.

Luisa Eugenia Quispe Ortiz.

Center for Data Science / Stern School of Business

New York University

May, 2017

Prof. Foster Provost
