Public Project for the Research Seminar in Data Science

Data

https://www.yelp.com/dataset_challenge (9th release)

Abstract

Topic modeling has become widely used in natural language processing to uncover the latent variables that people take into account when they write text. These variables are of particular interest in settings such as online reviews, where they have a social and economic effect on many businesses. However, inferring them is not an easy task: it is an unsupervised clustering problem that can be designed and interpreted in many different ways. In this project we present new designs for analyzing online Yelp reviews, built on top of three topic models that are, to the best of our knowledge, state of the art.

The models we consider are variations derived from Latent Dirichlet Allocation (LDA) that incorporate metadata and Bayesian network structures for specific data types. Most current studies focus heavily on theoretical aspects and attempt to beat state-of-the-art results measured by perplexity and likelihood. Other works address economic impact, but they do not take into consideration all the current models that can be applied. Thus, using Yelp reviews as an input for enhancing business performance, we analyze the state-of-the-art models and include features that capture the overall background and aspects of a business and its relationship with its clients.
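For readers unfamiliar with the perplexity metric mentioned above, here is a minimal sketch of how held-out perplexity is commonly computed for an LDA-style model. It is not taken from this repository's scripts: the names `token_prob`, `perplexity`, `theta` (document-topic proportions), and `phi` (topic-word distributions) are illustrative assumptions, and the toy numbers are made up for the example.

```python
import math

def token_prob(theta_d, phi, word_id):
    """Probability of one word in a document: sum over topics of
    p(topic | doc) * p(word | topic). Names are illustrative assumptions."""
    return sum(theta_d[k] * phi[k][word_id] for k in range(len(theta_d)))

def perplexity(docs, theta, phi):
    """Per-token perplexity: exp(-log-likelihood / number of tokens).
    `docs` is a list of documents, each a list of integer word ids."""
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for word_id in doc:
            log_likelihood += math.log(token_prob(theta[d], phi, word_id))
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)

# Toy example (made-up numbers): 2 topics over a 4-word vocabulary,
# both topics uniform, so every token has probability 0.25.
phi = [[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]]
theta = [[0.7, 0.3], [0.5, 0.5]]
docs = [[0, 1, 2], [3]]

print(perplexity(docs, theta, phi))
```

With uniform topics the result is approximately 4, the vocabulary size, which matches the intuition that perplexity measures how many words the model is effectively choosing among. Lower values indicate a better fit to held-out text.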

We encourage the reader to investigate further the knowledge bases that might help solve the problem of interpreting topic models.

Python Scripts

Clean the data: PreProcessingV2.py
Run the models: main.py
Analysis of results: analyze_results.py

Examples in Jupyter notebooks

Results_randReviews_10topics.ipynb
Results5k topics -databases.ipynb

Maria Leonor Zamora Maass.

Luisa Eugenia Quispe Ortiz.

Center for Data Science / Stern School of Business

New York University

May, 2017

Prof. Foster Provost

