LDA Topic Models as Supervised Classification Inputs

The Data

This experiment uses Yelp's publicly available restaurant review data (6,685,900 reviews across 192,609 businesses).

I've written instructions below for setting up your own DB and loading the Yelp data. However, the pre-processing output from preprocess.py was compact enough that I could include the rev_train.pkl and rev_test.pkl files in the /data directory. You can therefore skip the DB setup sections and use those files directly, then explore the LDA experiment using Notebooks #2 (train corpus) and #3 (test corpus).
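If you take the shortcut, the two pickles can be opened with pandas. This is a minimal sketch, assuming both files are pickled DataFrames and that you run it from the repo root. The helper name load_reviews is hypothetical, and the column layout is whatever preprocess.py produced.

```python
from pathlib import Path

import pandas as pd


def load_reviews(path):
    """Load one pre-processed review pickle and confirm it is a DataFrame."""
    # Only unpickle files you trust: pickle can run arbitrary code on load.
    obj = pd.read_pickle(Path(path))
    if not isinstance(obj, pd.DataFrame):
        # Assumption: preprocess.py wrote DataFrames; fail loudly if it did not.
        raise TypeError(f"{path} holds a {type(obj).__name__}, not a DataFrame")
    return obj


# Usage (paths assumed from the description above):
# rev_train = load_reviews("data/rev_train.pkl")
# rev_test = load_reviews("data/rev_test.pkl")
# print(rev_train.shape, rev_test.shape)
```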

Download JSON and Setup Mongo

The Yelp data is available as raw JSON here: https://www.yelp.com/dataset
If needed, install Mongo locally by following the instructions here: https://docs.mongodb.com/manual/tutorial/

Mongo Creation

You'll need a running mongod instance. Generally you can start one in the foreground via mongod --config /usr/local/etc/mongod.conf, but if you installed Mongo via Homebrew on Mac you can alternatively run it as a background service: brew services start mongodb (newer Homebrew installs name the service mongodb-community).
From the directory where you extracted the Yelp JSON, run the following commands:

mongoimport --db yelp --collection review review.json
mongoimport --db yelp --collection business business.json

Those are the only two portions of the Yelp dataset I used for this experiment.
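Once both imports finish, it can be worth checking that the collections hold the number of documents quoted in The Data. The sketch below is one way to do that, assuming pymongo is installed and mongod is listening on the default localhost:27017. The names check_import and EXPECTED_COUNTS are hypothetical, and the counts will differ if you downloaded a different release of the dataset.

```python
# Figures quoted in "The Data" above; adjust for other dataset releases.
EXPECTED_COUNTS = {"review": 6_685_900, "business": 192_609}


def check_import(db, expected=EXPECTED_COUNTS):
    """Return {collection: (actual, expected)} for each collection whose count differs."""
    mismatches = {}
    for name, want in expected.items():
        got = db[name].count_documents({})  # empty filter counts every document
        if got != want:
            mismatches[name] = (got, want)
    return mismatches


def main():
    # Imported here so check_import can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed default address
    problems = check_import(client["yelp"])
    print(problems or "both collections match the expected counts")
```

Call main() from a Python shell after the two mongoimport commands complete. An empty result means both collections match.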

Mongo Load Script

I've created two helper scripts that load data from Mongo into DataFrame objects and pickle them. If you want to follow along with the LDA experiments and fork your own, just run the following two scripts from a terminal. Assuming you're in the mongo-load directory of this repo:

python business_load.py
python reviews_load.py

That will create two pickled DataFrame (.pkl) files within the mongo-load directory, and we'll use those as the basis for the rest of the project. They're filtered to a specific subset of columns.
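For readers who want to write their own loader, the general pattern (query Mongo with a projection, build a DataFrame, pickle it) can be sketched as follows. This is not the code in business_load.py or reviews_load.py. The function name, the column choice, and the output filename are all illustrative assumptions, and pymongo is assumed to be installed.

```python
import pandas as pd


def collection_to_frame(collection, columns):
    """Pull only `columns` from a Mongo collection into a DataFrame."""
    # In a projection, 1 keeps a field; _id is returned unless excluded explicitly.
    projection = {col: 1 for col in columns}
    projection["_id"] = 0
    docs = collection.find({}, projection)
    # Materialising the cursor holds every document in memory at once.
    return pd.DataFrame(list(docs), columns=columns)


def main():
    # Imported here so collection_to_frame can be used without pymongo installed.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["yelp"]  # assumed default address
    # Illustrative subset of the Yelp review fields, not the author's selection.
    reviews = collection_to_frame(
        db["review"], ["review_id", "business_id", "stars", "text"]
    )
    reviews.to_pickle("reviews.pkl")  # hypothetical output filename
```

Projecting in the query rather than dropping columns afterwards keeps unused fields from ever leaving the database, which matters with several million reviews.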

Mongo Load Script - Alternate

In lieu of the two helper scripts above, you could likely skip Mongo and use the pandas read_json function to create the DataFrames directly from the JSON files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html.

However, I haven't tested that, and if you're an experienced Mongo user there's likely more flexibility in just running your own DB for this data.
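A rough sketch of the read_json route is below. It assumes the Yelp files are newline-delimited JSON (one object per line), which is why lines=True is passed, and it reads in chunks so the full review file does not have to be parsed in one go. The function name, chunk size, and column list are hypothetical, and this has not been run against the full dataset.

```python
import pandas as pd


def json_lines_to_frame(path, columns, chunksize=100_000):
    """Read a newline-delimited JSON file in chunks, keeping only `columns`."""
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames.
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    # Drop unwanted columns per chunk so they never accumulate in memory.
    parts = [chunk[columns] for chunk in reader]
    return pd.concat(parts, ignore_index=True)


# Usage (file name from the mongoimport step; columns are illustrative):
# reviews = json_lines_to_frame("review.json", ["review_id", "business_id", "stars", "text"])
# reviews.to_pickle("reviews.pkl")
```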
