doc_prevalence

Code to accompany the paper: Keith and O'Connor. "Uncertainty-aware generative models for inferring document class prevalence." EMNLP, 2018.

For more information, see the research project's website. If you would like to use our model in your own work, we suggest our freq-e software package.

If you use this code, please cite the paper:

@inproceedings{keith18uncertainty,
  author = {Keith, Katherine A. and O'Connor, Brendan},
  title = {Uncertainty-aware generative models for inferring document class prevalence},
  booktitle = {{EMNLP}},
  year = {2018}
}

Code to replicate experiments in the paper

Setting up the Yelp dataset

This does the following pre-processing (a minimal sketch follows the list):

  • tokenizes using NLTK
  • extracts unigrams
  • lowercases the text
  • separates reviews into classes (class=0 is stars <= 3, class=1 is stars > 3)
  • prunes the vocabulary: any token that appears in fewer than 5 training documents is dropped; the same pruned vocabulary is used for the LSTM, with pruned tokens mapped to an OOV symbol
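
A minimal sketch of this pre-processing, assuming raw reviews with "text" and "stars" fields (the field names, OOV symbol, and function names are illustrative assumptions, not taken from the repository's scripts):

# Hedged sketch of the pre-processing steps above.
from collections import Counter
from nltk.tokenize import word_tokenize  # requires NLTK's "punkt" tokenizer data

MIN_DOC_FREQ = 5
OOV = "<OOV>"

def tokenize(review_text):
    """Lowercased unigram counts, tokenized with NLTK."""
    return Counter(word_tokenize(review_text.lower()))

def build_vocab(train_tok_counts):
    """Keep tokens that appear in at least MIN_DOC_FREQ training docs."""
    doc_freq = Counter()
    for toks in train_tok_counts:
        doc_freq.update(toks.keys())
    return {w for w, df in doc_freq.items() if df >= MIN_DOC_FREQ}

def map_oov(toks, vocab):
    """Map pruned tokens to a single OOV symbol (as done for the LSTM input)."""
    mapped = Counter()
    for w, c in toks.items():
        mapped[w if w in vocab else OOV] += c
    return mapped

def label(stars):
    """class=0 for stars <= 3, class=1 for stars > 3."""
    return 0 if stars <= 3 else 1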

The output is train/train_all.json with dictionary keys (an illustrative record follows the list):

  • "review_id" : review id from the original doc

  • "class" : class=0 is stars <= 3, class=1 is stars > 3

  • "date" : original date

  • "toks" : dictionary of token counts
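
For illustration, one record might look like the following (the values are invented and the exact on-disk layout is not specified here):

# Illustrative record only; values are made up.
example_record = {
    "review_id": "abc123",            # review id from the original Yelp doc
    "class": 1,                       # 0: stars <= 3, 1: stars > 3
    "date": "2015-07-04",             # original review date
    "toks": {"great": 2, "food": 1},  # unigram token counts
}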

Train/test split

  • First, ignore all businesses that have fewer than 200 reviews
  • Choose 500 test and 500 dev businesses by weighted random sampling (a sketch follows the list)
    • weights are proportional to the number of reviews a business has
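
A hedged sketch of this business-level split, assuming a dict mapping business id to review count (the names and the use of numpy are illustrative, not the repository's actual code):

# Sketch of the dev/test business selection.
import numpy as np

def split_businesses(review_counts, n_test=500, n_dev=500, seed=0):
    """review_counts: dict business_id -> number of reviews."""
    rng = np.random.default_rng(seed)
    # Ignore businesses with fewer than 200 reviews.
    eligible = [b for b, n in review_counts.items() if n >= 200]
    # Weighted sampling without replacement, weights proportional to review count.
    weights = np.array([review_counts[b] for b in eligible], dtype=float)
    weights /= weights.sum()
    chosen = rng.choice(eligible, size=n_test + n_dev, replace=False, p=weights)
    return set(chosen[:n_test]), set(chosen[n_test:])  # test, dev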

10-fold cross-validation; 10 re-samplings (Table 1, Fig 3)

#(1) ASSEMBLE DATA 
yelp_data/go_maketrain_trials.sh #natural setting 
yelp_data/go_maketrain_trials_prop.sh #synthetic setting proportion of 0.1

#(2) TRAIN MODELS 
code/train_all/go_train_master.sh

#(3) EVALUATE MODELS ON THE TEST SET
code/eval/baselines.py 
code/eval/go_eval_master.sh 

#(4) ANALYZE RESULTS 
code/eval/analysis.py  
graphs/final_mae_plots.ipynb
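
The analysis step summarizes results as mean absolute error (MAE) between the true and estimated class prevalences over test groups; a minimal sketch of that metric, with illustrative names rather than the repo's actual functions:

import numpy as np

def prevalence_mae(true_props, est_props):
    """Mean absolute error between true and estimated class-1 prevalences,
    averaged over test groups (e.g. businesses)."""
    true_props = np.asarray(true_props, dtype=float)
    est_props = np.asarray(est_props, dtype=float)
    return float(np.mean(np.abs(true_props - est_props)))

# Example: prevalence_mae([0.2, 0.5, 0.9], [0.25, 0.4, 0.85]) -> ~0.067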

Training proportion experiments (10 resamplings of training data) (Fig 5a)

#(1) ASSEMBLE DATA 
yelp_data/go_train_prop_all.sh

#(2) TRAIN MODELS 
code/train_all/go_training_prop_all.sh

#(3) EVAL MODELS ON THE TEST SET 
code/eval/go_eval_train_prop_all.sh

#(4) ANALYZE RESULTS
code/eval/analysis_train_prop.py
graphs/train_prop10trains.ipynb
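
These runs re-sample the training set so that its class-1 proportion matches a target value; a hedged sketch of one such re-sampling (the function name and interface are assumptions, not the repository's scripts):

import numpy as np

def resample_to_proportion(labels, target_prop, n_total, seed=0):
    """Return indices of a training subsample whose class-1 proportion is
    (approximately) target_prop. Assumes enough docs of each class exist."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = int(round(target_prop * n_total))
    n_neg = n_total - n_pos
    idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                          rng.choice(neg, n_neg, replace=False)])
    rng.shuffle(idx)
    return idx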

Training size experiments (10 resamplings of training data) (Fig 5b)

#(1) ASSEMBLE DATA 
yelp_data/go_trainsize_trials.sh

#(2) TRAIN
code/train_all/go_trainsize_trials.sh

#(3) EVAL
code/eval/go_trainsize_all_trials.sh

#(4) ANALYSIS
code/eval/analysis_train_size.py
graphs/trainsize.ipynb 

ReadMe experiments

  • First, you will need to download the ReadMe R package
  • Make sure you manually change the home path in code/readme_our_experiments/run_readme.R
  • Then run:
code/readme_our_experiments/go_readme.sh
