chmullig's Kaggle Essay Code

Code for the Kaggle in-class competition http://inclass.kaggle.com/c/columbia-university-introduction-to-data-science-fall-2012, written as part of the class http://columbiadatascience.wordpress.com.

Implements a few models using R and Python.

Requirements:

Python (only tested with 2.7)
- nltk
- scikit-learn
- pandas
- PyEnchant
R
- randomForest
- gbm
- plyr
- MASS
- ggplot2 (soft requirement)
- reshape (soft requirement)

Features Created/Used

number of characters
number of sentences
number of words
number of syllables
number of distinct words
words / sentences
characters / words
syllables / words
spell_mistakes
correctly spelled words / total words
flag for starting with dear
flag if has semicolon
flag if has exclamation point
flag if has question mark
number of double quotes
flag if has at least 2 double quotes
flag indicating whether proper quote punctuation is more common or not (1 if ." is more common than "., -1 if less common, 0 if tied/neither)
counts of parts of speech (from NLTK)
rollups for number of nouns, verbs, adjectives, adverbs, superlatives
flag for ending with a preposition
counts of the NER words (e.g., number of times they used @MONEY)
TF-IDF word and bigram frequencies, reduced with PCA to 50 components
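
A few of the simpler features in this list can be sketched in plain Python. This is only an illustration, not the code in basic_tags.py: the function name `surface_features`, the dictionary keys, and the regex-based word and sentence splitting are all assumptions (the real scripts may tokenize with NLTK instead), and it is written for Python 3 even though the repository was only tested with 2.7.

```python
import re


def surface_features(essay):
    """Compute a handful of surface features for one essay (hypothetical helper)."""
    # Naive tokenization (an assumption): words are runs of letters or
    # apostrophes, and sentences end at '.', '!' or '?'.
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    n_sentences = len(sentences)
    n_quotes = essay.count('"')
    # Quote punctuation: ." (period inside the quote) versus ". (period outside).
    inside = essay.count('."')
    outside = essay.count('".')
    return {
        "n_chars": len(essay),
        "n_words": n_words,
        "n_sentences": n_sentences,
        "n_distinct_words": len({w.lower() for w in words}),
        "words_per_sentence": n_words / n_sentences if n_sentences else 0.0,
        "chars_per_word": len(essay) / n_words if n_words else 0.0,
        "starts_with_dear": int(essay.lstrip().lower().startswith("dear")),
        "has_semicolon": int(";" in essay),
        "has_exclamation": int("!" in essay),
        "has_question": int("?" in essay),
        "n_double_quotes": n_quotes,
        "has_two_quotes": int(n_quotes >= 2),
        # 1 if ." is more common than "., -1 if less common, 0 if tied/neither.
        "quote_punct": (inside > outside) - (inside < outside),
    }
```

Calling `surface_features(text)` on each essay and collecting the dictionaries gives one row of features per essay. Syllable, spelling, part-of-speech and NER features need extra data or libraries and are left out of this sketch.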

Models Used

The first model was an OLS linear regression using a subset of the variables. I trained 5 models, one per essay set, with identical formulas. It performed shockingly well.
The second model was a random forest regression, again with 5 models, using more variables.
The third model was GBM, using the same formula as the random forest and again 5 models.
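
The "one model per essay set" pattern can be sketched as below. The actual models live in the R script basicModel.R; this Python version is only an illustration of the grouping idea, using a single-predictor least-squares fit in place of the full formulas. The names `fit_simple_ols`, `fit_per_set` and `predict`, and the `(essay_set, feature, grade)` row layout, are assumptions made for the example.

```python
from collections import defaultdict


def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # A constant predictor carries no slope information; fall back to the mean.
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def fit_per_set(rows):
    """Fit one model per essay set; rows are (essay_set, feature, grade) tuples."""
    grouped = defaultdict(list)
    for essay_set, x, y in rows:
        grouped[essay_set].append((x, y))
    models = {}
    for essay_set, pairs in grouped.items():
        xs, ys = zip(*pairs)
        models[essay_set] = fit_simple_ols(xs, ys)
    return models


def predict(models, essay_set, x):
    """Predict with the model that was fitted for this essay's set."""
    intercept, slope = models[essay_set]
    return intercept + slope * x
```

Every set gets the same model form but its own coefficients, so a feature such as word count can be weighted differently for each prompt.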

I also tried random forest and GBM with a single model that used the essay set as a predictor, but it didn't seem to perform as well.

Basic workflow in buildModel.sh.

Run basic_tags.py on test.tsv and train.tsv. This creates almost all the features/tags/variables we need to use.
Run `add_tfidf.py train_tagged.csv test_tagged.csv 50` to create tf-idf word vectors for each essay, and PCA down to a more usable 50 variables.
Run the R script basicModel.R to fit the models and generate predictions.
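
For readers unfamiliar with the tf-idf step, here is a minimal pure-Python sketch of weighting words and bigrams. It is not the code in add_tfidf.py: the function names `terms` and `tfidf` are made up for the example, the weighting is the plain tf × log(N / df) form (the script's exact weighting may differ), and the PCA reduction to 50 variables is not shown.

```python
import math
from collections import Counter


def terms(text):
    """Lowercased whitespace-split words plus adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(terms(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            t: (k / total) * math.log(n_docs / doc_freq[t])
            for t, k in c.items()
        })
    return vectors
```

Terms that appear in every essay get a weight of zero, while terms concentrated in a few essays get larger weights. The resulting sparse vectors are what a dimensionality reduction such as PCA would then compress.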
