This repository comprises the primary files used to gather and analyze 3,417 Statement of Purpose/Personal Statement/Letter of Intent style documents.
The documents were scraped from a public forum in which a prospective applicant (referred to as 'OP' throughout the analysis) posts their document for other users to review. Nearly all of the statements received at least one response, and some received many more. In total, 11,985 individual text documents were analyzed: OP posts, OP self-responses, and critiques from other users.
- prelim_analysis.ipynb - The initial, general-purpose notebook used in the analysis. It features basic EDA, sentiment analysis, FastText and Doc2vec embeddings, and LDA, NMF, and LSA topic models (a sketch of the topic-modeling step follows this list). [nbviewer]
- preprocessing.ipynb - Demonstrates the multiple approaches taken to preprocess the text into the forms that particular models require (a toy illustration follows this list). [nbviewer]
- kpe_summarization.ipynb - Applies several forms of key-phrase extraction and text summarization to user feedback, then uses basic heuristics to find commonalities across the documents (one simple heuristic is sketched below). [nbviewer]
- exploration.ipynb - Supplemental exploratory data analysis that uses visualization to answer questions tangential to the main motivations of the analysis. [nbviewer]
- lang_models.ipynb - Builds ULMFiT language models on subsets of the corpus (see the fine-tuning sketch below). [nbviewer]
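As a rough illustration of the topic-modeling step in prelim_analysis.ipynb, the sketch below fits an LDA model with scikit-learn. The corpus, vectorizer settings, and topic count are placeholders rather than the notebook's actual choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; the real input is the preprocessed statement texts.
docs = [
    "my research experience in computational biology shaped my goals",
    "i hope to pursue a phd focused on natural language processing",
]

# LDA is fit on raw term counts rather than tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {idx}: {', '.join(top)}")
```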
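The split that motivates preprocessing.ipynb can be shown with a toy example: bag-of-words models (LDA, NMF, LSA) want aggressively normalized tokens, while sequence models (ULMFiT) want the text left largely intact. The cleanup rules here are invented for illustration.

```python
import re

# Placeholder document; the real inputs are the scraped posts.
raw = "I REALLY want this!!!  Visit <b>my</b> lab page."

# Variant 1: aggressive normalization for bag-of-words models.
no_tags = re.sub(r"<[^>]+>", " ", raw)                       # drop HTML tags
bow_tokens = re.sub(r"[^a-z\s]", " ", no_tags.lower()).split()

# Variant 2: light-touch cleanup for sequence models, which rely on
# case and punctuation as signal.
seq_ready = re.sub(r"\s+", " ", raw).strip()

print(bow_tokens)   # ['i', 'really', 'want', ...]
print(seq_ready)
```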
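Similarly, one simple key-phrase heuristic of the kind kpe_summarization.ipynb applies to critiques is a tf-idf ranking of n-grams; the notebook uses several methods, and this placeholder example stands in for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder critiques; the real input is the scraped feedback posts.
feedback = [
    "your opening paragraph is far too long and too general",
    "cut the childhood story and focus on research fit instead",
]

vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
scores = vec.fit_transform(feedback)
terms = vec.get_feature_names_out()

# Report the highest-scoring phrase in each critique.
for i, row in enumerate(scores):
    best = row.toarray().ravel().argmax()
    print(f"critique {i}: {terms[best]}")
```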
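For lang_models.ipynb, the sketch below shows a ULMFiT-style fine-tune using the current fastai v2 API; the notebook was written against an older fast.ai interface (hence FaiText.py below), so treat this as an approximation rather than the notebook's code.

```python
import pandas as pd
from fastai.text.all import (
    AWD_LSTM, Perplexity, TextDataLoaders, accuracy, language_model_learner,
)

# Placeholder frame; a corpus of realistic size is needed for training to run.
df = pd.DataFrame({"text": ["statement draft one ...", "statement draft two ..."]})

# Build language-model DataLoaders and fine-tune a pretrained AWD-LSTM.
dls = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2)   # train the new head first
learn.unfreeze()
learn.fit_one_cycle(2, 2e-3)   # then fine-tune the whole model
```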
Note: The nbextension Freeze was used liberally throughout each notebook; without it, the notebooks will likely not run correctly from top to bottom.
- FaiText.py - A minimally modified version of fast.ai’s text transforms.
- HTMLutils.py - Custom logic for parsing HTML tags into tokens, meant to be used in concert with FaiText (a rough sketch follows this list).
- grad_scrape/../SopSpider.py - The main scraping mechanism, a Scrapy spider (the sketch after this list shows how the three Scrapy pieces fit together).
- grad_scrape/../items.py - Defines the item fields so that data can be gathered cleanly during the crawl.
- grad_scrape/../pipelines.py - Instructs the spider to write scraped results to a CSV or JSON file rather than to the console.
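To make the HTMLutils.py idea concrete, here is a rough sketch of tag-to-token substitution. The token names (xx-prefixed in the fast.ai style) and the tag set are hypothetical, not the file's actual mapping.

```python
import re

# Hypothetical tag-to-token mapping; the real rules live in HTMLutils.py.
TAG_TOKENS = {
    "b": "xxbold",
    "i": "xxital",
    "blockquote": "xxquote",
}

def html_to_tokens(text: str) -> str:
    """Replace known HTML tags with placeholder tokens; drop the rest."""
    def sub(match):
        tag = match.group(1).lower()
        return f" {TAG_TOKENS[tag]} " if tag in TAG_TOKENS else " "
    # Matches opening and closing tags such as <b> or </blockquote>.
    return re.sub(r"</?\s*([a-zA-Z0-9]+)[^>]*>", sub, text)

print(html_to_tokens("I <b>really</b> want to attend"))
# -> "I  xxbold really xxbold  want to attend"
```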
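And a condensed sketch of how the three Scrapy pieces cooperate. Every class name, field, CSS selector, URL, and filename below is a hypothetical stand-in; only the division of labor mirrors the files above.

```python
import scrapy
from scrapy.exporters import CsvItemExporter

class SopItem(scrapy.Item):          # items.py: one Field per gathered datum
    author = scrapy.Field()
    body = scrapy.Field()
    is_op = scrapy.Field()

class SopSpider(scrapy.Spider):      # SopSpider.py: the crawler itself
    name = "sop"
    start_urls = ["https://forum.example/sop-reviews"]  # placeholder URL

    def parse(self, response):
        for post in response.css("div.post"):           # invented selectors
            yield SopItem(
                author=post.css(".author::text").get(),
                body=" ".join(post.css(".content ::text").getall()),
                is_op=bool(post.css(".op-badge").get()),
            )

class CsvExportPipeline:             # pipelines.py: write to file, not console
    def open_spider(self, spider):
        self.file = open("scraped_posts.csv", "wb")     # exporters take bytes
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
```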