This repository comprises the primary files used to gather and analyze 3,417 Statement of Purpose/Personal Statement/Letter of Intent style documents.
The documents were scraped from a public forum in which a prospective applicant (referred to as 'OP' throughout the analysis) posts their document for other users to review. Nearly all of the statements received at least one response, and some received many more. In total, 11,985 individual text documents were analyzed: OP posts, OP self-responses, and critiques from other users.
- prelim_analysis.ipynb - The initial, general-purpose notebook used in the analysis. It features basic EDA, sentiment analysis, FastText and Doc2vec embeddings, and LDA, NMF, and LSA topic models (a sketch of the topic-modeling step follows this list). [nbviewer]
- preprocessing.ipynb - Demonstrates the multiple approaches taken to preprocess the text into the forms that particular models require (a toy illustration follows this list). [nbviewer]
- kpe_summarization.ipynb - Applies several forms of key-phrase extraction and text summarization to user feedback, then uses basic heuristics to find commonalities across the documents (one simple heuristic is sketched below). [nbviewer]
- exploration.ipynb - Supplemental exploratory data analysis that uses visualization to answer questions tangential to the main motivations of the analysis. [nbviewer]
- lang_models.ipynb - Builds ULMFiT language models on subsets of the corpus (see the fine-tuning sketch below). [nbviewer]
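As a rough illustration of the topic-modeling step in prelim_analysis.ipynb, the sketch below fits an LDA model with scikit-learn. The corpus, vectorizer settings, and topic count are placeholders rather than the notebook's actual choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; the real input is the preprocessed statement texts.
docs = [
    "my research experience in computational biology shaped my goals",
    "i hope to pursue a phd focused on natural language processing",
]

# LDA is fit on raw term counts rather than tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {idx}: {', '.join(top)}")
```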
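The split that motivates preprocessing.ipynb can be shown with a toy example: bag-of-words models (LDA, NMF, LSA) want aggressively normalized tokens, while sequence models (ULMFiT) want the text left largely intact. The cleanup rules here are invented for illustration.

```python
import re

# Placeholder document; the real inputs are the scraped posts.
raw = "I REALLY want this!!!  Visit <b>my</b> lab page."

# Variant 1: aggressive normalization for bag-of-words models.
no_tags = re.sub(r"<[^>]+>", " ", raw)                       # drop HTML tags
bow_tokens = re.sub(r"[^a-z\s]", " ", no_tags.lower()).split()

# Variant 2: light-touch cleanup for sequence models, which rely on
# case and punctuation as signal.
seq_ready = re.sub(r"\s+", " ", raw).strip()

print(bow_tokens)   # ['i', 'really', 'want', ...]
print(seq_ready)
```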
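Similarly, one simple key-phrase heuristic of the kind kpe_summarization.ipynb applies to critiques is a tf-idf ranking of n-grams; the notebook uses several methods, and this placeholder example stands in for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder critiques; the real input is the scraped feedback posts.
feedback = [
    "your opening paragraph is far too long and too general",
    "cut the childhood story and focus on research fit instead",
]

vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
scores = vec.fit_transform(feedback)
terms = vec.get_feature_names_out()

# Report the highest-scoring phrase in each critique.
for i, row in enumerate(scores):
    best = row.toarray().ravel().argmax()
    print(f"critique {i}: {terms[best]}")
```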
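For lang_models.ipynb, the sketch below shows a ULMFiT-style fine-tune using the current fastai v2 API; the notebook was written against an older fast.ai interface (hence FaiText.py below), so treat this as an approximation rather than the notebook's code.

```python
import pandas as pd
from fastai.text.all import (
    AWD_LSTM, Perplexity, TextDataLoaders, accuracy, language_model_learner,
)

# Placeholder frame; a corpus of realistic size is needed for training to run.
df = pd.DataFrame({"text": ["statement draft one ...", "statement draft two ..."]})

# Build language-model DataLoaders and fine-tune a pretrained AWD-LSTM.
dls = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2)   # train the new head first
learn.unfreeze()
learn.fit_one_cycle(2, 2e-3)   # then fine-tune the whole model
```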
Note: The nbextension Freeze was used liberally throughout each notebook; without it, the notebooks will likely not run correctly from top to bottom.
- FaiText.py - A minimally modified version of fast.ai’s text transforms.
- HTMLutils.py - Custom logic for parsing HTML tags into tokens, meant to be used in concert with FaiText (a rough sketch follows this list).
- grad_scrape/../SopSpider.py - The main scraping mechanism, a Scrapy spider (the sketch after this list shows how the three Scrapy pieces fit together).
- grad_scrape/../items.py - Defines the item fields so that data can be gathered cleanly during the crawl.
- grad_scrape/../pipelines.py - Instructs the spider to write scraped results to a CSV or JSON file rather than to the console.
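To make the HTMLutils.py idea concrete, here is a rough sketch of tag-to-token substitution. The token names (xx-prefixed in the fast.ai style) and the tag set are hypothetical, not the file's actual mapping.

```python
import re

# Hypothetical tag-to-token mapping; the real rules live in HTMLutils.py.
TAG_TOKENS = {
    "b": "xxbold",
    "i": "xxital",
    "blockquote": "xxquote",
}

def html_to_tokens(text: str) -> str:
    """Replace known HTML tags with placeholder tokens; drop the rest."""
    def sub(match):
        tag = match.group(1).lower()
        return f" {TAG_TOKENS[tag]} " if tag in TAG_TOKENS else " "
    # Matches opening and closing tags such as <b> or </blockquote>.
    return re.sub(r"</?\s*([a-zA-Z0-9]+)[^>]*>", sub, text)

print(html_to_tokens("I <b>really</b> want to attend"))
# -> "I  xxbold really xxbold  want to attend"
```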
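And a condensed sketch of how the three Scrapy pieces cooperate. Every class name, field, CSS selector, URL, and filename below is a hypothetical stand-in; only the division of labor mirrors the files above.

```python
import scrapy
from scrapy.exporters import CsvItemExporter

class SopItem(scrapy.Item):          # items.py: one Field per gathered datum
    author = scrapy.Field()
    body = scrapy.Field()
    is_op = scrapy.Field()

class SopSpider(scrapy.Spider):      # SopSpider.py: the crawler itself
    name = "sop"
    start_urls = ["https://forum.example/sop-reviews"]  # placeholder URL

    def parse(self, response):
        for post in response.css("div.post"):           # invented selectors
            yield SopItem(
                author=post.css(".author::text").get(),
                body=" ".join(post.css(".content ::text").getall()),
                is_op=bool(post.css(".op-badge").get()),
            )

class CsvExportPipeline:             # pipelines.py: write to file, not console
    def open_spider(self, spider):
        self.file = open("scraped_posts.csv", "wb")     # exporters take bytes
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
```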