This is a utility that collects movie scripts from several sources and builds a database of ~2.3k movie scripts as .txt
files, along with metadata for the movies.
There are four steps to the whole process:
- Collect data from various sources - Scrape websites for scripts in HTML, txt, doc or pdf format
- Remove duplicates from different sources - Automatically remove as many duplicates from different sources as possible
- Collect metadata - Get metadata about the scripts for additional processing
- Parse Scripts - Convert scripts into lines with just Character => dialogue
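As an illustration of what the parsed `Character => dialogue` lines make possible, here is a minimal sketch that groups dialogue by character. The exact output format of the parser is an assumption here (one `NAME => text` pair per line), the function name `group_dialogue` is hypothetical and not part of this repository, and the sample lines are invented.

```python
from collections import defaultdict


def group_dialogue(lines, sep="=>"):
    """Group dialogue by character from lines shaped like 'NAME => text'.

    Assumption: each parsed line holds one character name, the separator,
    and that character's dialogue. Lines without the separator are skipped.
    """
    by_character = defaultdict(list)
    for line in lines:
        if sep not in line:
            continue  # blank line or anything that is not a dialogue line
        # Split on the first separator only, so the dialogue itself
        # may contain '=>' without being cut short.
        name, text = line.split(sep, 1)
        name, text = name.strip(), text.strip()
        if name and text:
            by_character[name].append(text)
    return dict(by_character)


# Invented sample lines, for illustration only.
sample = [
    "ALICE => Did you lock the door?",
    "BOB => I thought you did.",
    "",
    "ALICE => Great.",
]

for character, spoken in group_dialogue(sample).items():
    print(character, len(spoken))
```

Splitting on the first separator only is a deliberate choice: it keeps any later `=>` inside the dialogue text intact.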
To run the pipeline, follow these steps in order:
- Install all dependencies using `pip install -r requirements.txt`.
- Collect all the scripts: `python get_scripts.py`. This might take a while (2+ hrs).
- Remove duplicates and empty files: `python clean_files.py`.
- Collect metadata from TMDb and OMDb: `python get_metadata.py`.
- Parse scripts: `python parse_files.py`.
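The commands above can also be chained from a single Python script, so that a failure in one step stops the later ones. This is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the dependencies are already installed and that it is run from the repository root.

```python
import subprocess
import sys

# Script names as listed in the steps above, in the order they should run.
STEPS = ["get_scripts.py", "clean_files.py", "get_metadata.py", "parse_files.py"]


def build_commands(python=sys.executable, steps=STEPS):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on incomplete data.
        runner(cmd, check=True)


# To run the whole pipeline from the repository root, call:
# run_pipeline(build_commands())
```

The `runner` parameter exists only so the sequence can be checked without launching the real scripts, for example by passing a function that records the commands it receives.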
The sources that scripts are collected from are:
The script for parsing the movie scripts comes from this paper: Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of Association for Computational Linguistics, Vancouver, Canada, 2017
and the code can be found here: https://github.com/usc-sail/mica-text-script-parser