Parse Reddit's /r/soccer to associate adjectives with soccer teams. Given an archive of comments, find out which adjectives best describe each team.
- Install Ruby.
- Install bundler: http://bundler.io/
- From the root directory, run
./scripts/run.rb --input-file INPUT-FILE --config-file CONFIG-FILE [--phases PHASES] [--debug]
The input file must be a .csv with the comment body in the first column and the comment id in the second.
Example: ./scripts/run.rb --input-file input/sample.csv --config-file config/teams.yaml
Note that the output for this example is likely to be empty: the sample input contains too few adjectives, and those it does contain are likely to be excluded by the popularity filter.
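To make the expected input layout concrete, here is a minimal sketch of reading such a file with Ruby's standard `csv` library. The helper name `read_comments` and the sample rows are hypothetical, and the sketch assumes the file has no header row; it is not the project's actual loader.

```ruby
require 'csv'

# Hypothetical helper, not part of the project: parses CSV text laid out as
# described above (comment body in column 1, comment id in column 2).
# Assumes there is no header row; rows with a blank body are skipped.
def read_comments(csv_text)
  CSV.parse(csv_text).filter_map do |row|
    body, id = row
    next if body.nil? || body.strip.empty?

    { id: id, body: body }
  end
end

# Made-up sample rows for illustration only.
sample = <<~ROWS
  "Arsenal were brilliant today, honestly.",c1
  "Chelsea looked tired.",c2
  ,c3
ROWS

comments = read_comments(sample)
comments.each { |c| puts "#{c[:id]}: #{c[:body]}" }
```

Quoting the body matters because comments routinely contain commas. For a file on disk, `CSV.foreach(path)` streams rows one at a time instead of loading the whole archive into memory.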
Real data can be downloaded from https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments
Query tables with SELECT body, id FROM <table-name> WHERE subreddit = 'soccer'
- Count team name/adjective pairs used in the same sentence.
- Filter out blacklisted adjectives (nationalities, colors, ...).
- Exclude N most popular adjectives: they are too generic.
- Score adjectives. Promote somewhat unusual words.
- Keep only M adjectives per team.
- Export results to .csv files per league.
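The counting, filtering and scoring steps above can be sketched roughly as follows. Everything here is an illustrative assumption rather than the project's real code: the team list, the hard-coded adjective list (a stand-in for proper part-of-speech tagging), the blacklist, the values of N and M, the naive sentence splitter, and above all the scoring formula, which simply divides a pair's count by the square root of the adjective's overall count so that rarer words rank higher.

```ruby
# All constants are hypothetical; the real values would come from the config file.
TEAMS      = %w[arsenal chelsea].freeze
ADJECTIVES = %w[good red clinical tired wasteful].freeze # stand-in for a POS tagger
BLACKLIST  = %w[red].freeze                               # e.g. colors, nationalities
TOP_N      = 1 # exclude the N most popular adjectives
KEEP_M     = 2 # keep M adjectives per team

def describe_teams(comments)
  pair_counts = Hash.new(0) # [team, adjective] => sentences mentioning both
  adj_counts  = Hash.new(0) # adjective => total across all teams

  comments.each do |body|
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    body.split(/(?<=[.!?])\s+/).each do |sentence|
      words = sentence.downcase.scan(/[a-z']+/)
      teams = words & TEAMS
      adjs  = (words & ADJECTIVES) - BLACKLIST
      teams.product(adjs).each do |pair|
        pair_counts[pair] += 1
        adj_counts[pair[1]] += 1
      end
    end
  end

  # The most popular adjectives are treated as too generic.
  too_generic = adj_counts.sort_by { |adj, n| [-n, adj] }.first(TOP_N).map(&:first)

  scored = Hash.new { |hash, team| hash[team] = [] }
  pair_counts.each do |(team, adj), n|
    next if too_generic.include?(adj)

    # Assumed scoring formula: dampen by overall frequency to favor unusual words.
    scored[team] << [adj, n / Math.sqrt(adj_counts[adj])]
  end

  # Keep the M best-scoring adjectives per team (ties broken alphabetically).
  scored.transform_values do |list|
    list.sort_by { |adj, score| [-score, adj] }.first(KEEP_M).map(&:first)
  end
end

comments = [
  'Arsenal were good. Chelsea looked good and tired.',
  'Arsenal are clinical. Arsenal in red were good.',
  'Chelsea were wasteful, tired and wasteful.'
]
result = describe_teams(comments)
# result is { 'arsenal' => ['clinical'], 'chelsea' => ['tired', 'wasteful'] }
```

In this toy input "good" is the single most popular adjective, so it is dropped as too generic, and "red" never gets counted because it is blacklisted. Each pair is counted at most once per sentence, which is one possible reading of "used in the same sentence".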