subreddit-comments-dl


Download all the text comments from a subreddit

Use the script subreddit_downloader.py (as many times as you need) to download the data.
Then run the script dataset_builder.py to build a single dataset.

🖱 More info on the website and on Medium.

🚀 Usage

Basic usage to download submissions and their comments from the subreddits AskReddit and News:

```bash
# Use Python 3.8.5

# Install the dependencies
pip install -r requirements.txt

# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments created after 1 January 2021 (UTC 1609459201)
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459201

# Build the dataset, the results will be under the `./dataset/` path
python src/dataset_builder.py
```

ℹ️ Where can I get the Reddit parameters?

| Parameter name | Description | How to get it | Example value |
| --- | --- | --- | --- |
| reddit_id | The client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
| reddit_secret | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
| reddit_username | The Reddit account name | The name you use to log in | pistoSniffer |
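
These are the same credentials that praw (used under the hood, see the Notes section below) expects. As a reference, here is a minimal sketch of how they map to a `praw.Reddit` instance; the user-agent format shown is an illustrative assumption, not taken from the repository code:

```python
import praw

# Illustrative values: replace them with the ones from your Reddit apps page.
reddit = praw.Reddit(
    client_id="<reddit_id>",          # --reddit-id
    client_secret="<reddit_secret>",  # --reddit-secret
    user_agent="python:subreddit-comments-dl (by u/<reddit_username>)",  # built from --reddit-username (assumed format)
)
print(reddit.read_only)  # True: no password is needed for read-only scraping
```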

⬇️ Output

A new folder with two CSV files is created by dataset_builder.py. The script has a few notable features:

  • It removes rows with the same id
  • It accepts a caching_size parameter so the whole dataset does not have to be kept in RAM (a rough sketch of the idea follows)
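
As an illustration of the idea behind these two features (this is a hypothetical sketch, not the actual dataset_builder.py code), duplicates can be dropped while buffering at most caching_size rows before each write:

```python
import csv

def deduplicate(input_paths, output_path, caching_size=10_000):
    """Stream rows from several CSV files, skip duplicate ids,
    and flush to disk every `caching_size` rows."""
    seen_ids, buffer, writer = set(), [], None
    with open(output_path, "w", newline="") as out:
        for path in input_paths:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    if row["id"] in seen_ids:
                        continue  # duplicate row: keep only the first occurrence
                    seen_ids.add(row["id"])
                    buffer.append(row)
                    if writer is None:
                        writer = csv.DictWriter(out, fieldnames=row.keys())
                        writer.writeheader()
                    if len(buffer) >= caching_size:
                        writer.writerows(buffer)
                        buffer.clear()
        if writer is not None and buffer:
            writer.writerows(buffer)
```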

The two CSV files have the following structure:

submissions.csv

Each row is a submission from a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | MTB |
| id | Unique identifier of the submission | lhr2bo |
| created_utc | UTC timestamp when the submission was created | 1613068060 |
| title | Title of the submission | Must ride So... |
| selftext | Text of the submission | What are the best trails to ride in... |
| full_link | Reddit unique link to the submission | https://www.reddit.com/r/MTB/comments/lhr2bo/must_ride_so_cali_trails/ |

comments.csv

Each row is a comment under a submission of a specific subreddit; the id field is unique across the dataset (primary key).

| Column name | Description | Example |
| --- | --- | --- |
| subreddit | Name of the subreddit | News |
| id | Unique identifier of the comment | gmz45xo |
| submission_id | Id of the comment's parent submission | lhssi4 |
| body | Text of the comment | We're past the point... |
| created_utc | UTC timestamp when the comment was created | 1613072734 |
| parent_id | Id of the parent in the comment tree structure | t3_lhssi4 |
| permalink | Reddit unique link to the comment | /r/news/comments/lhssi4/air_force_wants_to_know_if_key_pacific_airfield/gmz45xo/ |
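
Once dataset_builder.py has produced the two files, they can be joined on submission_id. Below is a minimal sketch using pandas (pandas is not a declared dependency of this project, and the exact file paths under `./dataset/` are assumed here for illustration):

```python
import pandas as pd

# Assumed paths: adjust them to the files actually produced under ./dataset/.
submissions = pd.read_csv("dataset/submissions.csv")
comments = pd.read_csv("dataset/comments.csv")

# Attach each comment to its submission via the submission_id foreign key.
merged = comments.merge(
    submissions[["id", "title", "full_link"]],
    left_on="submission_id",
    right_on="id",
    suffixes=("_comment", "_submission"),
)
print(merged[["subreddit", "title", "body"]].head())
```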

📖 Glossary

  • subreddit: a section of the Reddit website focused on a particular topic

  • submission: a post that appears in a subreddit (all the posts you see when you open a subreddit page). Each submission has a tree of _comments_

  • comment: text written by a Reddit user under a submission inside a subreddit

    • The main goal of this repository is to gather the comments belonging to a subreddit

✍️ Notes and Q&A

  • Under the hood the script uses pushshift to gather the submission ids and praw to collect the submission comments (see the sketch after the help output below)
    • With this approach less data is requested from pushshift
    • Because the praw API is used, Reddit credentials are required
  • More info about the subreddit_downloader.py script is available via the --help command (full output below)
  • Other packages:
    • psaw: Python Pushshift.io API Wrapper
  • [?] Empty data CSV files:
    • Sometimes an empty csv appears under /data/<subreddit>/<timestamp>/comments/xxx.csv
    • This happens when a batch of submissions has no comments; you can verify it by opening the equivalent /data/<subreddit>/<timestamp>/submissions/xxx.csv file (same xxx.csv name) and following the submission link
  • [?] The program is stuck and doesn't progress:
    • Call the program with the --debug flag to find out on which submission the program is freezing
    • Most likely the program is blocked on a submission with more than 10k comments, and the praw API needs to make many requests to gather all the data (which takes a long time)
    • If you don't want to wait, or you want more control over the number of comments fetched per submission, use the --comments-cap parameter
    • When provided, the system requests new comments from the praw API comments_cap times instead of downloading all comments (see the sketch after the help output below)
      • The higher the value, the more comments are downloaded
      • Set it to 0 to download only the comments shown on the first page of the submission
      • Set it to 64 to be reasonably sure the system downloads a good amount of data
      • Tune the parameter to your needs
```
python src/subreddit_downloader.py --help
Usage: subreddit_downloader.py [OPTIONS] SUBREDDIT

  Download all the submissions and relative comments from a subreddit.

Arguments:
  SUBREDDIT  The subreddit name  [required]

Options:
  --output-dir TEXT       Optional output directory  [default: ./data/]
  --batch-size INTEGER    Request `batch_size` submission per time  [default:
                          10]

  --laps INTEGER          How many times request `batch_size` reddit
                          submissions  [default: 3]

  --reddit-id TEXT        Reddit client_id, visit https://github.com/reddit-
                          archive/reddit/wiki/OAuth2  [required]

  --reddit-secret TEXT    Reddit client_secret, visit
                          https://github.com/reddit-archive/reddit/wiki/OAuth2
                          [required]

  --reddit-username TEXT  Reddit username, used for build the `user_agent`
                          string, visit https://github.com/reddit-
                          archive/reddit/wiki/API  [required]

  --utc-after TEXT        Fetch the submissions after this UTC date
  --utc-before TEXT       Fetch the submissions before this UTC date
  --comments-cap INTEGER  Some submissions have 10k> nested comments and stuck
                          the praw API call. If provided, the system requires
                          new comments `comments_cap` times to the praw API.
                          `comments_cap` under the hood will be passed
                          directly to `replace_more` function as `limit`
                          parameter. For more info see the README and visit
                          https://asyncpraw.readthedocs.io/en/latest/code_overview/other/commentforest.html#asyncpraw.models.comment_forest.CommentForest.replace_more

  --debug / --no-debug    Enable debug logging  [default: False]
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or
                          customize the installation.

  --help                  Show this message and exit.
```
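
To make the notes above more concrete, here is a minimal sketch of the two-step approach and of what --comments-cap controls. It calls psaw and praw directly; subreddit_downloader.py wraps this logic itself (and, judging from the help text, may use asyncpraw instead), so treat the names and parameters below as illustrative only:

```python
import praw
from psaw import PushshiftAPI

# Illustrative credentials: see the parameters table above.
reddit = praw.Reddit(
    client_id="<reddit_id>",
    client_secret="<reddit_secret>",
    user_agent="python:subreddit-comments-dl (by u/<reddit_username>)",
)
pushshift = PushshiftAPI()

# Step 1: ask pushshift only for a batch of submission ids (less data requested).
submission_ids = [s.id for s in pushshift.search_submissions(subreddit="AskReddit", limit=10)]

# Step 2: collect the comments of each submission through the praw API.
comments_cap = 64  # what --comments-cap controls; limit=None would expand every comment
for submission_id in submission_ids:
    submission = reddit.submission(id=submission_id)
    # replace_more(limit=...) resolves "load more comments" placeholders at most
    # `limit` times; with limit=0 only the already-loaded first page is kept.
    submission.comments.replace_more(limit=comments_cap)
    for comment in submission.comments.list():
        print(submission_id, comment.id, comment.body[:80])
```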

💤 TODO

dataset_builder.py

  • store some dataset info (subreddit, max/min UTC and human-readable datetimes, number of lines)

subreddit_downloader.py

  • use async functions if possible to gather more data concurrently
  • load the user credentials in subreddit_downloader.py from a local config file
  • store/log both the UTC and the human-readable datetime
  • use case: download all data from X datetime until now
    • early stopping if no new data is fetched
  • refactor dataset_builder.py:_rows_parser: find a more efficient approach to check for duplicate ids
    • maybe switch to pandas as the matrix manager
  • should we switch to psaw?