Skip to content

Data collection scripts for analysis of Reddit

License

Notifications You must be signed in to change notification settings

RedditEpidemicAnalysis/data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 

Repository files navigation

Reddit Analysis on COVID-19

PowerShell CC0 license

Below can be found a list of data retrieval scripts that help make this work possible. The pathing can be changed to any desired location.

  1. Open an ADMIN PowerShell window install the CLI tools.
    pip install psaw
    pip install python-dateutil
    
  2. Open a PowerShell window
    $subreddits = @('COVID19positive', 'COVID19_support')
    $months = @(
       '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', 
       '2020-10', '2020-11', '2020-12', '2021-01', '2021-02', '2021-03')
    for($j = 0; $j -lt $subreddits.length; $j++) {
       for($i = 0; $i -lt $months.length; $i++) {
          $subreddit = $subreddits[$j]
          $month = [datetime]::ParseExact($months[0], 'yyyy-MM', $null)
          $after = $month.ToString('yyyy-MM-dd HH:mm:ss')
          $before = $month.AddMonths(1).AddSeconds(-1).ToString('yyyy-MM-dd HH:mm:ss')
          $month = $month.ToString('yyyy-MM')
          Write-host "$subreddit ($after, $before)"
          psaw `
             -s $subreddit `
             -l 1000000 `
             --format json `
             --after  $after `
             --before $before `
             -f id,created_utc,author,title,selftext `
             -o "d:/datasets/reddit/$subreddit.submissions.$month.json" `
             --prettify --verbose `
             submissions
          Start-Sleep -s 20
          psaw `
             -s $subreddit `
             -l 1000000 `
             --format json `
             --after $after `
             --before $before `
             -f id,parent_id,created_utc,author,body `
             -o "d:/datasets/reddit/$subreddit.comments.$month.json" `
             --prettify --verbose `
             comments
          Start-Sleep -s 20
       }}
    

Citation

If you use the dataset in academic work, please consider citing it based on the original source. All the .tar.gzs in Releases are a cache of a cache.

@misc{baumgartner2021,
  title={Historical submissions from /r/COVID19positive},
  author={Jason Baumgartner},
  year={2021},
  publisher={pushshift.io},
  url={https://pushshift.io/},
  urldate={2021-01-07}
}
@misc{marx2021python,
  title={Python Pushshift.io API Wrapper},
  author={David Marx},
  year={2021},
  url={https://github.com/dmarx/psaw},
  urldate={2021-01-07}
}