Paper Scraper is a Python script that downloads images from the user's saved category on Reddit.
Posts linking directly to an image or imgur page will be downloaded and unsaved; all other posts will be ignored.
Requirements:

- Python 3.11
- A reddit account
To install Paper Scraper:

- Clone this repository: `git clone https://github.com/samlowe106/PaperScraper.git`
- (Optional) Create a virtual environment with `python -m venv [PATH]` and activate that virtual environment with `source [PATH]/bin/activate`
- Install all requirements: `pip install -r requirements.txt`
- Install pre-commit via `pre-commit install`. Ensure pre-commit is working by running `pre-commit run --all-files`
To set up API access:

- Go to your app preferences on Reddit
- Create a new app and choose script as the app type
- Go to your Applications settings on imgur
- Create a new app
- Configure your environment variables, adding the reddit client ID as "REDDIT_CLIENT_ID", the reddit client secret as "REDDIT_CLIENT_SECRET", and the imgur client ID as "IMGUR_CLIENT_ID"
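For illustration, here's a minimal sketch of how those credentials might be read at startup; the `load_credentials` helper is hypothetical, but the variable names match the step above:

```python
import os


def load_credentials() -> dict[str, str]:
    """Read the API credentials configured above from the environment."""
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "IMGUR_CLIENT_ID")
    missing = [name for name in required if name not in os.environ]
    if missing:
        # Fail fast with a clear message rather than erroring mid-run
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```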
Paper Scraper uses `getpass` to securely read in passwords, so it's incompatible with Python consoles like those in PyCharm. For that reason, it's recommended to run it from the Terminal or Command Line using `python main.py`.
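As a quick illustration of why a real terminal is needed, a `getpass` prompt looks like this (a sketch, not Paper Scraper's exact code):

```python
import getpass

# getpass suppresses keyboard echo, which requires a real terminal;
# IDE consoles like PyCharm's often can't provide one.
password = getpass.getpass("Reddit password: ")
```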
Paper Scraper also comes with a handful of flags, which can be found by running Paper Scraper with the `--help` flag.
Paper Scraper is fairly simple. After basic argument parsing is done, the program has two major steps:
First, reddit submissions are fetched from reddit via PRAW. PRAW provides submissions through "listing generators", which Paper Scraper wraps with `from_saved` and `from_subreddit` functions. These yield submissions as `SubmissionWrapper` objects, a simpler API for interacting with submissions and managing Paper Scraper-related data.
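A rough sketch of what that wrapping might look like, with `SubmissionWrapper` reduced to a stand-in class (the real one carries more state and methods):

```python
from typing import Iterator, Optional

import praw


class SubmissionWrapper:
    """Simplified stand-in for Paper Scraper's wrapper class."""

    def __init__(self, submission: praw.models.Submission):
        self.submission = submission
        self.urls: list[str] = []  # populated later by the parsers


def from_saved(reddit: praw.Reddit, limit: Optional[int] = None) -> Iterator[SubmissionWrapper]:
    # reddit.user.me().saved() returns a PRAW ListingGenerator
    for item in reddit.user.me().saved(limit=limit):
        # Saved items can be comments as well as submissions
        if isinstance(item, praw.models.Submission):
            yield SubmissionWrapper(item)
```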
The url that each `SubmissionWrapper` links to is asynchronously scraped by parser objects (`flickr_parser`, `imgur_parser`, and `single_image_parser`) in a strategy pattern, and any image urls the parsers find are appended to the `SubmissionWrapper.urls` field. If the urls couldn't be accessed, the parsers couldn't find any urls, or the post fails some other criteria specified in the command line arguments, the `SubmissionWrapper` is filtered out of the batch. This process repeats until a batch of valid `SubmissionWrapper`s of the desired size is created, or the underlying generator runs out of new posts.
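A minimal sketch of that strategy pattern; the parser signatures here are assumptions, and the real parsers are asynchronous:

```python
from typing import Callable, Optional

# Each strategy takes a url and returns the image urls it found,
# or None if it doesn't know how to handle that kind of page.
Parser = Callable[[str], Optional[list[str]]]


def single_image_parser(url: str) -> Optional[list[str]]:
    # Hypothetical: accept direct links to image files
    if url.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
        return [url]
    return None


def find_urls(url: str, parsers: list[Parser]) -> list[str]:
    # Try each strategy in turn until one claims the url
    for parser in parsers:
        found = parser(url)
        if found:
            return found
    return []  # no parser matched, so this submission gets filtered out
```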
After a batch of valid `SubmissionWrapper`s is created, each of the images linked in the `SubmissionWrapper.urls` field is downloaded asynchronously and the resulting files are saved. If the `organize` flag is specified, images are also sorted into subdirectories by subreddit. Post data is written to a log file, and the program ends.
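For a sense of the download step, here is a hedged sketch using aiohttp; the function names and the by-subreddit layout (mirroring the `organize` flag) are illustrative assumptions, not Paper Scraper's actual code:

```python
import asyncio
import os

import aiohttp


async def download_image(session: aiohttp.ClientSession, url: str, directory: str) -> None:
    # Save the image under its original filename in the target directory
    os.makedirs(directory, exist_ok=True)
    filename = url.rsplit("/", 1)[-1]
    async with session.get(url) as response:
        response.raise_for_status()
        data = await response.read()
    with open(os.path.join(directory, filename), "wb") as file:
        file.write(data)


async def download_batch(urls_by_subreddit: dict[str, list[str]], root: str) -> None:
    # Download every image concurrently, one subdirectory per subreddit
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            download_image(session, url, os.path.join(root, subreddit))
            for subreddit, urls in urls_by_subreddit.items()
            for url in urls
        ))
```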
Paper Scraper is licensed under the MIT license.