Paper Scraper is a Python script that downloads images from the user's saved category on Reddit.
Posts linking directly to an image or imgur page will be downloaded and unsaved; all other posts will be ignored.
Requirements:

- Python 3.11
- A reddit account
To install Paper Scraper:

- Clone this repository: `git clone https://github.com/samlowe106/PaperScraper.git`
- (Optional) Create a virtual environment with `python -m venv [PATH]` and activate that virtual environment with `source [PATH]/bin/activate`
- Install all requirements: `pip install -r requirements.txt`
- Install pre-commit via `pre-commit install`. Ensure pre-commit is working by running `pre-commit run --all-files`
To set up API access:

- Go to your app preferences on Reddit
- Create a new app and choose script as the app type
- Go to your Applications settings on imgur
- Create a new app
- Configure your environment variables, adding the reddit client ID as "REDDIT_CLIENT_ID", the reddit client secret as "REDDIT_CLIENT_SECRET", and the imgur client ID as "IMGUR_CLIENT_ID"
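For illustration, here's a minimal sketch of how those credentials might be read at startup; the `load_credentials` helper is hypothetical, but the variable names match the step above:

```python
import os


def load_credentials() -> dict[str, str]:
    """Read the API credentials configured above from the environment."""
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "IMGUR_CLIENT_ID")
    missing = [name for name in required if name not in os.environ]
    if missing:
        # Fail fast with a clear message rather than erroring mid-run
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```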
Paper Scraper uses `getpass` to securely read in passwords, so it's incompatible with Python consoles like those in PyCharm. For that reason, it's recommended to run it from the Terminal or Command Line using `python main.py`.
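As a quick illustration of why a real terminal is needed, a `getpass` prompt looks like this (a sketch, not Paper Scraper's exact code):

```python
import getpass

# getpass suppresses keyboard echo, which requires a real terminal;
# IDE consoles like PyCharm's often can't provide one.
password = getpass.getpass("Reddit password: ")
```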
Paper Scraper also comes with a handful of flags, which can be found by running Paper Scraper with the `--help` flag.
Paper Scraper is fairly simple. After basic argument parsing is done, the program has two major steps:
First, reddit submissions are fetched from reddit via PRAW. PRAW provides submissions through "listing generators", which Paper Scraper wraps with `from_saved` and `from_subreddit` functions. These yield submissions as `SubmissionWrapper` objects, a simpler API for interacting with submissions and managing Paper Scraper-related data.
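A rough sketch of what that wrapping might look like, with `SubmissionWrapper` reduced to a stand-in class (the real one carries more state and methods):

```python
from typing import Iterator, Optional

import praw


class SubmissionWrapper:
    """Simplified stand-in for Paper Scraper's wrapper class."""

    def __init__(self, submission: praw.models.Submission):
        self.submission = submission
        self.urls: list[str] = []  # populated later by the parsers


def from_saved(reddit: praw.Reddit, limit: Optional[int] = None) -> Iterator[SubmissionWrapper]:
    # reddit.user.me().saved() returns a PRAW ListingGenerator
    for item in reddit.user.me().saved(limit=limit):
        # Saved items can be comments as well as submissions
        if isinstance(item, praw.models.Submission):
            yield SubmissionWrapper(item)
```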
The url that each `SubmissionWrapper` links to is asynchronously scraped by parser objects (`flickr_parser`, `imgur_parser`, and `single_image_parser`) in a strategy pattern, and any image urls the parsers find are appended to the `SubmissionWrapper.urls` field. If the urls couldn't be accessed, the parsers couldn't find any urls, or the post fails some other criteria specified in the command line arguments, the `SubmissionWrapper` is filtered out of the batch. This process repeats until a batch of valid `SubmissionWrapper`s of the desired size is created, or the underlying generator runs out of new posts.
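A minimal sketch of that strategy pattern; the parser signatures here are assumptions, and the real parsers are asynchronous:

```python
from typing import Callable, Optional

# Each strategy takes a url and returns the image urls it found,
# or None if it doesn't know how to handle that kind of page.
Parser = Callable[[str], Optional[list[str]]]


def single_image_parser(url: str) -> Optional[list[str]]:
    # Hypothetical: accept direct links to image files
    if url.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
        return [url]
    return None


def find_urls(url: str, parsers: list[Parser]) -> list[str]:
    # Try each strategy in turn until one claims the url
    for parser in parsers:
        found = parser(url)
        if found:
            return found
    return []  # no parser matched, so this submission gets filtered out
```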
After a batch of valid `SubmissionWrapper`s is created, each of the images linked in the `SubmissionWrapper.urls` field is downloaded asynchronously and the resulting files are saved. If the `organize` flag is specified, images are also sorted into subdirectories by subreddit. Post data is written to a log file, and the program ends.
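For a sense of the download step, here is a hedged sketch using aiohttp; the function names and the by-subreddit layout (mirroring the `organize` flag) are illustrative assumptions, not Paper Scraper's actual code:

```python
import asyncio
import os

import aiohttp


async def download_image(session: aiohttp.ClientSession, url: str, directory: str) -> None:
    # Save the image under its original filename in the target directory
    os.makedirs(directory, exist_ok=True)
    filename = url.rsplit("/", 1)[-1]
    async with session.get(url) as response:
        response.raise_for_status()
        data = await response.read()
    with open(os.path.join(directory, filename), "wb") as file:
        file.write(data)


async def download_batch(urls_by_subreddit: dict[str, list[str]], root: str) -> None:
    # Download every image concurrently, one subdirectory per subreddit
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            download_image(session, url, os.path.join(root, subreddit))
            for subreddit, urls in urls_by_subreddit.items()
            for url in urls
        ))
```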
Paper Scraper is licensed under the MIT license.