This project is an extension of the Instagram scraper built by rarcega.
It is designed to organize the scraped Instagram data neatly in AWS S3, according to this structure:

```
S3_BUCKET_NAME/
|-- instagram/
    |-- TARGET_USER/
        |-- full-metadata.json: Contains metadata for the entire operation
        |-- [POST_ID_X]/
        |   |-- [POST_ID_X].jpg: Image of the post
        |   |-- summary.json: Key information associated with the post
        |-- [POST_ID_Y]/
        |   |-- [POST_ID_Y].jpg
        |   |-- summary.json
        |-- ...
```
- Each post by the target Instagram user is stored in its own folder.
- Each folder contains the post's image as well as its associated metadata.
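The layout above can be sketched as S3 key construction in Python. This is a minimal illustration of the naming scheme only; the helper functions are hypothetical and not part of the scraper's actual code:

```python
# Sketch of how S3 object keys map to the layout above.
# These function names are illustrative, not the scraper's real API.

def post_keys(target_user: str, post_id: str) -> dict:
    """Build the S3 keys for a single post's image and summary file."""
    prefix = f"instagram/{target_user}/{post_id}"
    return {
        "image": f"{prefix}/{post_id}.jpg",
        "summary": f"{prefix}/summary.json",
    }

def metadata_key(target_user: str) -> str:
    """Build the S3 key for the operation-wide metadata file."""
    return f"instagram/{target_user}/full-metadata.json"

print(post_keys("some_user", "B12345"))
print(metadata_key("some_user"))
```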
These instructions were designed for Ubuntu 18.04.
You will need to create a `config.py` file with the following contents:
```
AWS_ACCESS_KEY_ID = [YOUR AWS_ACCESS_KEY_ID]
AWS_SECRET_ACCESS_KEY = [YOUR AWS_SECRET_ACCESS_KEY]
AWS_REGION_NAME = [YOUR AWS_REGION_NAME]
S3_BUCKET_NAME = [YOUR AWS_S3_BUCKET_NAME]
INSTAGRAM_USER_ID = [YOUR INSTAGRAM_USER_ID]
INSTAGRAM_USER_PASSWORD = [YOUR INSTAGRAM_USER_PASSWORD]
TARGET_INSTAGRAM_USER = [YOUR TARGET_INSTAGRAM_USER TO SCRAPE DATA FROM]
```
A `config_template.py` file has been provided for your convenience.
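Before running the scraper, you may want to verify that every required name is defined in your `config.py`. The helper below is a hypothetical sanity check, not part of this repository:

```python
# Hypothetical sanity check: confirm every required setting exists in config.py.
REQUIRED_SETTINGS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION_NAME",
    "S3_BUCKET_NAME",
    "INSTAGRAM_USER_ID",
    "INSTAGRAM_USER_PASSWORD",
    "TARGET_INSTAGRAM_USER",
]

def missing_settings(config_module) -> list:
    """Return the names of required settings absent or empty in the module."""
    return [name for name in REQUIRED_SETTINGS
            if not getattr(config_module, name, None)]

# Example usage:
# import config
# missing = missing_settings(config)
# if missing:
#     raise SystemExit(f"config.py is missing: {', '.join(missing)}")
```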
Now, follow these instructions to obtain the variables above.

- Lines 1-3 relate to your AWS credentials and region.
- Line 4 is the name of your AWS S3 bucket.
- Lines 5-7 are self-explanatory; TARGET_INSTAGRAM_USER is the username of the account you intend to scrape data from.

NOTE: Your Instagram user ID and password are required only to scrape data from private users that you follow.
- Clone this repository.

```
git clone https://github.com/Jordan396/S3-Compatible-Instagram-Scraper.git
cd S3-Compatible-Instagram-Scraper/
```
- Create a virtual environment and activate it.

```
python3 -m venv venv
source venv/bin/activate
```
- Install dependencies.

```
pip install -r requirements.txt
```
- Add your `config.py` from above to the base directory.
- Start scraping!

```
python scrape.py
```
- Navigate to your S3 bucket to view the scraped data.