Image Scrapper scrapes a given link, collects the images it finds, and stores them in the database, where they can be queried later. Not only is the image itself saved, but its metadata (image format, height, width, size, etc.) is stored as well.
- Django: To create the backend along with the ORM
- Django Rest Framework: To create the RESTful API
- Beautifulsoup: For web scraping and collecting images
- requests: To handle external URLs
- Pillow: Image processing
- Celery: Background async task handling
- drf_yasg: Swagger API docs
- djangorestframework-simplejwt[crypto]: For JWT authentication
To get started with the project, install the dependencies and run the server. Visit /api-docs for the API documentation.
git clone
virtualenv venv
- For Windows 🪟: E:/image-scrapper/venv/scripts/activate
- For MacOS/Linux 🍏: source venv/bin/activate
pip install -r requirements.txt
python manage.py test
python manage.py runserver
If you have Docker and want to use it instead:
docker build --tag scrapper-api .
docker run --publish 8000:8000 scrapper-api
After running the server, you can visit /api-docs for a Swagger interface. From /api-docs you can interact with the API; it automatically provides a clean UI for testing the endpoints. Alternatively, use /redoc for the Redoc documentation.
Takes a URL and returns a list of images scraped from that URL, saving those images along with their metadata in the database.
{
"url": "https://example.com"
}
Status Code: 200
[
{
"id": 0,
"image_url": "https://example.com/api/image/0",
"image_name": "string",
"parent_url": {
"id": 0,
"link": "https://example.com"
},
"original_url": "https://example.com",
"height": 0,
"width": 0,
"mode": "string",
"format": "string",
"created": "2019-08-24T14:15:22Z",
"updated": "2019-08-24T14:15:22Z"
}
]
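For example, calling the scrape endpoint from Python might look like the sketch below. The endpoint path `/api/scrape/` and the JWT header are assumptions for illustration only; check /api-docs for the project's actual routes.

```python
import requests

BASE_URL = "http://localhost:8000"  # local dev server started with `runserver`

# Hypothetical route; confirm the real path in /api-docs.
response = requests.post(
    f"{BASE_URL}/api/scrape/",
    json={"url": "https://example.com"},
    # simplejwt is listed as a dependency, so an access token may be required.
    headers={"Authorization": "Bearer <access-token>"},
)
response.raise_for_status()

for image in response.json():
    print(image["id"], image["image_url"], image["width"], image["height"])
```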
Returns the metadata and image link, if given a valid image id.
{
"url": "https://example.com"
}
Status Code: 200
{
"id": 0,
"image_url": "https://example.com/api/image/0",
"image_name": "string",
"parent_url": {
"id": 0,
"link": "https://example.com"
},
"original_url": "https://example.com",
"height": 0,
"width": 0,
"mode": "string",
"format": "string",
"created": "2019-08-24T14:15:22Z",
"updated": "2019-08-24T14:15:22Z"
}
Returns the metadata and image link, if given a valid image id.
{
"url": "https://example.com"
}
Status Code: 204
Returns a list of saved metadata and image links, if given a valid parent URL (the URL that was used to scrape the images).
{
"url": "https://example.com"
}
Status Code: 200
{
"id": 0,
"image_url": "https://example.com/api/image/0",
"image_name": "string",
"parent_url": {
"id": 0,
"link": "https://example.com"
},
"original_url": "https://example.com",
"height": 0,
"width": 0,
"mode": "string",
"format": "string",
"created": "2019-08-24T14:15:22Z",
"updated": "2019-08-24T14:15:22Z"
}
Returns a list of saved metadata and image links, if given a valid original image URL.
{
"url": "https://example.com/image/image.jpeg" // Original Image URL
}
Status Code: 200
{
"id": 0,
"image_url": "https://example.com/api/image/0",
"image_name": "string",
"parent_url": {
"id": 0,
"link": "https://example.com"
},
"original_url": "https://example.com",
"height": 0,
"width": 0,
"mode": "string",
"format": "string",
"created": "2019-08-24T14:15:22Z",
"updated": "2019-08-24T14:15:22Z"
}
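A lookup against these two endpoints could be sketched as follows. The paths `/api/images/by-parent/` and `/api/images/by-original/` are placeholders, not the project's confirmed routes; only the `{"url": ...}` request body shown above is taken from the documentation.

```python
import requests

BASE_URL = "http://localhost:8000"

def lookup(path, url):
    """POST a URL to a lookup endpoint and return the saved metadata.

    The paths passed in below are hypothetical; use the ones listed in /api-docs.
    """
    response = requests.post(f"{BASE_URL}{path}", json={"url": url})
    response.raise_for_status()
    return response.json()

# All images scraped from a parent page.
by_parent = lookup("/api/images/by-parent/", "https://example.com")

# The record matching one specific original image URL.
by_original = lookup("/api/images/by-original/", "https://example.com/image/image.jpeg")
```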
Deletes all previously saved images for the URL, then re-scrapes and stores them again.
{
"url": "https://example.com" // Original Image URL
}
Status Code: 200
{
"id": 0,
"image_url": "https://example.com/api/image/0",
"image_name": "string",
"parent_url": {
"id": 0,
"link": "https://example.com"
},
"original_url": "https://example.com",
"height": 0,
"width": 0,
"mode": "string",
"format": "string",
"created": "2019-08-24T14:15:22Z",
"updated": "2019-08-24T14:15:22Z"
}
| Parameter | Type | Default | Options |
|---|---|---|---|
| width | integer/string | Image default width | small, medium, large |
| height | integer | Image default height | small, medium, large |
| quality | integer | 100 | Any number between 1 and 100 |
| format | string | Image default | "gif", "png", "jpeg", "jpg", "bmp", "webp" |
Note: If both height and width are given, only width is applied, in order to maintain the aspect ratio.
/image/2?width=small
or
/image/2?width=678
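For instance, fetching a resized copy of image 2 with these query parameters might look like the minimal sketch below. The `/image/{id}` path follows the example URLs above; saving the bytes to disk is purely illustrative.

```python
import requests

BASE_URL = "http://localhost:8000"

# Request image 2 resized to 678px wide, re-encoded as WEBP at 80% quality.
response = requests.get(
    f"{BASE_URL}/image/2",
    params={"width": 678, "quality": 80, "format": "webp"},
)
response.raise_for_status()

with open("image-2.webp", "wb") as fh:
    fh.write(response.content)
```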
Workflow
There are two main endpoints: one scrapes the given link for images and stores them, and the other serves an image queried by id.
- How does the scraping and saving work?
- When the API is called, the requests library fetches the HTTP response content from the link, then the bs4 module parses the response, extracts the `<img/>` tags, and collects the image sources/links. These links are passed to the Pillow Image module to parse each image and extract its metadata; the image is then saved to storage and its metadata is stored in the database. Each image is assigned an id, through which it can be queried later (see the sketch after this list).
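The core of that flow could be sketched roughly as below. This is not the project's actual code, just an illustration of the requests + BeautifulSoup + Pillow pipeline it describes; the `scrape_images` function and the returned dictionary keys are assumptions based on the schema listed further down.

```python
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

def scrape_images(parent_url: str) -> list:
    """Fetch a page, find its <img> tags, and collect image bytes + metadata."""
    page = requests.get(parent_url, timeout=10)
    page.raise_for_status()

    soup = BeautifulSoup(page.content, "html.parser")
    results = []
    for tag in soup.find_all("img"):
        src = tag.get("src")
        if not src:
            continue
        original_url = urljoin(parent_url, src)  # resolve relative links

        raw = requests.get(original_url, timeout=10).content
        image = Image.open(BytesIO(raw))  # Pillow parses the image and exposes metadata

        results.append({
            "parent_url": parent_url,
            "original_url": original_url,
            "height": image.height,
            "width": image.width,
            "mode": image.mode,
            "format": image.format,
            "bytes": raw,  # would be written to storage and recorded in the database
        })
    return results
```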
The model schema of Image
id: Integer
image: String | Image location in the file system
parent_url: String | The URL that was scraped to get the image
original_url: String | Original URL of the image
height: Integer | Image height
width: Integer | Image width
mode: String | Image color mode
format: String | Image format
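A Django model matching that schema might look like the following sketch. The class names, the separate parent-URL model (implied by the nested "parent_url" object in the API responses), and the field options are assumptions; the project's real models may differ.

```python
from django.db import models

class ParentUrl(models.Model):
    # The page that was scraped; assumed from the nested "parent_url" object in responses.
    link = models.URLField(unique=True)

class ScrapedImage(models.Model):
    image = models.ImageField(upload_to="images/")    # location in the file system
    parent_url = models.ForeignKey(ParentUrl, on_delete=models.CASCADE, related_name="images")
    original_url = models.URLField()                  # original URL of the image
    height = models.IntegerField()
    width = models.IntegerField()
    mode = models.CharField(max_length=16)            # Pillow color mode, e.g. "RGB"
    format = models.CharField(max_length=16)          # Pillow format, e.g. "JPEG"
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)
```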