Crawl a website and download images up to a specified depth.
```bash
uv run python crawl.py <start_url> <depth>
```
- `start_url` - The URL to start crawling from
- `depth` - The maximum depth of links to follow
For example:

```bash
uv run python crawl.py "https://www.langchain.com" 1
```

This will start crawling from https://www.langchain.com, following links up to a depth of 1, and downloading any images found along the way.
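For reference, the entry point that handles these two arguments might look something like the following minimal sketch (the real crawl.py may parse them differently):

```python
# Hypothetical sketch of the command-line handling; crawl.py's actual
# entry point may differ.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Crawl a website and download images up to a specified depth."
    )
    parser.add_argument("start_url", help="The URL to start crawling from")
    parser.add_argument("depth", type=int, help="The maximum depth of links to follow")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Crawling {args.start_url} to depth {args.depth}")
```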
Downloaded images and a JSON file listing all images will be saved to the `images/` directory. The script generates an `images.json` file with metadata about all downloaded images, in the following format:
```json
{
  "images": [
    {
      "url": "https://framerusercontent.com/images/t18A4tlmjN2gLQ8jHIyOBTtnzw.png",
      "page": "https://www.langchain.com",
      "depth": 1
    },
    {
      "url": "https://framerusercontent.com/images/KA746UxB9OGmWwcvKeFeZBv0TxY.svg",
      "page": "https://www.langchain.com",
      "depth": 1
    },
    {
      "url": "https://framerusercontent.com/images/ON1gmAd4rngG30H3qHZpIrpBVw.png",
      "page": "https://www.langchain.com",
      "depth": 1
    },
    {
      "url": "https://framerusercontent.com/images/TscdHUIz9BEEgHWHa6GlbIFuYZw.png",
      "page": "https://www.langchain.com",
      "depth": 1
    },
    {
      "url": "https://framerusercontent.com/images/FX0cg2i7uqcgKaINPfXTeJ1mWU.png",
      "page": "https://www.langchain.com",
      "depth": 1
    }
  ]
}
```
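As a quick sanity check, the manifest can be loaded back with the standard library. This sketch assumes the `images/images.json` path described above:

```python
# Load the manifest and count downloaded images per crawl depth.
import json
from collections import Counter
from pathlib import Path

manifest = json.loads(Path("images/images.json").read_text())
by_depth = Counter(entry["depth"] for entry in manifest["images"])
for depth, count in sorted(by_depth.items()):
    print(f"depth {depth}: {count} image(s)")
```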
It also saves each image to the `images/` directory, named after the filename in its URL, as sketched below.
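For example, a URL maps to a local path like this (a sketch; it assumes the script simply takes the last path segment of the URL as the filename):

```python
# Derive the local save path from an image URL's final path segment.
from pathlib import Path
from urllib.parse import urlparse

def local_path(image_url: str) -> Path:
    return Path("images") / Path(urlparse(image_url).path).name

print(local_path("https://framerusercontent.com/images/t18A4tlmjN2gLQ8jHIyOBTtnzw.png"))
# images/t18A4tlmjN2gLQ8jHIyOBTtnzw.png
```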
To run the included tests:

```bash
uv run pytest -vv
```
We use two key parameters to control how deep we go into a website to download images (see the sketch after this list):

- `max_depth`: Determines how far we can go from the starting page to find images. If `max_depth` is set to 1, we will only download images from the starting page. If it is set to 2, we will also download images from any page directly linked to it, and so on.
- `current_depth`: Keeps track of how deep we are within the website's structure. It begins at 1 on the starting page and increases as we extract links from the HTML content and add them to the queue.
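The interplay between the two is easiest to see in code. The following is a self-contained, stdlib-only sketch of the depth-limited crawl loop, not the script's actual implementation (which may use a proper HTML parser rather than the regex scraping used here for brevity):

```python
# Illustrative depth-limited, breadth-first crawl (stdlib only); the
# regex-based HTML scraping is a simplification for the sketch.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(start_url: str, max_depth: int) -> list[dict]:
    images: list[dict] = []
    queue = deque([(start_url, 1)])  # current_depth starts at 1 on the start page
    seen = {start_url}
    while queue:
        url, current_depth = queue.popleft()
        html = urlopen(url).read().decode("utf-8", errors="replace")
        # Record every <img> source found on this page.
        for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
            images.append({"url": urljoin(url, src), "page": url, "depth": current_depth})
        # Only follow links while we are still under max_depth.
        if current_depth < max_depth:
            for href in re.findall(r'<a[^>]+href="([^"]+)"', html):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append((link, current_depth + 1))
    return images
```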