
ejlb/google-open-image-download

Google open image download

A py2/py3 script for downloading and rescaling the Open Images dataset in parallel. Here it is maxing out a 200 Mbit/s pipe over 5 days.

[Image: Maxing out a 200mbit pipe]

setup

To install dependencies run

pip install -r requirements

Follow the instructions on the open image data repo to get the list of image urls.

usage

The two required arguments are input and output. Input is the CSV file of URLs from the open image data set. Output is a directory where the scaled images will be saved.

By default, the images will be scaled so that the smallest dimension is equal to 256 (controlled by the min-dim arg). The saved images are placed in sub-directories for efficiency (the number of which is controlled by the sub-dirs arg). The name of the saved image corresponds to Google's ImageID which can be used to look up labels in the open image dataset.
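The scaling rule and the sub-directory layout described above can be sketched as follows. This is only an illustration, not the script's actual code: the function names, the crc32-based bucketing, the default of 1000 sub-directories, and the `.jpg` extension are all assumptions made for the example.

```python
import os
import zlib


def scaled_size(width, height, min_dim=256):
    """Return (new_width, new_height) with the smaller side equal to min_dim."""
    scale = min_dim / float(min(width, height))
    # Pin the shorter side to exactly min_dim and round the longer side.
    if width <= height:
        return min_dim, max(min_dim, int(round(height * scale)))
    return max(min_dim, int(round(width * scale))), min_dim


def sub_dir_for(image_id, sub_dirs=1000):
    """Map an ImageID to a stable bucket index in [0, sub_dirs)."""
    # crc32 is deterministic across runs; the mask keeps py2 and py3 in agreement.
    return (zlib.crc32(image_id.encode("utf-8")) & 0xFFFFFFFF) % sub_dirs


def output_path(output_dir, image_id, sub_dirs=1000):
    """Build output_dir/<bucket>/<ImageID>.jpg (the extension is an assumption)."""
    bucket = str(sub_dir_for(image_id, sub_dirs))
    return os.path.join(output_dir, bucket, image_id + ".jpg")
```

Spreading files over many sub-directories keeps each directory small, which is what makes listing and lookup cheaper on most filesystems. Keeping the ImageID as the file name preserves the link back to the dataset's labels.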

Use --help to see the other optional args.
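For orientation, a command-line interface like the one described here could be declared with argparse roughly as below. The real script's argument spellings and defaults may differ: positional `input`/`output`, the `--min-dim` and `--sub-dirs` spellings, and the sub-dirs default of 1000 are assumptions for this sketch. Only the 256 default for min-dim comes from the text above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the arguments described in this README."""
    parser = argparse.ArgumentParser(
        description="Download and rescale open image dataset images (sketch).")
    # The two required arguments.
    parser.add_argument("input", help="CSV file of image URLs")
    parser.add_argument("output", help="directory where scaled images are saved")
    # Optional arguments; the sub-dirs default is an assumption.
    parser.add_argument("--min-dim", type=int, default=256,
                        help="size of the smallest dimension after scaling")
    parser.add_argument("--sub-dirs", type=int, default=1000,
                        help="number of sub-directories to spread images over")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse generates the `--help` output automatically from the `help=` strings, so no extra code is needed for that flag.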

notes

I'm not using asyncio because the worker processes also scale each image, which is CPU-bound work, so we wouldn't see much speed-up.
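To illustrate the trade-off, here is a minimal sketch of a process pool where each worker handles both the download and the resize. It is not taken from the script: `multiprocessing.Pool`, the function names, and the task tuple are assumptions, and the worker only computes target sizes instead of fetching or decoding anything.

```python
import multiprocessing


def fetch_and_scale(task):
    """Stand-in worker: returns (image_id, new_width, new_height)."""
    image_id, width, height = task
    # A real worker would download the image here (I/O-bound) and then
    # decode, resize and re-encode it (CPU-bound). The CPU-bound half is
    # why separate processes help more than a single asyncio event loop.
    scale = 256.0 / min(width, height)
    return image_id, int(round(width * scale)), int(round(height * scale))


def run(tasks, processes=None):
    """Process all tasks in a pool; processes=None uses one worker per core."""
    pool = multiprocessing.Pool(processes)
    try:
        # imap_unordered yields results as workers finish, in any order.
        return sorted(pool.imap_unordered(fetch_and_scale, tasks))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run([("a", 1024, 768), ("b", 500, 1000)]))
```

Running more worker processes than cores is one way to keep the network busy while other workers are resizing, though the right count depends on the machine and the link.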
