Nature Dataset
This is a curated version of the iNaturalist 2017 dataset, prepared for training single image super resolution (SISR) models. The original dataset consists of 675'170 images and is 200GB in size.
There is a small version that consists of 3000 images of 512x512px and can be used to train lightweight networks such as Compact or SPAN. The average hyperiqa score of hr_small (3000 images) is 0.767434819261233.
There is also a medium version that consists of 7000 images of 512x512px and can be used to train medium or heavy networks such as RealPLKSR or RGT/DAT/ATD. The average hyperiqa score of hr_medium (7000 images) is 0.754073106459209.
The HR folder, the LRx2 and LRx4 folders, and a validation folder are provided in the Assets as zip files.
The processing steps I applied are listed below:
For the HR folder, I
- moved all images into the same folder
- removed all files that were smaller than 300kB -> 240'833 images left (from 675'170)
- tiled to 512x512
- scored all tiles with hyperiqa and removed those that scored below 0.7 -> 32'499 images left, 18GB in size
- checked all images for visual similarity and removed duplicates
- removed a lot of human hand photos (too many human hands)
- made a small version with 3k images that can be used for training lightweight SISR networks
- made a medium version with 7k images that can be used for training medium/heavy SISR networks
- normalized filenames
- optimized the PNG files with oxipng -o 4 --strip safe --alpha *.png
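The size filter, tiling and score threshold steps above can be sketched roughly as follows. This is a minimal illustration and not the script that was actually used: the helper names (`large_enough`, `tile_boxes`, `keep_by_score`) are hypothetical, "300kB" is read as 300,000 bytes, the tiling is assumed to keep only full non-overlapping 512x512 tiles and discard edge remainders, and the hyperiqa scores are assumed to have been computed separately and passed in as a plain dictionary.

```python
from pathlib import Path

MIN_BYTES = 300 * 1000  # assumption: "300kB" interpreted as 300,000 bytes
TILE = 512              # tile edge length in pixels
MIN_SCORE = 0.7         # hyperiqa threshold from the list above


def large_enough(paths, min_bytes=MIN_BYTES):
    """Keep only the files whose size on disk is at least min_bytes."""
    return [p for p in paths if Path(p).stat().st_size >= min_bytes]


def tile_boxes(width, height, tile=TILE):
    """Return (left, upper, right, lower) boxes for full, non-overlapping tiles.

    Assumption: partial tiles at the right and bottom edges are dropped.
    """
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def keep_by_score(scores, min_score=MIN_SCORE):
    """scores maps tile name -> quality score; drop entries below min_score."""
    return {name: s for name, s in scores.items() if s >= min_score}
```

The boxes use the (left, upper, right, lower) convention, so they could be passed to an image library's crop function. An image narrower or shorter than 512px yields no tiles at all under this assumption.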
For the LRx4 folder, I took the HR folder and applied
- scaling with randomized down_up (range 0.75, 1.5), linear, cubic_mitchell, lanczos, gauss and box
- slight randomized gaussian blurring
- randomized JPEG compression with quality 75 to 100
- optimized the PNG files with oxipng -o 4 --strip safe --alpha *.png
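To make the randomization above more concrete, here is a small sketch that only samples one set of degradation parameters per image; it does not process any pixels. Several things here are assumptions rather than the settings actually used: the function names are hypothetical, one scaling algorithm is assumed to be picked uniformly at random per image, the 0.75 to 1.5 range is assumed to apply only to down_up, and the blur range (`max_blur_sigma`) is a made-up placeholder, since the list only says "slight".

```python
import random

# Scaling algorithm names as given in the list above.
SCALE_FILTERS = ["down_up", "linear", "cubic_mitchell", "lanczos", "gauss", "box"]


def sample_degradation(rng, max_blur_sigma=1.0):
    """Draw one random set of degradation parameters for a single HR image.

    max_blur_sigma is a hypothetical value; the source only says "slight".
    """
    scale_filter = rng.choice(SCALE_FILTERS)  # assumption: uniform choice
    params = {
        "scale_filter": scale_filter,
        "blur_sigma": rng.uniform(0.0, max_blur_sigma),
        "jpeg_quality": rng.randint(75, 100),  # inclusive on both ends
    }
    if scale_filter == "down_up":
        # Assumption: the (0.75, 1.5) range is the intermediate resize factor.
        params["down_up_factor"] = rng.uniform(0.75, 1.5)
    return params


def lr_size(hr_size, scale):
    """Edge length of the LR image for a given HR edge length and scale."""
    return hr_size // scale
```

Passing a seeded `random.Random` instance makes the sampled parameters reproducible. For the 512x512 HR tiles, `lr_size(512, 4)` gives 128 and `lr_size(512, 2)` gives 256.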
The same approach was used for the LRx2 folder.
The corresponding zip files are in the Assets below. Since GitHub's file size limit is 2GB, HR_medium was split into 2 files.
Example of HR images from the dataset:
Example of bad images removed from the original iNaturalist 2017 dataset: