A .NET Core 3.1 console application that gathers relevant information from Booru websites to create datasets.
Powered by BooruSharp, with support for Danbooru, E621, E926, Gelbooru and many more.
BooruDatasetGatherer is a console application written in .NET Core 3.1 that aims to give the user an easy way to gather a large dataset from Booru-based APIs. With support for profiles, image downloads and collecting post information in a CSV dataset, it provides a set of tools to get you started with tagged Booru datasets for machine learning. Possible uses include object recognition, latent diffusion, local Booru web servers and other purposes that require tagged images.
Build the application by running `dotnet build BooruDatasetGatherer` inside the cloned repository, or use Visual Studio (Windows & Mac only) to compile a Debug or Release version.
This application supports a range of arguments, some of which must be passed for it to function. The various arguments and settings are explained below.
All arguments passed to the console application need to be prefixed with `--`, and strings containing whitespace need to be wrapped in double quotes (`"`) for the arguments to parse successfully.
Example: `--source safebooru --filter "1girl, confetti" --nsfw false`
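
The parsing rules above can be illustrated with a short Python sketch. This is not the application's own parser (which is written in C#); the function names `parse_arguments`, `parse_list` and `parse_bool` are hypothetical, and accepting `1`/`0` as booleans is an assumption based on the example values in the argument table.

```python
import shlex


def parse_arguments(command_line):
    """Split a command line into {name: value} pairs for `--name value` arguments."""
    # shlex.split keeps a double-quoted value with whitespace together as one token.
    tokens = shlex.split(command_line)
    arguments = {}
    index = 0
    while index < len(tokens):
        token = tokens[index]
        if not token.startswith("--"):
            raise ValueError(f"expected an argument prefixed with --, got {token!r}")
        if index + 1 >= len(tokens):
            raise ValueError(f"argument {token!r} has no value")
        arguments[token[2:]] = tokens[index + 1]
        index += 2
    return arguments


def parse_list(value):
    """Turn a comma separated string such as '1girl, confetti' into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_bool(value):
    """Accept true/false and 1/0 (assumption: mirrors the table's example values)."""
    lowered = value.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {value!r}")


args = parse_arguments('--source safebooru --filter "1girl, confetti" --nsfw false')
tags = parse_list(args["filter"])
nsfw = parse_bool(args["nsfw"])
```

Without the double quotes, `1girl,` and `confetti` would arrive as two separate tokens and the second one would be rejected because it lacks the `--` prefix.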
Argument | Data Type | Description | Default | Required |
---|---|---|---|---|
source | string | The Booru source from which to collect the dataset (`SafeBooru`, `Danbooru`, `DerpiBooru`) | | ✔️ |
filter | string[] | Comma separated tags; only posts with these tags will be collected (`eye_focus`, `nature`) | | ✖️ |
files | string[] | The file extensions that are accepted; recommended for filtering out videos (`".png, .jpg"`, `.gif`) | ".png, .jpg, jpeg, .gif, .webp" | ✖️ |
download | bool | If enabled, images that are saved in the dataset will also be saved locally (`false`, `1`) | false | ✖️ |
location | string | The location where the dataset (and, if enabled, the images) will be saved (`"C:\Dataset"`) | {Executing Folder}/Images | ✖️ |
size | int | The total size of the dataset (`1000`, `512`) | 1000 | ✖️ |
batch | int | The maximum size of a batch. The total size is split among the threads, and each thread's share is divided by the batch size into cycles (`50`, `15`) | 20 | ✖️ |
threads | byte | The number of threads across which the dataset gathering task is split (`4`, `16`) | 4 | ✖️ |
nsfw | bool | If enabled, images that contain Not Safe For Work (NSFW) content will be saved into the dataset, and downloaded if `download` is enabled (`true`, `0`) | false | ✖️ |
profile | string | The location (or the name, if located in the same folder) of the `.json` file containing a valid profile; more info in the Profiles section (`C:\booruProfile.json`, `booruProfile`) | | ✖️ |
exceptionLimit | byte | The number of exceptions that can occur inside a thread (`14`, `"20"`) | 5 | ✖️ |
username | string | The username of the Booru account that you wish to authenticate with (`"My Username"`, `my_username`) | | ✖️ |
passwordHash | string | The password used to authenticate yourself at the given source. Note: this can be your password hash, API key or a different key, depending on the Booru of your choosing. More info in the Authentication section (`"{password_hash}"`, `{api_key}`) | | ✖️ |
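
How `size`, `threads` and `batch` interact can be sketched in a few lines of Python. This only illustrates the description in the table and is not the application's actual scheduling code: the function name `plan_work` is hypothetical, and both the rounding up of cycles and the way a remainder is spread over the first threads are assumptions.

```python
import math


def plan_work(size, threads, batch):
    """Return, for each thread, the list of batch sizes it would request."""
    if size < 0 or threads < 1 or batch < 1:
        raise ValueError("size must be >= 0, threads and batch must be >= 1")
    base, remainder = divmod(size, threads)
    plan = []
    for thread_index in range(threads):
        # Assumption: leftover posts are handed to the first threads, one each.
        share = base + (1 if thread_index < remainder else 0)
        # Assumption: a final, smaller batch covers whatever does not fill a full one.
        cycles = math.ceil(share / batch)
        plan.append([min(batch, share - cycle * batch) for cycle in range(cycles)])
    return plan


default_plan = plan_work(size=1000, threads=4, batch=20)
```

Under these assumptions the default settings give each of the 4 threads 250 posts, gathered in 13 cycles: 12 full batches of 20 and one final batch of 10.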
The Examples section shows several use cases built from the arguments above.
To make it easier to repeat tasks, arguments can be stored in a `.json` file. This is called a profile, and it lets you load a fixed set of arguments into the application without passing them yourself. Any argument that is passed alongside `--profile` will be ignored.
For example, if the `booruProfile` JSON sets `size` to 20, that value overrules the raw argument that is passed in. With `--profile "booruProfile" --size 50`, the `size` will be 20, even though 50 is passed as an argument.
A profile can look like this:
```json
{
  "source": "safebooru",
  "filter": [
    "eye_focus",
    "hair_between_eyes"
  ],
  "files": [
    ".png",
    ".jpeg"
  ],
  "download": true,
  "location": "C:/Dataset",
  "size": 50,
  "batch": 5,
  "threads": 5,
  "nsfw": false
}
```
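
The precedence between a profile and raw arguments can be sketched as follows. This is an illustration in Python, not the application's C# implementation: `resolve_settings` and `DEFAULTS` are hypothetical names, the default values are copied from the argument table, and falling back to those defaults for keys missing from the profile is an assumption.

```python
import json

# Defaults taken from the argument table; arguments without a stated default are omitted.
DEFAULTS = {
    "download": False,
    "size": 1000,
    "batch": 20,
    "threads": 4,
    "nsfw": False,
    "exceptionLimit": 5,
}


def resolve_settings(cli_arguments, profile_text=None):
    """Combine defaults with either a profile or the raw command line arguments."""
    settings = dict(DEFAULTS)
    if profile_text is not None:
        # A profile is present, so raw arguments are ignored entirely.
        settings.update(json.loads(profile_text))
    else:
        settings.update(cli_arguments)
    if "source" not in settings:
        raise ValueError("source is required")
    return settings


profile = '{"source": "safebooru", "size": 20}'
resolved = resolve_settings({"size": 50}, profile)
```

Here `resolved["size"]` is 20 rather than 50, which matches the behaviour described for `--profile "booruProfile" --size 50`.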
A set of examples to help you get started with this tool. :warning: Caution: all examples contain only the parameters. How the application is launched differs from platform to platform.
--source safebooru
--source safebooru --filter "1girl, eye_focus"
--source safebooru --filter "-1girl, 1boy, -galaxy"
--source safebooru --size 2000 --batch 50
--profile "C:/Profiles/safeBooruProfile.json"
--profile "safeBooruProfile"
--source safebooru --username "{username}" --passwordHash "{password_hash}"
--source safebooru --filter "1boy, confetti, celebrating" --size 1500 --batch 25 --threads 16 --download true --nsfw false --location "C:/Dataset" --files ".jpg, .jpeg"
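
One of the examples above passes tags with a leading `-`. The text does not state what this prefix does; the Python sketch below assumes it follows the common Booru search convention of excluding a tag. The helper names `split_filter` and `matches` are hypothetical, and the matching may in practice be done by the Booru itself rather than by the application.

```python
def split_filter(filter_value):
    """Split a comma separated filter into (included, excluded) tag lists."""
    include, exclude = [], []
    for raw in filter_value.split(","):
        tag = raw.strip()
        if not tag:
            continue
        # Assumption: a leading '-' marks a tag that must NOT be on the post.
        if tag.startswith("-") and len(tag) > 1:
            exclude.append(tag[1:])
        else:
            include.append(tag)
    return include, exclude


def matches(post_tags, include, exclude):
    """True when the post has every included tag and none of the excluded ones."""
    tags = set(post_tags)
    return all(tag in tags for tag in include) and not any(tag in tags for tag in exclude)


include, exclude = split_filter("-1girl, 1boy, -galaxy")
```

With this reading, `--filter "-1girl, 1boy, -galaxy"` would collect posts tagged `1boy` that carry neither `1girl` nor `galaxy`.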
Most of the Boorus that BooruSharp supports are available, but only those that support fetching random images. Because this application uses BooruSharp's generic `ABooru` class, only an abstract set of functions is available. That class has no abstract function for fetching posts with pagination, so the application relies on the functions for fetching random images. The complete list of supported Boorus is as follows:
Supported Boorus |
---|
Atfbooru |
Danbooru Donmai |
Derpibooru |
E621 |
E926 |
Gelbooru |
Konachan |
Lolibooru |
Ponybooru |
Realbooru |
Sakugabooru |
Sankaku Complex |
Twibooru |
Yandere |
The output is generic and has the same structure for every Booru, although some fields may be empty. The dataset output contains these values:
FILEURL, PREVIEWURL, POSTURL, SAMPLEURI, RATING, TAGS, ID, HEIGHT, WIDTH, PREVIEWHEIGHT, PREVIEWWIDTH, CREATION, SOURCE, SCORE, MD5, LOCATION
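
Once a dataset has been gathered, the CSV can be consumed with standard tooling. The Python sketch below counts how often each tag occurs. It assumes the file is comma delimited with the names above as its header row, and that the `TAGS` field separates tags with a space. Neither detail is confirmed here, so check a real export first. The function name `count_tags` and the two sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_tags(dataset, tag_separator=" "):
    """Count tag occurrences in an open CSV dataset (any text file object)."""
    counts = Counter()
    for row in csv.DictReader(dataset):
        # Fields may be empty, so fall back to an empty string.
        tags = row.get("TAGS") or ""
        counts.update(tag for tag in tags.split(tag_separator) if tag)
    return counts


# Made-up sample with only a few of the columns, standing in for a real export.
sample = io.StringIO(
    "ID,TAGS,RATING\n"
    "1,1girl confetti,Safe\n"
    "2,1girl,Safe\n"
    "3,,Safe\n"
)
tag_counts = count_tags(sample)
```

For a real export, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.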
Since BooruSharp supports authentication, this application supports it as well. By passing `--username` and `--passwordHash`, you can authenticate yourself with the given Booru source. For the different authentication methods and which Boorus support them, consult the Authentication section of the BooruSharp `README.md`.
Since this application is made to gather information, a few things are on the roadmap to further enhance that core feature:
- Support for custom Boorus
- Defining the fields that get exported to the dataset
- Support for Boorus that don't support random posts