
WIP: Multi-dataset sampler #235

Merged: 22 commits merged into main on Dec 25, 2023
Conversation

bghira (Owner) commented on Dec 12, 2023

Changes

New behaviour

  • When a dataset config entry sets "scan_for_errors": true, the dataset is read in its entirety at startup; any bad images are removed if "delete_problematic_images": true is also set, and any outdated cache entries are removed. A minimal sketch of such an entry follows this list.
  • Datasets are now defined by a config file, and this file is mandatory. A dataset can be removed from training by setting "disabled": true in its config entry.
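
For illustration, a minimal sketch of what a dataset entry might look like in multidatabackend.json. Apart from "scan_for_errors", "delete_problematic_images", and "disabled", the field names here are assumptions for illustration, not values confirmed by this PR; see multidatabackend.json.example for the real format.

```json
[
  {
    "id": "photos-local",
    "type": "local",
    "instance_data_dir": "/training/photos",
    "scan_for_errors": true,
    "delete_problematic_images": true
  },
  {
    "id": "old-dataset",
    "type": "local",
    "instance_data_dir": "/training/old",
    "disabled": true
  }
]
```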

Removed arguments

  • All of the --aws_* command-line arguments were removed for privacy reasons; these values now live in multidatabackend.json (a sketch of the conversion follows this list).
  • --data_backend is now --data_backend_config and takes a path to a dataset config file. See multidatabackend.json.example for help converting your existing configuration over.
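
As a rough sketch of the conversion, an old invocation such as --data_backend=aws --aws_bucket_name=my-bucket would become an entry in multidatabackend.json along these lines. The key names are assumed to mirror the old flag names; consult multidatabackend.json.example for the authoritative list.

```json
[
  {
    "id": "aws-dataset",
    "type": "aws",
    "aws_bucket_name": "my-bucket",
    "aws_region_name": "us-east-1"
  }
]
```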

New arguments

--data_backend_config

  • What: Path to your SimpleTuner dataset configuration, set as DATALOADER_CONFIG in sdxl-env.sh.
  • Why: Multiple datasets on different storage media may be combined into a single training session.
  • Example: See [multidatabackend.json.example](/multidatabackend.json.example) for an example configuration, and the snippet below for wiring it up.
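
For example, in sdxl-env.sh (the path shown is illustrative):

```bash
# sdxl-env.sh — point SimpleTuner at the dataset configuration file
export DATALOADER_CONFIG="/workspace/SimpleTuner/multidatabackend.json"
```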

--override_dataset_config

  • What: When provided, allows SimpleTuner to ignore differences between the config cached inside the dataset and the current values (see the usage sketch below).
  • Why: When SimpleTuner is run for the first time on a dataset, it creates a cache document containing information about everything in that dataset, including the dataset config and its "crop"- and "resolution"-related values. Changing these arbitrarily or by accident could cause your training jobs to crash randomly, so it is highly recommended not to use this parameter; instead, resolve the differences you'd like to apply in your dataset some other way.
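
If you do need the override, it is passed as a normal flag at launch; a sketch, with the trainer script name assumed rather than taken from this PR:

```bash
# Tell SimpleTuner to ignore mismatches between the cached dataset config
# and the current one — use only when you understand the consequences.
python train_sdxl.py \
  --data_backend_config=multidatabackend.json \
  --override_dataset_config
```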

--vae_cache_behaviour

  • What: Configure the behaviour of the integrity scan check.

  • Why: A dataset could have incorrect settings applied at multiple points of training, e.g. if you accidentally delete the .json cache files from your dataset and switch the data backend config to use square images rather than aspect-crops. This will result in an inconsistent data cache, which can be corrected by setting scan_for_errors to true in your multidatabackend.json configuration file. When this scan runs, it relies on the value of --vae_cache_behaviour to determine how to resolve the inconsistency: recreate (the default) will remove the offending cache entry so that it can be recreated, while sync will update the bucket metadata to reflect the reality of the training sample. Recommended value: recreate. A sketch of both modes follows.
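
A sketch of the two modes (trainer script name assumed rather than taken from this PR):

```bash
# recreate (default): drop inconsistent VAE cache entries so they are rebuilt.
python train_sdxl.py --data_backend_config=multidatabackend.json --vae_cache_behaviour=recreate

# sync: keep the cache and update the bucket metadata to match the real samples.
python train_sdxl.py --data_backend_config=multidatabackend.json --vae_cache_behaviour=sync
```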

Task checklist

  • data backend init function
  • add the identifier field to all components
    • vaecache
    • bucket_manager
    • sampler
    • train_dataset
  • convert static BucketManager calls to concrete object methods where at all possible
  • ensure VAE caching occurs in sequence across each dataset
  • ensure text embeds end up in a single folder, since they are reusable
  • checks for valid multi-dataset settings
  • load multiple local datasets correctly
  • load multiple AWS datasets correctly
  • combine local and AWS datasets
  • ensure that crop parameter overrides are correctly used for each dataset

@bghira force-pushed the feature/multi-dataset-sampler branch from 1f010c4 to febb865 on December 24, 2023 at 19:32
@bghira merged commit 0e91af2 into main on Dec 25, 2023
1 check passed