
WIP: Multi-dataset sampler #235

Merged: 22 commits merged into main on Dec 25, 2023
Conversation

bghira (Owner) commented on Dec 12, 2023

Changes

New behaviour

  • When a dataset config entry sets "scan_for_errors": true, the dataset is read in its entirety at startup; any bad images are removed if "delete_problematic_images": true is also set, and any outdated cache entries are removed. A minimal sketch of such an entry follows this list.
  • Datasets are now defined by a config file, and this file is mandatory. A dataset can be removed from training by setting "disabled": true in its config entry.
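
For illustration, a minimal sketch of what a dataset entry might look like in multidatabackend.json. Apart from "scan_for_errors", "delete_problematic_images", and "disabled", the field names here are assumptions for illustration, not values confirmed by this PR; see multidatabackend.json.example for the real format.

```json
[
  {
    "id": "photos-local",
    "type": "local",
    "instance_data_dir": "/training/photos",
    "scan_for_errors": true,
    "delete_problematic_images": true
  },
  {
    "id": "old-dataset",
    "type": "local",
    "instance_data_dir": "/training/old",
    "disabled": true
  }
]
```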

Removed arguments

  • All of the --aws_* command-line arguments were removed for privacy reasons; these values now live in multidatabackend.json (a sketch of the conversion follows this list).
  • --data_backend is now --data_backend_config and takes a path to a dataset config file. See multidatabackend.json.example for help converting your existing configuration over.
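
As a rough sketch of the conversion, an old invocation such as --data_backend=aws --aws_bucket_name=my-bucket would become an entry in multidatabackend.json along these lines. The key names are assumed to mirror the old flag names; consult multidatabackend.json.example for the authoritative list.

```json
[
  {
    "id": "aws-dataset",
    "type": "aws",
    "aws_bucket_name": "my-bucket",
    "aws_region_name": "us-east-1"
  }
]
```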

New arguments

--data_backend_config

  • What: Path to your SimpleTuner dataset configuration, set as DATALOADER_CONFIG in sdxl-env.sh.
  • Why: Multiple datasets on different storage media may be combined into a single training session.
  • Example: See [multidatabackend.json.example](/multidatabackend.json.example) for an example configuration, and the snippet below for wiring it up.
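
For example, in sdxl-env.sh (the path shown is illustrative):

```bash
# sdxl-env.sh — point SimpleTuner at the dataset configuration file
export DATALOADER_CONFIG="/workspace/SimpleTuner/multidatabackend.json"
```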

--override_dataset_config

  • What: When provided, allows SimpleTuner to ignore differences between the config cached inside the dataset and the current values (see the usage sketch below).
  • Why: When SimpleTuner is run for the first time on a dataset, it creates a cache document containing information about everything in that dataset, including the dataset config and its "crop"- and "resolution"-related values. Changing these arbitrarily or by accident could cause your training jobs to crash randomly, so it is highly recommended not to use this parameter; instead, resolve the differences you'd like to apply in your dataset some other way.
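
If you do need the override, it is passed as a normal flag at launch; a sketch, with the trainer script name assumed rather than taken from this PR:

```bash
# Tell SimpleTuner to ignore mismatches between the cached dataset config
# and the current one — use only when you understand the consequences.
python train_sdxl.py \
  --data_backend_config=multidatabackend.json \
  --override_dataset_config
```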

--vae_cache_behaviour

  • What: Configure the behaviour of the integrity scan check.

  • Why: A dataset could have incorrect settings applied at multiple points of training, e.g. if you accidentally delete the .json cache files from your dataset and switch the data backend config to use square images rather than aspect-crops. This will result in an inconsistent data cache, which can be corrected by setting scan_for_errors to true in your multidatabackend.json configuration file. When this scan runs, it relies on the value of --vae_cache_behaviour to determine how to resolve the inconsistency: recreate (the default) will remove the offending cache entry so that it can be recreated, while sync will update the bucket metadata to reflect the reality of the training sample. Recommended value: recreate. A sketch of both modes follows.
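
A sketch of the two modes (trainer script name assumed rather than taken from this PR):

```bash
# recreate (default): drop inconsistent VAE cache entries so they are rebuilt.
python train_sdxl.py --data_backend_config=multidatabackend.json --vae_cache_behaviour=recreate

# sync: keep the cache and update the bucket metadata to match the real samples.
python train_sdxl.py --data_backend_config=multidatabackend.json --vae_cache_behaviour=sync
```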

Task checklist

  • data backend init function
  • add the identifier field to all components
    • vaecache
    • bucket_manager
    • sampler
    • train_dataset
  • convert static BucketManager calls to concrete object methods where at all possible
  • ensure VAE caching occurs in sequence across each dataset
  • ensure text embeds end up in a single folder, since they are reusable
  • checks for valid multi-dataset settings
  • load multiple local datasets correctly
  • load multiple AWS datasets correctly
  • combine local and AWS datasets
  • ensure that crop parameter overrides are correctly used for each dataset

@bghira force-pushed the feature/multi-dataset-sampler branch from 1f010c4 to febb865 on December 24, 2023 at 19:32
@bghira merged commit 0e91af2 into main on Dec 25, 2023
1 check passed