WIP: Multi-dataset sampler #235
Merged
bghira force-pushed the feature/multi-dataset-sampler branch from 1f010c4 to febb865 on December 24, 2023 19:32
Changes

New behaviour
With `"scan_for_errors": true`, the dataset will be read entirely at startup and any bad images will be removed if `delete_problematic_images: true`. It will remove any outdated cache entries.
`"disabled": true` in the dataset config entry.

Removed arguments
`multidatabackend.json`
`--data_backend` is now `--data_backend_config` and is a path to a dataset config; see `multidatabackend.json.example` for help converting your existing configurations over.

New arguments
`--data_backend_config`
`DATALOADER_CONFIG` in `sdxl-env.sh`
`--override_dataset_config`
`--vae_cache_behaviour`
What: Configure the behaviour of the integrity scan check.
Why: A dataset could have incorrect settings applied at multiple points of training, e.g. if you accidentally delete the `.json` cache files from your dataset and switch the data backend config to use square images rather than aspect-crops. This will result in an inconsistent data cache, which can be corrected by setting `scan_for_errors` to `true` in your `multidatabackend.json` configuration file. When this scan runs, it relies on the setting of `--vae_cache_behaviour` to determine how to resolve the inconsistency: `recreate` (the default) will remove the offending cache entry so that it can be recreated, and `sync` will update the bucket metadata to reflect the reality of the real training sample. Recommended value: `recreate`.

data backend init function
add the identifier field to all components
convert static BucketManager calls to concrete object methods where at all possible
ensure VAE caching occurs in sequence across each dataset
ensure text embeds end up in a single folder, since they are reusable
checks for valid multi-dataset settings
load multiple local datasets correctly
load multiple AWS datasets correctly
combining local and AWS datasets
ensure that crop parameter overrides are correctly used for each dataset
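
For reviewers who want a concrete picture of the scan behaviour described under `--vae_cache_behaviour`, here is a minimal Python sketch. It is not the code in this PR. The `id` key, the function names, and the dict/set stand-ins for bucket metadata and the VAE cache are all assumptions made for illustration; only `scan_for_errors`, `disabled`, `recreate` and `sync` come from the description above.

```python
# Hypothetical dataset entries using keys named in the PR description.
# "id" is an assumed name for the identifier field.
DATASETS = [
    {"id": "local-square", "scan_for_errors": True},
    {"id": "s3-archive", "disabled": True, "scan_for_errors": True},
    {"id": "local-wide"},
]


def datasets_to_scan(entries):
    """Return ids of enabled datasets that asked for an integrity scan."""
    return [
        entry["id"]
        for entry in entries
        if not entry.get("disabled", False) and entry.get("scan_for_errors", False)
    ]


def resolve_inconsistency(sample, actual_bucket, bucket_metadata, vae_cache,
                          behaviour="recreate"):
    """Resolve one mismatch between recorded and actual aspect bucket.

    bucket_metadata: dict mapping sample path -> recorded bucket (assumed shape)
    vae_cache: set of sample paths that have a cached latent (assumed shape)
    """
    if bucket_metadata.get(sample) == actual_bucket:
        return "ok"  # nothing to fix
    if behaviour == "recreate":
        # Drop the offending cache entry so it can be regenerated later.
        vae_cache.discard(sample)
        return "removed"
    if behaviour == "sync":
        # Trust the real sample and update the recorded bucket instead.
        bucket_metadata[sample] = actual_bucket
        return "synced"
    raise ValueError(f"unknown vae_cache_behaviour: {behaviour!r}")


if __name__ == "__main__":
    print(datasets_to_scan(DATASETS))
    metadata = {"a.png": "1.0"}
    cache = {"a.png"}
    print(resolve_inconsistency("a.png", "1.5", metadata, cache))
```

In this simplified model, `recreate` leaves the metadata alone and only invalidates the cached entry, while `sync` keeps the cache and rewrites the metadata. The real scan may do more than this.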