Implement Test and Validation Set Loss #1897

stepfunction83 · 2025-01-24T15:06:40Z

I propose adding a tracker that records a stable loss measurement, as described here: https://github.com/spacepxl/demystifying-sd-finetuning

Effectively, at regular intervals during training, a preselected image (or, I imagine, a batch of images) and a preselected noise seed are used to calculate the loss. Because the inputs and the noise are the same at every evaluation, changes in the recorded loss reflect changes in the model rather than sampling variance, so the curve accurately shows the progress of the training run.
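A minimal sketch of the idea, in plain Python rather than the actual training code: the noise and timestep for each evaluation image are drawn once from a fixed seed and reused at every evaluation. The names `make_fixed_draws`, `stable_loss`, and `toy_model`, the 1000-step timestep range, and the simplified additive noising are all illustrative assumptions, not the method used in the linked repository.

```python
import random

NUM_TIMESTEPS = 1000  # assumed schedule length, for illustration only


def make_fixed_draws(num_images, dim, seed):
    """Draw one (noise, timestep) pair per image, once, from a fixed seed."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.randrange(NUM_TIMESTEPS))
        for _ in range(num_images)
    ]


def stable_loss(predict_noise, images, fixed_draws):
    """Mean squared error of the noise prediction over the fixed draws."""
    total, count = 0.0, 0
    for image, (noise, timestep) in zip(images, fixed_draws):
        # Simplified noising: a real scheduler would scale both terms by timestep.
        noisy = [x + n for x, n in zip(image, noise)]
        predicted = predict_noise(noisy, timestep)
        total += sum((p - n) ** 2 for p, n in zip(predicted, noise))
        count += len(noise)
    return total / count


def toy_model(weight):
    """Hypothetical stand-in for the network: returns a fraction of its input."""
    return lambda noisy, timestep: [weight * v for v in noisy]


images = [[0.0] * 8 for _ in range(4)]  # preselected images (all zeros here)
draws = make_fixed_draws(len(images), 8, seed=1234)

# Two "checkpoints" evaluated on identical images, noise, and timesteps.
loss_early = stable_loss(toy_model(0.2), images, draws)
loss_late = stable_loss(toy_model(0.9), images, draws)
```

Since the draws never change, re-evaluating the same checkpoint returns exactly the same number, and any difference between `loss_early` and `loss_late` comes from the model alone.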

Adding a holdout set for calculating validation loss would also make it possible to identify the point at which the model begins to overtrain.

stepfunction83 · 2025-01-24T18:39:22Z

This could be implemented as part of the standard train loop: a test_dataloader would be built from predetermined batches of the train_dataloader, and a val_dataloader would be created either by removing samples from the train_dataloader (much easier) or by loading them from a separate directory (much harder).

Options would then specify the number of items to include in each set, as well as the number of noise/timestep iterations to perform per image:

--test_set_count 10 (Automatically create a test set using a specified number of images from the training set. These are not set aside and continue to be used for training.)
--val_set_count 10 (Automatically create a holdout set for validation using a specified number of images from the training set. These are set aside and not included in training.)

It would be substantially easier to split a validation set out of the train_dataloader automatically than to let the user specify one. Automatic selection reuses the existing cache generation and avoids the extra work of loading and caching an additional directory of images and captions.
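The automatic split could look roughly like the following sketch, which works on sample indices only. The function name `split_indices` and the `seed` parameter are hypothetical; the two counts correspond to the proposed --test_set_count and --val_set_count options.

```python
import random


def split_indices(num_samples, test_set_count, val_set_count, seed=0):
    """Return (train, test, val) index lists for a dataset of num_samples items.

    val indices are held out of training; test indices stay in the training set.
    """
    if test_set_count + val_set_count > num_samples:
        raise ValueError("test and val counts exceed the dataset size")
    rng = random.Random(seed)  # fixed seed so the split is identical every run
    shuffled = list(range(num_samples))
    rng.shuffle(shuffled)
    val = sorted(shuffled[:val_set_count])
    train = sorted(shuffled[val_set_count:])
    # The test set is a fixed subset of the samples that remain in training.
    test = sorted(rng.sample(train, test_set_count))
    return train, test, val


train_idx, test_idx, val_idx = split_indices(
    100, test_set_count=10, val_set_count=10, seed=42
)
```

In a PyTorch setting, index lists like these could be passed to `torch.utils.data.Subset` to build the three loaders from one already-cached dataset, though how that fits the existing dataset classes is not covered here.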

Finally, a frequency should be specified to run loss calculations on these sets:

--test_val_loss_freq_steps 50 (calculate test/val loss every 50 steps)
--test_val_loss_freq_epochs 1 (calculate test/val loss every epoch)
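As a rough sketch of how the two frequency options might be checked inside the train loop: the helper `should_run_eval` below is hypothetical, and it assumes that either option may be set on its own and that both triggers apply when both are set. The proposal itself does not say how the two should combine.

```python
def should_run_eval(global_step, is_epoch_end, epoch, freq_steps=None, freq_epochs=None):
    """Decide whether to compute test/val loss at this point in training.

    global_step and epoch are 1-based counts of completed steps and epochs.
    """
    if freq_steps and global_step % freq_steps == 0:
        return True
    if freq_epochs and is_epoch_end and epoch % freq_epochs == 0:
        return True
    return False


# Simulate two epochs of 120 steps with both options set.
steps_per_epoch = 120
eval_steps = []
global_step = 0
for epoch in range(1, 3):
    for step_in_epoch in range(1, steps_per_epoch + 1):
        global_step += 1
        is_epoch_end = step_in_epoch == steps_per_epoch
        if should_run_eval(global_step, is_epoch_end, epoch, freq_steps=50, freq_epochs=1):
            eval_steps.append(global_step)

print(eval_steps)  # [50, 100, 120, 150, 200, 240]
```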

The results of these calculations should be logged to the standard log and to wandb.

For an initial implementation, computing a test loss on predetermined entries from the train_dataloader would be simplest. This would involve adding:

--test_set_count 10
--test_val_loss_freq_steps 50 (calculate test/val loss every 50 steps)

Followed by:

A val_dataloader that loads samples held out from the dataset, specified separately:
--val_set_count 10
--test_val_loss_freq_epochs 1 (calculate test/val loss every epoch)

Finally, deferred to last because of the added complexity of ingesting a separate directory:

--val_dataloader_dir "~/val_dataset_dir/" (To override --val_set_count if provided, and load a directory of validation images)
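Taken together, the proposed options and the override rule could be sketched with `argparse` as below. The flag names come from this proposal; the types, defaults, help strings, and the helper `resolve_val_source` are assumptions for illustration and do not describe the merged implementation.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(description="test/val loss options (sketch)")
    parser.add_argument("--test_set_count", type=int, default=0,
                        help="training images also used for the test loss")
    parser.add_argument("--val_set_count", type=int, default=0,
                        help="training images held out for the validation loss")
    parser.add_argument("--test_val_loss_freq_steps", type=int, default=None,
                        help="calculate test/val loss every N steps")
    parser.add_argument("--test_val_loss_freq_epochs", type=int, default=None,
                        help="calculate test/val loss every N epochs")
    parser.add_argument("--val_dataloader_dir", type=str, default=None,
                        help="directory of validation images; overrides --val_set_count")
    return parser


def resolve_val_source(args):
    """Return ("dir", path), ("split", count), or ("none", None)."""
    if args.val_dataloader_dir:
        # A directory, when given, takes precedence over the automatic split.
        return ("dir", os.path.expanduser(args.val_dataloader_dir))
    if args.val_set_count > 0:
        return ("split", args.val_set_count)
    return ("none", None)


args = build_parser().parse_args(
    ["--test_set_count", "10", "--val_set_count", "10",
     "--test_val_loss_freq_steps", "50", "--val_dataloader_dir", "/data/val"]
)
val_source = resolve_val_source(args)  # ("dir", "/data/val")
```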

stepfunction83 · 2025-01-26T18:18:57Z

Implemented in #1899. I am working on a number of enhancements to it in #1900.

stepfunction83 changed the title ~~Implement Stable Loss Calculation for Run Tracking~~ Implement Test and Validation Set Loss Jan 25, 2025

stepfunction83 mentioned this issue Jan 26, 2025

Implement Test and Validation Loss for Flux Finetuning and LoRA Training #1898

Closed
