ELI5 Training

Preamble

These settings are a starting point. Optimal settings will vary between datasets and environments. Getting a great model may take multiple attempts. This guide covers the basics. We'll probably add a separate page to describe the more advanced options.

0. Installation

See the Installation page.

1. Create a dataset

This is the most important step: your dreambooth settings do not matter if you have a bad dataset.

Training Images (instance pics)

What makes a good training image?

  • High resolution
  • Unobstructed view of the training subject
  • Simple composition
  • Variety - If you are training a person, more background/lighting/clothing/facial expression/pose variations are better.

What makes bad training images?

  • Low resolution - Do not use images smaller than the training resolution. Do not use images with visible compression.
  • Bad cropping - if your inputs are close-ups, the model outputs will be close-ups
  • Multiple subjects
  • Duplicates or high similarity
  • Images where the subject is not the main focus

DreamBooth will "learn" the whole image (objects, colors, sharpness, background, etc). You want to minimize repeated elements that are not your subject across the dataset.

Bucketing

You can use any resolution and any image format supported by the Pillow library (which is most formats). This extension will automatically scale your images to the Resolution set on the Settings tab of the Input column. The extension might slightly crop them to a nearby resolution to create buckets, for performance and quality reasons. You can preview the output of this process using the Bucket Cropping section on the Testing tab of the Input column.
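
For intuition, here is a minimal sketch of what aspect-ratio bucketing does, assuming a hypothetical list of bucket sizes around a 512px training resolution (the extension's actual bucket list and cropping code may differ):

```python
# Illustrative aspect-ratio bucketing, NOT the extension's exact code.
from PIL import Image

# Hypothetical buckets around a 512px training resolution.
BUCKETS = [(512, 512), (448, 576), (576, 448), (384, 640), (640, 384)]

def fit_to_bucket(path: str) -> Image.Image:
    img = Image.open(path)
    aspect = img.width / img.height
    # Pick the bucket whose aspect ratio is closest to the image's.
    bw, bh = min(BUCKETS, key=lambda b: abs(b[0] / b[1] - aspect))
    # Scale so the image covers the bucket, then center-crop the excess.
    scale = max(bw / img.width, bh / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left, top = (img.width - bw) // 2, (img.height - bh) // 2
    return img.crop((left, top, left + bw, top + bh))
```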

Captioning

If you train with captions, dreambooth will associate the dataset images with those exact captions, hence you will need to use the same words to evoke the trained content from the output model. As an example, let's say I have captions like 1girl, hat, 1girl, ball, 1girl, desk. The images in the dataset will become highly linked to the word 1girl because it is in every caption. This is good if I plan to prompt the trained model with captions including 1girl. However, captions like woman, holding drink will have little resemblance to the dataset, because the words do not overlap. This is the reason you should not use captioning if you are training a style. This is also the reason why you should use captioning when training a specific subject, because it will limit the trained content to specific words (like your instance token).

To caption your dataset, create text files that match the associated photo names (e.g. dog1.png & dog1.txt) and put your caption in the text file. You can do this manually or use an automated tool like stable-diffusion-webui-wd14-tagger. Anime taggers can work very well on non-anime images, but keep the previous paragraph in mind - you will need to use the anime tags to evoke the trained content from the output model.
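
If you caption by hand, a small script can verify that every image has a matching caption file. This is just a convenience sketch (the folder name and extensions are examples), not part of the extension:

```python
# Convenience sketch: list each image in a dataset folder alongside its
# caption, flagging any image that is missing its .txt file.
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def check_captions(dataset_dir: str) -> None:
    for img in sorted(Path(dataset_dir).iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            caption = img.with_suffix(".txt")
            if caption.exists():
                print(f"{img.name}: {caption.read_text().strip()}")
            else:
                print(f"{img.name}: MISSING CAPTION")

check_captions("datasets/dog")  # expects pairs like dog1.png & dog1.txt
```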

2. Model column

Create tab

Go to the Left Model column, Create tab. Set the Name of the model you are creating (this can be anything and is unrelated to the captions you may be training). Set your Source Checkpoint. This is the source model your custom model is branching from. If you are using a 768px model, uncheck the 512x model box. You can leave the other settings untouched. Hit Create Model. The new dreambooth model will be created, added to the Model dropdown on the Select tab, and automatically selected.

Select tab

The Model dropdown is where you select which model to train. Picking a model from the Model dropdown will update the fields of the Model column with information like its epoch and source. It will not automatically update the other UI fields. To load the settings from a previous dreambooth model, select the previous model in the Model dropdown, then click Load Settings at the top.

3. Input column

Settings tab

The Performance Wizard will automatically configure some fields for you (but it's not perfect).

General

You can select alternative training methods in this section, like LoRa. The other methods allow you to train on weaker devices, but with somewhat reduced quality. The available methods (in order of best-to-worst quality) are:

  • Dreambooth (default, requires >= 10GB VRAM)
  • LoRa Extended (requires >= 8GB VRAM)
  • LoRa (requires >= 6GB VRAM)
  • Imagic (???)

Intervals

Training Steps Per Image (Epochs) is the number of times each image in the training and class datasets will be trained on. 100 is the recommended default. You can think of dreambooth training as being analogous to cooking, where Learning Rate (LR) is the temperature and the # of Epochs is the time. If you go too fast, you will burn your model. If you go too slow, nothing will train. If you want to maintain a similar level of trained-ness and you double either the LR or the # of Epochs, you should halve the other.
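
As a rough illustration of that tradeoff (an approximation, not an exact law):

```python
# Rough rule of thumb from the cooking analogy: trained-ness ~ LR * epochs.
# These numbers are illustrative, not guarantees.
base_lr, base_epochs = 2e-6, 100          # the defaults suggested in this guide
hotter  = (base_lr * 2, base_epochs // 2) # 4e-6 for 50 epochs: similar result, faster, riskier
gentler = (base_lr / 2, base_epochs * 2)  # 1e-6 for 200 epochs: similar result, slower, safer
print(hotter, gentler)
```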

Set Amount of time to pause between Epochs to 0. Save Model Frequency is how often a new model will be created during training. Save Preview(s) Frequency (Epochs) is how often sample images will be generated during training.

Batching

For beginners, the Performance Wizard should handle most of this for you. Batch Size is how many dataset images are trained on simultaneously. Increasing it will increase training speed at the cost of increased VRAM usage. Gradient Accumulation Steps is for advanced users. Class Batch Size is the batch size for generating class images (similar to txt2img). It does not impact the actual training process. Enable Set Gradients to None When Zeroing and Gradient Checkpointing.

Learning Rate

A Learning Rate of 2e-6 is a good place to start. Increase to train faster at potentially reduced quality and vice versa. Lowering LR has diminishing returns, so don't bother setting LR below 1e-6 with the default scheduler/attention/optimizers.

LoRa uses a separate set of Learning Rate fields because the LR values are much higher for LoRa than normal dreambooth. For LoRa, the LR defaults are 1e-4 for UNET and 5e-5 for Text.

The LR Scheduler settings allow you to control how LR changes during training. The default is constant_with_warmup with 0 warmup steps. For beginners, I'd recommend setting the number of warmup steps to 500. More advanced dreambooth users may find benefits from other schedulers and settings.
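
For intuition, constant_with_warmup ramps the LR linearly from 0 up to your base LR over the warmup steps, then holds it constant. Here is a sketch of that shape (mirroring the behavior of the scheduler of that name in the diffusers library):

```python
# Shape of constant_with_warmup: linear ramp, then flat.
def lr_at_step(step: int, base_lr: float = 2e-6, warmup_steps: int = 500) -> float:
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)  # linear warmup from 0
    return base_lr                                     # constant afterwards

for s in (0, 250, 500, 5000):
    print(s, lr_at_step(s))
```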

Image Processing

New users should leave these unchanged.

Tuning

For beginners, the Performance Wizard should handle most of this for you (except for Text Encoder). However, you may want to change:

  • Memory Attention = Default (if you are using Torch2 or have >= 16GB VRAM)
  • ✅ Cache Latents (basically always)
  • Step Ratio of Text Encoder Training = 0 (text encoder has some quirks, see the advanced guide)

Prior Loss Weight is very important if you are using class images. If you are not using class images, Prior Loss Weight does nothing. It determines how much weight is given to the class images (higher value = more class weight). If you aren't getting enough training, reduce Prior Loss Weight. If your concept is "bleeding" beyond the desired tokens, increase Prior Loss Weight.
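
In the standard dreambooth formulation (a simplified sketch; the extension's internals may differ), Prior Loss Weight scales the class-image half of the loss:

```python
# Simplified sketch of how Prior Loss Weight enters the training objective.
def total_loss(instance_loss: float, class_loss: float, prior_loss_weight: float) -> float:
    # instance_loss: reconstruction error on your training images
    # class_loss: reconstruction error on the generated class images
    return instance_loss + prior_loss_weight * class_loss

print(total_loss(0.12, 0.30, 1.0))  # class images weigh in fully
print(total_loss(0.12, 0.30, 0.5))  # more emphasis on your subject
```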

Advanced

New users should leave these unchanged.

Concepts tab

Directories

Set the Dataset Directory to the path of your training images. Classification Dataset Directory can be left blank unless you want to reuse previous class images.

If you are training a style, you can skip the rest of this tab.

Training Prompts

This section is after Filewords in the UI, but is listed first in this guide for explanation reasons. If you would like to learn more about Class Images/Prompts, see the Class Explained page. If you are not using class images, you do not need to set a Class Prompt or Class Token.

Instance Prompt is a description of the instance images (the concept you are training). If you are using captions (described above), you can put [filewords] and the instance prompt will be substituted for the captions. For example, if I am training on images of a cat named Rufus, I may put photo of Rufus. Or even better, I could put [filewords] for the Instance Prompt, and use captions like close up photo of Rufus and Rufus wearing a cowboy hat. Class Prompt is a description of the images excluding your concept (this will make more sense after reading the Filewords section).

You can put [filewords] for the Class Prompt and your captions will be used to generate class images. In my example, Rufus is a cat, so my class prompt may be photo of a cat. Don't overthink the class, close enough is good enough. Classification Image Negative Prompt is only used for class image generation (unlike the other two prompts, it is not fed into dreambooth). It works like the usual txt2img negative prompts. For example, you may put worst quality, low quality.

Filewords

Skip this section if you are not using [filewords] in your Instance Prompt or Class Prompt.

The Instance and Class Tokens are mixed with your prompts. Conventionally, the Instance Token is what you will use to evoke the trained concept once you are done training, and the Class Token is a 1-or-2 word description of the concept's class.

In my previous example, the Instance Token would be Rufus and the Class Token would be cat. These Tokens are mixed with your Instance Prompt and Class Prompt. See the example below.

Imagine I have 3 images of Rufus with the captions:

(1) A photo of Rufus dancing, (2) Close up, bedroom background, (3) Rufus the cat

One configuration could be:

  • Instance Token = Rufus
  • Class Token = cat
  • Instance Prompt = [filewords]
  • Class Prompt = [filewords]

which would feed the instance images into dreambooth as:

(1) A photo of Rufus cat dancing, (2) Rufus cat, Close up, bedroom background, (3) Rufus the cat

and generate class images using:

(1) A photo of cat dancing, (2) cat, Close up, bedroom background, (3) the cat

NOTE: Your Instance Token (like all tokens) will inherit its value from the source model. For this reason, an Instance Token like Sarah is not recommended and unique "unclaimed" tokens like ohwx are preferred.
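
Here is a small sketch that reproduces the mixing behavior shown in the example above (illustrative logic only; the extension's actual implementation may differ):

```python
# Reproduces the Rufus/cat example: how tokens are mixed into [filewords] captions.
def instance_caption(cap: str, inst: str = "Rufus", cls: str = "cat") -> str:
    if inst in cap:  # caption already names the subject
        return cap if cls in cap else cap.replace(inst, f"{inst} {cls}")
    return f"{inst} {cls}, {cap}"  # otherwise prepend both tokens

def class_caption(cap: str, inst: str = "Rufus", cls: str = "cat") -> str:
    if inst in cap:  # strip the subject, keep (or substitute) the class
        return cap.replace(f"{inst} ", "") if cls in cap else cap.replace(inst, cls)
    return f"{cls}, {cap}"  # otherwise prepend the class token

for cap in ["A photo of Rufus dancing", "Close up, bedroom background", "Rufus the cat"]:
    print(instance_caption(cap), "|", class_caption(cap))
```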

Sample Prompts

Sample Image Prompt and Sample Negative Prompt are the prompts used for generating sample images during and after training. [filewords] can also be used here. For example, I might set

  • Save Preview(s) Frequency (Epochs) = 10 (on the Settings tab)
  • Sample Image Prompt = [filewords]
  • Sample Negative Prompt = worst quality, low quality
  • Number of Samples to Generate = 5 (below)

to generate 5 sample images every 10 epochs.

Sample Prompt Template File can be used to generate samples from a different file.

Image Generation and Sample Image Generation

These are the settings for Class and Sample image generation.

The Number of Samples to Generate can be increased to generate more class images (more class images = more variety = more flexible output model). If you are using class images, 5 is a good number to start with. Each Instance image and 1 of its associated Class images is fed into dreambooth during an epoch. Increasing the number of class images will not increase the ratio of class to instance images that are fed into dreambooth.

Saving tab

General and Checkpoints and Diffusion Weights

This is where you configure saving behavior; the defaults should be fine for most people.

Generate a .ckpt file when saving during training should be used in combination with the Save Model Frequency (Epochs) option on the Settings tab. Diffusion Weights (training snapshots) can be used to save training snapshots that can be resumed from. This is an advanced feature and we recommend leaving Snapshots disabled unless you know what you are doing. Snapshots take up a lot of hard-drive space.

Lora

LoRa UNET and Text Encoder Rank are basically the quality sliders for LoRa output. Higher ranks = higher quality = larger output files. We recommend starting with both set to 32, which will produce ~100MB output files.

When LoRa is enabled, the Checkpoints settings will be used to generate checkpoints with your LoRa merged in. However, we recommend disabling them and using the LoRa Weights setting at the bottom of the LoRa section instead. The LoRa Weights settings control the output to the models/Lora directory. To use those smaller LoRa models (and others that may not already be compatible with A1111), install the a1111-sd-webui-locon extension.

Generate tab

This tab has more controls for class image and graph generation. The defaults should be fine for most users.

Testing

This tab has experimental settings. The defaults should be fine for most users.

4. Train

Hit the orange Train button to begin training. Training will stop when it reaches the # of epochs specified in Training Steps Per Image (Epochs). Hit Cancel to stop early.

5. Test results

After training completes, refresh your A1111 checkpoint list. You should see your new checkpoint(s) named with the pattern {modelname}_{step #}. If you used Save Model Frequency, you will see multiple checkpoints with different step counts. More steps = more training.

Select your model from the checkpoint list. Create a prompt with your keyword and generate several images. Repeat this process for a few prompts.

If the output does not resemble your training subject, you may need to train for longer: select your dreambooth model, load its settings, set Training Steps Per Image (Epochs) to 10 or 20, and hit Train. Repeat until you are happy.

If the output has weird textures or everything in the images starts to look like your training subject (bleed), you trained too much.

Other info

You can generate a checkpoint anytime with the Generate Checkpoint button. You can also use your custom checkpoints as Source Checkpoints if you want to resume training from a previous step count.