Validate text encoder cache + add deepspeed arg parsing #1372

Open. heyalexchoi wants to merge 4 commits into main.

@heyalexchoi heyalexchoi commented Jun 14, 2024

Changes:

  • Validates the text encoder cache, similar to how image latents are validated. Previously only the existence of the file was checked, but an incomplete or corrupted file broke my training 600 steps in.
  • Fixes deepspeed arg parsing in the latent and text encoder caching scripts. Related to #1288 (cache_text_encoder_outputs.py raises AttributeError: 'Namespace' object has no attribute 'deepspeed').
  • (Removed) Adds a --skip_to_step arg to sdxl_train. I am somewhat new to this project, so I'm not sure this is correct. The purpose is to bring the dataloader in sync with the desired step and the resumed state. It is much simpler than #1359 (Train resume step). I will remove this change if it is not correct.
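
For illustration, here is a minimal sketch of the kind of cache validation the first bullet describes. It is not the code in this PR. It assumes the cached outputs are `.npz` files (which are zip archives of `.npy` members) and that the keys are named `hidden_state1`, `hidden_state2`, and `pool2`. Both the key names and the helper `is_cache_file_valid` are assumptions made for this example.

```python
import zipfile
import zlib
from pathlib import Path

# Assumed key names, for illustration only; the real cache layout may differ.
REQUIRED_KEYS = ("hidden_state1", "hidden_state2", "pool2")


def is_cache_file_valid(path, required_keys=REQUIRED_KEYS):
    """Return True only if the .npz file exists, is readable, and has all keys."""
    path = Path(path)
    if not path.is_file():
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first name whose
            # CRC check fails, or None if all members are intact.
            if archive.testzip() is not None:
                return False
            members = set(archive.namelist())
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        # A truncated or partially written file usually fails here.
        return False
    # np.savez stores each array as "<key>.npy" inside the archive.
    return all(f"{key}.npy" in members for key in required_keys)
```

A caching loop could call this instead of a plain existence check and regenerate any file for which it returns False. A stricter variant could also load the arrays and check their shapes or look for NaN values.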

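The deepspeed error from #1288 can be reproduced in miniature: shared code reads `args.deepspeed`, but the script's parser never registered that flag. The sketch below uses only `argparse`. The helper names (`add_deepspeed_arguments`, `build_parser`, `uses_deepspeed`) are hypothetical stand-ins, not necessarily the functions used in this repository.

```python
import argparse


def add_deepspeed_arguments(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper: registers the flag that shared code later reads.
    parser.add_argument("--deepspeed", action="store_true", help="enable DeepSpeed")


def build_parser(register_deepspeed: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_dir", default=None)
    if register_deepspeed:
        add_deepspeed_arguments(parser)
    return parser


def uses_deepspeed(args: argparse.Namespace) -> bool:
    # Stands in for shared setup code that assumes the attribute exists.
    return args.deepspeed
```

Without the registration step, `uses_deepspeed(build_parser(False).parse_args([]))` raises `AttributeError: 'Namespace' object has no attribute 'deepspeed'`. With it, the attribute is always present and defaults to False.
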
Update:

  • Enables loading the SDXL tokenizers from pretrained_model_name_or_path. I'm not sure why this isn't done already; perhaps the code predates SDXL being easily available on Hugging Face? Currently the tokenizers are loaded from:
TOKENIZER1_PATH = "openai/clip-vit-large-patch14"
TOKENIZER2_PATH = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

However, I have modified tokenizers that I expected to be loaded from pretrained_model_name_or_path.
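
A rough sketch of the fallback logic this implies, not the code in this PR. It assumes `pretrained_model_name_or_path` points at a local directory in the Diffusers SDXL layout, with `tokenizer` and `tokenizer_2` subfolders. The function name `resolve_tokenizer_sources` is made up for the example, and Hub repo IDs are deliberately not handled here.

```python
from pathlib import Path

TOKENIZER1_PATH = "openai/clip-vit-large-patch14"
TOKENIZER2_PATH = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"


def resolve_tokenizer_sources(pretrained_model_name_or_path):
    """Prefer tokenizers shipped with the model; otherwise use the defaults."""
    root = Path(pretrained_model_name_or_path)
    tok1, tok2 = root / "tokenizer", root / "tokenizer_2"
    if tok1.is_dir() and tok2.is_dir():
        # Local Diffusers-style SDXL folder: use its (possibly modified) tokenizers.
        return str(tok1), str(tok2)
    # Single-file checkpoint or unknown layout: keep the current behavior.
    return TOKENIZER1_PATH, TOKENIZER2_PATH
```

The two returned strings could then be passed to whatever tokenizer loader the scripts already use, so that modified tokenizers stored next to the model are picked up.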

@kohya-ss
Owner

Thank you for this! For --skip_to_step, this implementation calls the DataLoader for every skipped step, so it will take a long time if the dataset is large. For example, if one epoch has 10,000 steps, skipping 10 epochs needs 100,000 DataLoader calls. So we need an epoch-skipping feature.

I will review other updates in this PR.
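
The arithmetic behind the comment above can be sketched as follows. Splitting the target step into whole epochs plus a remainder means only the remainder has to be drawn from the DataLoader. The function `split_skip` is a hypothetical helper for illustration, not something in this repository.

```python
from typing import Tuple


def split_skip(skip_to_step: int, steps_per_epoch: int) -> Tuple[int, int]:
    """Return (epochs_to_skip, steps_to_skip_in_next_epoch)."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    if skip_to_step < 0:
        raise ValueError("skip_to_step must be non-negative")
    # Whole epochs can be skipped without touching the DataLoader;
    # only the remainder needs per-step iteration.
    return divmod(skip_to_step, steps_per_epoch)
```

With the numbers from the example, `split_skip(100_000, 10_000)` gives `(10, 0)`, so no DataLoader calls are needed, and `split_skip(104_500, 10_000)` gives `(10, 4500)`, so 4,500 calls replace 104,500.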

@heyalexchoi
Author

heyalexchoi commented Jun 15, 2024

> Thank you for this! For --skip_to_step, this implementation calls the DataLoader for every skipped step, so it will take a long time if the dataset is large. For example, if one epoch has 10,000 steps, skipping 10 epochs needs 100,000 DataLoader calls. So we need an epoch-skipping feature.
>
> I will review other updates in this PR.

Ah, I see I missed the epoch part entirely. Removed that commit.
There is one more addition, which I described in the update to the original post.

@heyalexchoi
Author

Question:
Do you know if there is a reason for batch_size=1 when caching latents and text encoder outputs?
https://github.com/kohya-ss/sd-scripts/blob/main/tools/cache_text_encoder_outputs.py#L129
https://github.com/kohya-ss/sd-scripts/blob/main/tools/cache_latents.py#L124

Would it be OK to make this adjustable via args?

@kohya-ss
Owner

> Do you know if there is a reason for batch_size=1 when caching latents and text encoder outputs?

This is because of Aspect Ratio Bucketing. The batches are built inside the dataset with bucketing (all images in one batch have the same resolution), so the DataLoader must use batch_size=1 here.
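
As a simplified illustration of why the outer batch size stays at 1, the sketch below groups images into same-resolution batches ahead of time. It is pure Python and is not the dataset code in this repository. Real bucketing assigns images to the nearest bucket resolution, whereas this example groups by exact size, and `build_bucketed_batches` is a hypothetical name.

```python
from collections import defaultdict


def build_bucketed_batches(image_sizes, batch_size):
    """Group image names by (width, height), then chunk each group into batches."""
    buckets = defaultdict(list)
    for name, size in image_sizes.items():
        buckets[size].append(name)

    batches = []
    for size, names in buckets.items():
        for start in range(0, len(names), batch_size):
            # Every batch holds images of a single resolution.
            batches.append((size, names[start:start + batch_size]))
    return batches
```

Each element of the returned list is already a complete batch, so a loader that wraps it should fetch one element at a time. Letting the loader combine several elements would mix resolutions, and those tensors could not be stacked.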
