
Video temporal upsampling #305

Closed
HReynaud opened this issue Jan 23, 2023 · 32 comments
@HReynaud

Hi,

Is there a way with the current library to do temporal super-resolution? From what I can see, only spatial super-resolution is currently possible. I would like to have one UNet that generates videos with dimensions 64x64x16 and a super-resolution UNet that upsamples them to 128x128x32.
Please let me know if I am missing something!

@lucidrains
Owner

@HReynaud not yet, i'm planning to build that by month's end, or by early next month

@HReynaud
Author

Excellent news! Thank you for your hard work!

@lucidrains
Owner

@HReynaud yea no problem, it will take some thought rather than just blindly throwing in a 3d conv with strides

the reason is that i want to maintain the ability to pretrain on images before video, so it'll have to follow the existing scheme of being agnostic to image / video inputs

@lucidrains
Owner

@HReynaud should be able to get the temporal upsampling / interpolation finished tomorrow morning!

@lucidrains
Owner

@HReynaud pray tell, what dataset are you training on?

@HReynaud
Author

Thank you so much! I was considering giving it a try, but was not sure where to start. I am training on the EchoNet-Dynamic dataset to do cardiac ultrasound generation: https://echonet.github.io/dynamic/
The goal is to explore video generation with precise control over the embeddings. I am using auto-encoders and other techniques to encode specific information into a latent space that I use as the "text embeddings" for the conditional generation. I will give it a try as soon as it's live!
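(In imagen-pytorch terms, that roughly corresponds to passing precomputed latents in place of the usual T5 text embeddings. Below is a minimal sketch with placeholder shapes and a hypothetical embedding size of 512; argument names may differ between library versions.)

```python
import torch
from imagen_pytorch import Unet3D, ElucidatedImagen

unet = Unet3D(dim = 32, dim_mults = (1, 2, 4))

imagen = ElucidatedImagen(
    unets = unet,
    image_sizes = 64,
    text_embed_dim = 512,   # must match the dimension of the custom latents
)

videos  = torch.rand(2, 3, 16, 64, 64)   # (batch, channels, frames, height, width)
latents = torch.randn(2, 8, 512)         # auto-encoder latents standing in for text embeddings

loss = imagen(videos, text_embeds = latents, unet_number = 1)
loss.backward()
```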

@lucidrains
Owner

@HReynaud oh that's so cool! ok i'll def make sure to build this right 😃 this-mitral-valve-prolapse-does-not-exist.com lmao

@lucidrains
Owner

@HReynaud was able to get it working, although inpainting may still not be functional (will look at that tomorrow morning) 44da9be

@lucidrains
Owner

lucidrains commented Jan 25, 2023

@HReynaud to use it, when instantiating Imagen, just pass in temporal_downsample_factor

it will be a tuple of integers, specifying, for each unet in the series, how much to divide the number of frames by
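for instance, with two Unet3D stages it might look roughly like the sketch below (following the README-style video API; other arguments are left at their defaults and may differ between versions):

```python
from imagen_pytorch import Unet3D, ElucidatedImagen

unet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4))      # base video unet
unet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8))   # spatial + temporal super-resolution unet

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 128),               # spatial size per stage
    temporal_downsample_factor = (2, 1),   # the base unet works on half the frames, the last stage on all of them
)
```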

@HReynaud
Author

Awesome! I will let you know how it goes!

@lucidrains
Owner

@HReynaud how did it go?

@HReynaud
Author

The code seems to work flawlessly and temporal_downsample_factor is very straightforward to set up, thank you for that!

[video attachment: videos_10004_f59356ee9aae088a26ee]

(Top is sampled, bottom is ground truth)

This is the output of the upsampling model; I am going from 64x64x16 to 112x112x64. The top row is sampled from the SR model, conditioned on the low-resolution ground-truth videos. I have to investigate whether tuning some parameters in the ElucidatedImagen can make the speckle noise more consistent.

@lucidrains
Owner

@HReynaud ohh beautiful, it captured the valves and septal motion so well! looking forward to reading your paper on generated cardiac pathologies 😄 ❤️

@lucidrains
Owner

@HReynaud make sure you show that to some cardiologists, go blow their minds!

@HReynaud
Author

I'll make sure to ping you when the results get published somewhere 😃

@HReynaud
Author

Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the cond_images parameter, ElucidatedImagen gives the same cond_images to both the base model and the super-resolution model and only resamples the spatial dimensions. But the number of frames has to be different if temporal_downsample_factor is used.

The bug happens here. From my perspective, the solution would be to pass one cond_images per unet when sampling, so that cond_images could be set per-unet in the sampling loop; that would also make it possible to condition only one of the two models on cond_images.

The bug might also happen during training, but since I train the models separately, I have not tested it. In any case, when training, I take care of passing a downsampled version of my cond_images to the base unet myself, which suggests there is a similar issue in the training loop.
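(For reference, the manual workaround described above amounts to something like the sketch below; the tensor layout is assumed to be (batch, channels, frames, height, width), and the function name is just illustrative.)

```python
import torch
import torch.nn.functional as F

def downsample_cond_frames(cond_images: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # reduce the frame count of a conditioning video before handing it to the
    # base unet, leaving the spatial dimensions untouched
    b, c, f, h, w = cond_images.shape
    return F.interpolate(cond_images, size = (f // factor, h, w), mode = 'nearest')

cond = torch.rand(1, 3, 32, 112, 112)         # conditioning video prepared for the SR unet
cond_base = downsample_cond_frames(cond, 2)   # (1, 3, 16, 112, 112) for the base unet
```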

@alif-munim

Hey, @HReynaud, awesome results! I'm currently working with the EchoNet dataset as well, but so far have only been getting noise. May I ask how many steps you trained your model for in total to get these videos?

@HReynaud
Author

HReynaud commented Feb 2, 2023

Hi @alif-munim, glad to hear other people are looking into this! Generating images of size 64x64 with a Unet (not 3D) takes less than an hour of training on a modern GPU (e.g. a 3090 or A5000); I would try that first. This holds with parameters left at their defaults for the Imagen and ImagenTrainer modules. For the Unet, setting dim=64 and dim_mults=(1, 2, 4) should give good results.
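(A minimal image-only starting point along those lines could look like the sketch below; unconditional generation, just to check that training converges. It is an illustrative example, not the exact setup used here.)

```python
import torch
from imagen_pytorch import Unet, Imagen, ImagenTrainer

unet = Unet(dim = 64, dim_mults = (1, 2, 4))

imagen = Imagen(
    unets = unet,
    image_sizes = 64,
    condition_on_text = False,   # unconditional, for a quick sanity check
)

trainer = ImagenTrainer(imagen)

images = torch.rand(8, 3, 64, 64)   # stand-in for a real batch of 64x64 images
loss = trainer(images, unet_number = 1)
trainer.update(unet_number = 1)
```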

For video, I use the ElucidatedImagen. Try leaving all parameters at their defaults (especially ignore_time=False) and you should get results after a few hours of compute on a GPU cluster.

@lucidrains
Owner

lucidrains commented Feb 4, 2023

> Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the cond_images parameter, ElucidatedImagen gives the same cond_images to both the base model and the super-resolution model and only resamples the spatial dimensions. But the number of frames has to be different if temporal_downsample_factor is used.
>
> The bug happens here. From my perspective, the solution would be to pass one cond_images per unet when sampling, so that cond_images could be set per-unet in the sampling loop; that would also make it possible to condition only one of the two models on cond_images.
>
> The bug might also happen during training, but since I train the models separately, I have not tested it. In any case, when training, I take care of passing a downsampled version of my cond_images to the base unet myself, which suggests there is a similar issue in the training loop.

oh oops! ok, i'll get this fixed next week!

you should only have to pass in the cond_images (should be renamed cond_images_or_video) that the super-resolution unet receives, and it should automatically be temporally downsampled for the base unet during training

@lucidrains
Owner

@HReynaud say you have two unets, with the second unet temporally upsampling 2x; i'll probably make it error out if the number of frames of the conditioning video is any less than 2. do you think that is appropriate? or should i also allow a single conditioning frame across all unets

@lucidrains
Owner

@HReynaud actually, i'll just allow for that behind a feature flag, something like can_condition_on_single_frame_across_unets (long name just to be clear)

@HReynaud
Author

HReynaud commented Feb 4, 2023

Hi @lucidrains,
For the image conditioning, I am using a single frame repeated as many times as necessary along the time dimension, so in practice I have not encountered the case you mention.
I guess that if I were to use a video as conditioning, the problem you describe could arise, and your solution seems sound.

If you have some time, I have been making a few small corrections / edits to the code on my fork. These are minimal edits which should be easy to track.

main...HReynaud:imagen-pytorch:main

I have not pushed my latest edits, which correct a few bugs when using more than 2 Unets with temporal super-resolution. I'll push the commit tomorrow.

I considered making a pull request but my edits are really focused on what I am targeting and would not be general enough.

I’ll continue using your repo and will let you know if I encounter any more bugs.

@lucidrains
Owner

lucidrains commented Feb 4, 2023

@HReynaud ohh yes, there are actually two scenarios at play here

either you want to condition on an image, in which case you repeat across time and concat along the channel dimension

but the other type of conditioning would be, say, conditioning on a few preceding frames (much like prompting in GPT), in which case we would want to do the temporal down/upsampling and concatenate across time
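in tensor terms, the two cases look roughly like this (a sketch; shapes assumed to be (batch, channels, frames, height, width)):

```python
import torch
from einops import repeat

frames = 16
video = torch.rand(1, 3, frames, 64, 64)   # video being denoised
image = torch.rand(1, 3, 64, 64)           # single conditioning image

# case 1: condition on an image -> repeat it across time, concat on channels
image_over_time = repeat(image, 'b c h w -> b c f h w', f = frames)
channel_concat = torch.cat((video, image_over_time), dim = 1)   # (1, 6, 16, 64, 64)

# case 2: condition on preceding frames (prompt-like) -> concat across time
preceding = torch.rand(1, 3, 4, 64, 64)                # e.g. 4 known frames
time_concat = torch.cat((preceding, video), dim = 2)   # (1, 3, 20, 64, 64)
```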

@lucidrains
Owner

but the latter can be achieved through inpainting too

ok, this is all very confusing, but i'll definitely take a look at this Monday and get it squared away by the end of next week!

@lucidrains
Owner

@HReynaud hey Hadrien! i believe your issue should be resolved in the latest version! (do let me know if it hasn't)

i'll keep working on the conditioning across the time dimension, as that will allow one to generate arbitrarily long videos akin to phenaki

@HReynaud
Author

HReynaud commented Feb 6, 2023

Hi @lucidrains, thanks for your responsiveness! The commit looks great; I would just like to add that in resize_video_to, there is a check that prevents temporal upsampling if the spatial dimensions are untouched, i.e. orig_video_size == target_image_size and target_frames is not None
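(Roughly, the problematic pattern looks like the sketch below; this is an illustrative reconstruction of the check being described, not the actual library code.)

```python
import torch.nn.functional as F

def resize_video_to(video, target_image_size, target_frames = None):
    orig_video_size = video.shape[-1]

    # this early return is the check in question: when the spatial size already
    # matches, resizing is skipped entirely, so a requested change in the number
    # of frames (target_frames) is silently ignored
    if orig_video_size == target_image_size:
        return video

    frames = target_frames if target_frames is not None else video.shape[2]
    return F.interpolate(video, size = (frames, target_image_size, target_image_size), mode = 'nearest')
```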

@lucidrains
Owner

@HReynaud yes indeed! thank you! 3c24c60

@alif-munim

Hi @HReynaud, thank you so much for your advice! I have been trying to train a simpler text-to-image model as you suggested, but upon sampling after 50 epochs of training I'm still just getting a black square :/

Could you kindly let me know how large your dataset size was and how many epochs / training steps you needed before seeing some decent samples?

@HReynaud
Author

HReynaud commented Feb 8, 2023

Hi @alif-munim, try this script to get started: example

@alif-munim

> Hi @alif-munim, try this script to get started: example

Thanks so much @HReynaud! Could you kindly let me know why you used trainer.train_step() and trainer.valid_step() over trainer.update()? Is there a difference?

@HReynaud
Author

Hi @alif-munim, lucid could probably explain this more in-depth than me, but to put it simply, trainer.update() only applies the optimizer step to whatever gradients have already been accumulated; it does not run the model itself.

trainer.train_step() first runs the forward pass (which computes the loss and backpropagates) and then automatically calls trainer.update() to train the model. If you never run the forward pass, there are no gradients to apply, so the model never learns.
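(A rough sketch of the two styles with a stand-in model and dataset; the exact ImagenTrainer signatures may differ slightly between versions.)

```python
import torch
from torch.utils.data import Dataset
from imagen_pytorch import Unet, Imagen, ImagenTrainer

class RandomImages(Dataset):
    # stand-in dataset; replace with real 64x64 images
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        return torch.rand(3, 64, 64)

unet = Unet(dim = 32, dim_mults = (1, 2, 4))
imagen = Imagen(unets = unet, image_sizes = 64, condition_on_text = False)
trainer = ImagenTrainer(imagen)

# style 1: manual forward + update. the forward call computes the loss and
# backpropagates; update() then applies the optimizer step
images = torch.rand(4, 3, 64, 64)
loss = trainer(images, unet_number = 1)
trainer.update(unet_number = 1)

# style 2: hand the dataset to the trainer. train_step() draws a batch, runs the
# same forward/backward pass and then calls update() internally
trainer.add_train_dataset(RandomImages(), batch_size = 4)
for step in range(100):
    loss = trainer.train_step(unet_number = 1)
```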

@alif-munim

Thanks once again @HReynaud, I had been stuck on this issue for a while and you've helped tremendously! I believe the issue was that I was only ever calling trainer.update(), so the model did not learn to generate anything but noise.

@lucidrains, I think it would be a great idea to have @HReynaud's example script somewhere in the documentation for beginners like me :)
