Video temporal upsampling #305
Comments
@HReynaud not yet, i'm planning to build that by month's end, or by early next month
Excellent news! Thank you for your hard work!
@HReynaud yea no problem, it will take some thought rather than just blindly throwing in a 3d conv with strides. the reason is that i want to maintain the ability to pretrain on images before video, so it'll have to follow the existing scheme of being agnostic to image / video inputs
@HReynaud should be able to get the temporal upsampling / interpolation finished tomorrow morning!
@HReynaud pray tell, what dataset are you training on?
Thank you so much! I was considering giving it a try, but was not sure where to start. I am training on the EchoNet-Dynamic dataset to do cardiac ultrasound generation: https://echonet.github.io/dynamic/.
@HReynaud oh that's so cool! ok i'll def make sure to build this right 😃 this-mitral-valve-prolapse-does-not-exist.com lmao
@HReynaud to use it, when instantiating Imagen, just pass in the new keyword argument; it will be a tuple of integers, specifying, for each unet in the series, how much to divide the number of frames by
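For reference, a minimal sketch of such a configuration, assuming the tuple argument referred to above is `temporal_downsample_factor` (as in the repository's video example); the Unet3D hyperparameters here are purely illustrative:

```python
from imagen_pytorch import Unet3D, ElucidatedImagen

unet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8))
unet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8))

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 128),              # spatial size produced by each unet
    temporal_downsample_factor = (2, 1),  # first unet trains on videos with half the frames,
                                          # so the second unet learns the 2x temporal upsample
)
```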
Awesome! I will let you know how it goes!
@HReynaud how did it go?
The code seems to work flawlessly. [video: top is sampled, bottom is ground truth] This is the output of the upsampling model; I am going from 64x64x16 to 112x112x64. The top row is sampled from the SR model, conditioned on the low-resolution ground truth videos. I have to investigate if tuning some parameters in the …
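A hedged sketch of how sampling only the super-resolution stage might look, assuming the library's `start_at_unet_number` and `start_image_or_video` sampling arguments; the prompt and `lowres_video` are hypothetical placeholders:

```python
# lowres_video: the 64x64x16 ground-truth conditioning video,
# shaped (batch, channels, frames, height, width)
videos = imagen.sample(
    texts = ['an echocardiogram video'],  # hypothetical prompt
    video_frames = 64,
    start_at_unet_number = 2,             # skip the base unet entirely
    start_image_or_video = lowres_video,  # low-resolution video for the SR unet to upsample
)
```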
@HReynaud ohh beautiful, it captured the valves and septal motion so well! looking forward to reading your paper on generated cardiac pathologies 😄 ❤️
@HReynaud make sure you show that to some cardiologists, go blow their minds!
I'll make sure to ping you when the results get published somewhere 😃
Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the … The bug happens here. From my perspective, the solution would be to pass one … The bug might also happen during training, but as I train the models separately, I have not tested it. Nonetheless, when training the models, I take care of passing a downsampled version of my …
Hey, @HReynaud, awesome results! I'm currently working with the EchoNet dataset as well, but so far have only been getting noise. May I ask how many steps you trained your model for in total to get these videos?
Hi @alif-munim, glad to hear other people are looking into this! Generating images of size 64x64 with a … For video, I use the …
oh oops! ok, i'll get this fixed next week! you should only have to pass in the …
@HReynaud say you have two unets, with the second unet temporally upsampling 2x. i'll probably make it error out if the number of frames on the conditioning video is any less than 2. is that appropriate, you think? or should i also allow a single conditioning frame across all unets
@HReynaud actually, i'll just allow for that behind a feature flag, something like …
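As a sketch of what such a check might look like (the flag and helper names here are hypothetical, not the library's API):

```python
def check_cond_video_frames(cond_video_frames, temporal_downsample_factor, allow_single_cond_frame = False):
    # conditioning video is shaped (batch, channels, frames, height, width)
    frames = cond_video_frames.shape[2]

    if frames == 1 and allow_single_cond_frame:
        # a lone conditioning frame would simply be repeated across time
        return

    # with a 2x temporal upsample at the second unet, the conditioning
    # video needs at least 2 frames for temporal resizing to be meaningful
    min_frames = max(temporal_downsample_factor)
    assert frames >= min_frames, f'conditioning video must have at least {min_frames} frames'
```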
Hi @lucidrains, if you have some time, have a look: I have been making a few small corrections / edits to the code on my fork. These are minimal edits which should be easy to track: main...HReynaud:imagen-pytorch:main. I have not pushed my latest edits that correct a few bugs when using more than 2 Unets with temporal super-resolution; I'll push the commit tomorrow. I considered making a pull request, but my edits are really focused on what I am targeting and would not be general enough. I'll continue using your repo and will let you know if I encounter any more bugs.
@HReynaud ohh yes, there's actually two scenarios at play here. either you want to condition on an image, in which case you repeat it across time and concat along the channel dimension. but the other type of conditioning would be, say, conditioning on a few preceding frames (much like prompting in GPT), in which case we would want to do the temporal down/upsampling and concatenate across time
but the latter can be achieved through inpainting too. ok, this is all very confusing, but i'll definitely take a look at this monday and get it squared away by end of next week!
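To make the two scenarios concrete, a minimal tensor-level sketch (not the library's code), with videos shaped (batch, channels, frames, height, width):

```python
import torch
from einops import repeat

video = torch.randn(1, 3, 16, 64, 64)
cond_image = torch.randn(1, 3, 64, 64)      # a single conditioning image
cond_frames = torch.randn(1, 3, 4, 64, 64)  # a few preceding conditioning frames

# scenario 1: repeat the image across time, then concat along the channel dimension
image_over_time = repeat(cond_image, 'b c h w -> b c f h w', f = video.shape[2])
channel_concat = torch.cat((video, image_over_time), dim = 1)  # (1, 6, 16, 64, 64)

# scenario 2: concatenate the preceding frames across time, like prompting in GPT
time_concat = torch.cat((cond_frames, video), dim = 2)         # (1, 3, 20, 64, 64)
```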
Hi @lucidrains, thanks for your quick response! The commit looks great. I would just like to add that in resize_video_to, there is a check that prevents temporal upsampling if the spatial dimensions are untouched, i.e. …
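In other words, the early-return should compare the full target shape rather than the spatial size alone. A minimal sketch of the suggested fix (not the repository's exact code):

```python
import torch.nn.functional as F

def resize_video_to(video, target_image_size, target_frames = None):
    frames = video.shape[2]
    target_frames = target_frames if target_frames is not None else frames
    target_shape = (target_frames, target_image_size, target_image_size)

    # comparing (frames, height, width) keeps purely temporal resizing from being skipped
    if tuple(video.shape[-3:]) == target_shape:
        return video

    return F.interpolate(video, target_shape, mode = 'nearest')
```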
Hi @HReynaud, thank you so much for your advice! I have been trying to train a simpler text-to-image model as you suggested, but upon sampling after 50 epochs of training I'm still just getting a black square :/ Could you kindly let me know how large your dataset was and how many epochs / training steps you needed before seeing some decent samples?
Hi @alif-munim, try this script to get started: example
Thanks so much @HReynaud! Could you kindly let me know why you used …
Hi @alif-munim, lucid could probably explain this more in-depth than me, but to put it simply, …
Thanks once again @HReynaud, I had been stuck on this issue for a while now and you've helped tremendously! I believe the issue was that I was only ever using …

@lucidrains, I think it would be a great idea to have @HReynaud's example script somewhere in the documentation for beginners like me :)
Hi,
Is there a way with the current library to do temporal super-resolution? From what I can see, only spatial super-resolution is currently possible. I would like to have one UNet that generates videos with dimensions 64x64x16 and a super-resolution UNet that would upsample them to 128x128x32.
Please let me know if I am missing something!