Reproducing DALL-E using DeepSpeed #137
@mehdidc Hi Mehdi! I am actually busy with the protein folding replication (AlphaFold2), but I think @robvanvolt and @afiaka87 would definitely love to make use of the resources :) Thank you!
@lucidrains Just for context, I deferred them to you due to my inability to answer questions regarding multi-GPU compute. @mehdidc Seems we're all a bit busy at the moment. I will do my best to help you with this if you can file issues for us, but I've decided to be fairly hands-off in the Discord chat for personal reasons.
@lucidrains I know you're busy, but a quick yes or no will suffice: does the codebase in its current form make use of multiple GPUs?
@mehdidc Just to be clear - we are quite interested. I'll be making this a high priority but can only help so much due to my lack of machine learning knowledge. I'm assuming robvanvolt feels similarly, but they are also dealing with quite a surge in traffic on the newly created Discord. If you have a bit of patience, though, we'll both be able to help you out along the process.
Ohhh right, so the current script does not do multi-GPU, but it should be pretty easy to get multi-GPU working with the newest DeepSpeed (or PyTorch Lightning). I'll see what I can do tomorrow.
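For readers following along, here is a minimal sketch of what the DeepSpeed data-parallel route generally looks like. It is not the dalle-pytorch training script; the toy model, toy dataset, and `ds_config.json` are placeholders, and the config file is assumed to define at least `train_batch_size`.

```python
# Minimal data-parallel sketch with DeepSpeed (toy model, NOT the dalle-pytorch script).
# Launched e.g. with:
#   deepspeed --num_gpus 4 this_script.py --deepspeed --deepspeed_config ds_config.json
# ds_config.json is assumed to exist and define at least "train_batch_size".
import argparse

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # filled in by the launcher
parser = deepspeed.add_config_arguments(parser)            # adds --deepspeed / --deepspeed_config
args = parser.parse_args()

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))  # stand-in for DALLE
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))   # stand-in data
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# DeepSpeed wraps the model in an engine and builds a distributed dataloader for us.
engine, optimizer, loader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=dataset,
)

for x, y in loader:
    x, y = x.to(engine.device), y.to(engine.device)
    loss = nn.functional.mse_loss(engine(x), y)
    engine.backward(loss)  # replaces loss.backward(); handles scaling / ZeRO bookkeeping
    engine.step()          # optimizer step (+ zero_grad)
```

The same pattern should extend to multi-node runs via the launcher's hostfile support, but that is untested here.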
Hey folks! So it will be no trouble to arrange access to compute resources on the scale of 2 compute nodes with 4x GPUs each, given that we look together into multi-GPU execution, preferably using DeepSpeed (as it seems to me the most straightforward way with transformers right now), though we are open to other suggestions. @lucidrains I can imagine that, starting from that, we can also work together on AlphaFold2, at least with regard to its Transformer component. So we may be able to turn this into a generic collaboration on distributed training of various useful architectures in multi-node, multi-GPU settings. Please let us know what you think.
My first concern is with regard to DeepSpeed. I've not yet been able to get it working (independently) with the sparse attention that we use. Is this something you've dealt with? I believe lucidrains has gotten it working, because there's an install script in the repo and code for it. But as it stands we don't have a pleasant Docker-deploy type scenario (and those scripts don't seem to work on my configs even if I switch to the correct cudatoolkit, etc.). Furthermore, I'm not certain that Microsoft actually supports the A100 GPU yet. For now it seems your best bet is to deploy using a V100 or a Titan RTX. I've filed an issue about this here. Give it a thumbs up and maybe they'll have a look? Not likely, though. That's not to say that it won't work - but it may require severe tinkering.
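For what it's worth, DeepSpeed ships a `ds_report` command that prints which of its ops (sparse attention included) it considers compatible with the local CUDA/GPU setup, and a tiny forward pass makes a quick smoke test. The sketch below is a rough check only: the shapes and sparsity config are arbitrary, and to my understanding the kernels want fp16 CUDA tensors and a sequence length divisible by the block size.

```python
# Rough smoke test for DeepSpeed sparse attention (arbitrary shapes/config, not the
# dalle-pytorch wiring). The kernels expect CUDA tensors in fp16 and a sequence
# length that is a multiple of the sparsity block size.
import torch
from deepspeed.ops.sparse_attention import FixedSparsityConfig, SparseSelfAttention

heads, block = 8, 16
attn = SparseSelfAttention(
    sparsity_config=FixedSparsityConfig(num_heads=heads, block=block)
).cuda()

batch, seq_len, head_dim = 2, 256, 64  # 256 is a multiple of the block size
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.half)
out = attn(q, q, q)  # plain self-attention: reuse q as key and value
print("sparse attention OK:", out.shape)
```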
Good question - so first we have to clarify whether the sparse attention transformer and DeepSpeed go along together. We ourselves haven't tried it - in fact, we only run DeepSpeed in a very simple multi-GPU scenario, data parallel mode, for standard CIFAR-10 supervised training on ResNet-50, so quite boring. How about this: I will provide you with links and instructions on how to register at our supercomputing facilities and will grant you access to some compute resources. We can then try together to run this one particular test, DeepSpeed with the sparse attention transformer. Timewise there is no hurry. In fact, we are also unfortunately quite busy and until the end of May will have only sparse availability )) for hands-on work with you. From June on, it looks better. But we can manage the first steps, so that you have your environment on the compute nodes, the libraries in place, etc. One note - on supercomputing nodes it is not really possible to flexibly switch low-level things like NVIDIA drivers, or between a lot of different CUDA versions, if that becomes necessary.
It is not a problem to start with V100s; we have nodes with those as well.
Hm - well, if you're not in desperate need of the actual sparse attention, then as far as I'm concerned, turn it off the moment it gives you problems, ha. And yeah, I believe the V100s would be a better starting point to just get the code running at least. Do any of you have local dev environments with GPUs you can use as well, without needing to explicitly include them in your budget?
We do have a machine with 4x V100 without budget limitation, with the drawback that it is not accessible from outside. I think it would be better to get on a machine where we all can work together. Let's try to have a model training running on a compute node where we all have access. Once we have it tuned, we can commit longer training runs on the local machine for further testing.
@lucidrains @afiaka87 Let's do a step like this: please drop me a short email at j.jitsev@fz-juelich.de, and I will send you instructions so that you can already register for access and I can already add you both to the compute project. We do this step and see from there how to organize ourselves.
With regard to sparse attention: another colleague of ours, Alex Strube (@surak), opened an issue at DeepSpeed - judging from that discussion, it should be fine to go with V100 and CUDA 11: microsoft/DeepSpeed#790
@lucidrains If it's fine with you, I'd take the learning experience with DeepSpeed and try to get it running on some V100s tomorrow. Please tell me if you'd rather do it yourself; otherwise I'm definitely up to relieve you of that.
@janEbert would be pleased for you to take the helm!
Thanks for the trust. ;)
Awesome! This got a little traction fast! :D I'm currently trying to get DeepSpeed with sparse attention running on an RTX 3090 (it should work on the A100 then, if that succeeds). @afiaka87 is right, I'm rather new to ML - just a programmer for a little more than a decade - so I wouldn't be of much help in the deep darks of ML outside of a little code optimization / preprocessing and translating of captions / organizing stuff (that was the reason for the Discord: a more organized crew and less "chat" here in the GitHub issues).
@janEbert Thanks a ton for taking this up! Your prior experience means you're likely to figure that out a bit faster than I could have.
Ah right, you also mentioned you wanted to do it, sorry! I'll see how far I can get tomorrow and stay in touch with you on the Discord, is that okay?
Please do! I'll be highly available to help if you need anything.
That's great to know, thank you!
@janEbert @JeniaJitsev This system is indeed complex. Could I borrow one of you for a quick tutorial on deploying dalle-pytorch with a proper dataset? I believe that would speed things up a bit for me.
You can also borrow @mehdidc, who originated this issue; he will also be eager to help, I guess ))
Great work so far! I just want to throw it out there that PyTorch Lightning has DeepSpeed (and wandb) integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed Perhaps by using it we can get the best of both worlds and keep things significantly less complex than they would otherwise be?
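To illustrate the suggestion (hedged: the exact Trainer arguments differ between Lightning versions, and `DalleLitModule` is a hypothetical LightningModule wrapper around dalle-pytorch that would still have to be written), the Lightning route would look roughly like this:

```python
# Rough sketch of the PyTorch Lightning route; DalleLitModule is hypothetical and would
# wrap the DALLE model, its optimizer, and its train_dataloader().
import pytorch_lightning as pl

model = DalleLitModule()            # hypothetical LightningModule wrapping DALLE
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    precision=16,
    strategy="deepspeed_stage_2",   # older Lightning releases passed this via `plugins=`
)
trainer.fit(model)                  # assumes the module defines its own train_dataloader()
```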
Thanks for the suggestion, I didn't know about that! To the software engineer in me it's valuable to have direct access to the API I'm using. For what it's worth, if I understood the documentation correctly, training is now set up so we can do anything PyTorch Lightning can (using ZeRO or ZeRO-Offload). I definitely see the value in clean (and, even more importantly, battle-tested) research code.
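For context, those ZeRO and ZeRO-Offload knobs live in the DeepSpeed config. Below is an illustrative stage 2 + optimizer-offload setup, written as the equivalent Python dict; the values are placeholders and the exact offload keys vary a bit between DeepSpeed versions.

```python
# Illustrative ZeRO stage 2 + CPU offload settings (placeholder values). The same dict,
# saved as ds_config.json, can be passed to the script via --deepspeed_config.
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload: optimizer state in CPU RAM
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```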
Well, as expected, I was completely wrong... :D
@janEbert no worries, give it another try :) I believe in you
@janEbert, @mehdidc & everyone: It seems the EleutherAI folks are working on training and then releasing a publicly available large GPT version (a 175B one), and they also use a code base that employs DeepSpeed. It looks to me like it could be helpful for the DeepSpeed experiments we conduct. They also have their own fork of DeepSpeed adapted to this end.
Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but from my assumptions this really should work out of the box with the current code.
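For reference, the stage 3 variant should only change the `zero_optimization` block of the config, roughly like this (keys as in recent DeepSpeed versions; values are illustrative):

```python
# Illustrative ZeRO stage 3 settings (placeholder values). Stage 3 additionally
# partitions the parameters themselves and can offload them to CPU as well.
zero_stage3 = {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": True,
    "contiguous_gradients": True,
}
```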
Okay, if it works out of the box with DeepSpeed, all the better. Fewer libraries, less trouble ))
@janEbert amazing job! 💯 💯 🙏 |
Just to confirm: Stage 3 works, our sysadmin sacrificed valuable holiday time to quickly upgrade DeepSpeed! |
Hi @lucidrains, Hi @robvanvolt,
@JeniaJitsev initially started a discussion in @robvanvolt's Discord channel.
Just a brief recap.
We (@JeniaJitsev, @janEbert, and myself) are in a research group in Germany, Helmholtz AI, which is part of the Helmholtz Association. We are interested in reproducing DALL-E, and we can offer you access to A100 GPUs (from https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) for reproducing the model, ideally using DeepSpeed for distributed training.
What are your thoughts? Would you be interested?