Adding TorchBench SDL / Container Example for Gpu Benchmarking #387
Comments
Thanks for this @rakataprime -- here are some thoughts (will look to @chainzero @andy108369 @troian for additional input as well): What is our timeout for large Docker container pulls on Akash? What is our desired time budget for running the benchmark, i.e. how long should it run? What is our desired computational budget for running the benchmark, i.e. what are the smallest provider resources we should assume? Which models are most important? The older models, e.g. ResNet, will be more comparable across platforms, but newer application-focused models like a dalle2 or llama inference run are more relevant to the end user. We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add about a 5-10 minute start delay.
So if we look at the two tiers of GPUs, we still have a lot of variation within those tiers. Tier 1: H100, A100, V100, P100, A40, A10, P4, K80, T4, 4090, 4080, 3090Ti, 3090, 3080Ti, 3080, 3060Ti. For instance, the latest CUDA 11 is deprecated for the K80 generation of cards and earlier. The lowest VRAM in tier 1 is the 3060Ti with 8 GB, and we probably wouldn't want to run benchmarks like BERT large on cards without enough VRAM to actually run them. Right now, of the models shared between the Lambda list and torchbench, the only ones that we couldn't run on all of tier 1 would be BERT or other LLMs.

The other thorny issue is which CUDA/cuDNN version to install on the nodes. I think k8s is still limited to one driver version and one CUDA version on the nodes. Even if you could run multiple CUDA versions, it would hurt distributed training if the pool was highly fragmented, because a deployment would only be able to work with the fraction of nodes compatible with its CUDA version. If you have to keep the newly deprecated cards, it may be better to move just that generation of cards to the last supported CUDA version and bump the others to the latest.

Currently the torchbench container is using pytorch 2.0.1-cuda11.7-cudnn8-runtime with Python 3.10. There are some major performance improvements with the latest CUDA 11 and PyTorch 2 for generative AI, especially Stable Diffusion, compared with PyTorch 1 and CUDA versions before Jan 2022.

The relevant torchbench models currently supported in that list from Lambda Labs are: The models not currently included are ssd, gnmt, transformerxlbase, transformerxllarge, and baseglow. We could substitute longformer for transformerxl and yolov3 for ssd. I'm not sure what would be a similar model for gnmt that is already in torchbench. If that subset of the shared models is sufficient, then I can refactor the container to install at run time and update the entrypoint to only benchmark those shared models.
Once we have a list of core models and a smaller container, we will have a better sense of where we stand relative to the 5 min GPU benchmark goal.
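The VRAM concern raised in this comment could be handled by filtering the model list per card before the benchmark starts. The following is only a minimal sketch: the model names and the minimum-VRAM figures in `MODEL_MIN_VRAM_GB` are hypothetical placeholders rather than measurements from torchbench, and `runnable_models` is an illustrative helper, not part of any existing tool.

```python
from typing import Dict, List

# Hypothetical minimum-VRAM estimates in GB. These are placeholder values
# for illustration only, not numbers measured with torchbench.
MODEL_MIN_VRAM_GB: Dict[str, float] = {
    "resnet50": 4.0,
    "yolov3": 6.0,
    "bert_large": 16.0,
}


def runnable_models(
    gpu_vram_gb: float,
    requirements: Dict[str, float] = MODEL_MIN_VRAM_GB,
) -> List[str]:
    """Return the models whose estimated VRAM need fits on the GPU, sorted by name."""
    return sorted(
        name for name, need_gb in requirements.items() if need_gb <= gpu_vram_gb
    )


if __name__ == "__main__":
    # The 3060Ti (8 GB) is the smallest tier 1 card mentioned in the thread.
    print(runnable_models(8.0))
```

With real per-model requirements, the entrypoint could skip models that cannot fit instead of failing partway through a run.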
Thanks for the details @rakataprime - the substitutions of the models you mentioned sound fine. The CUDA version issue should only arise in the case of a heterogeneous provider (more than one GPU type in the same cluster) where the GPU models in the cluster require different CUDA versions, right? I think that may be a relatively uncommon case for the testnet (but could be a problem). Thinking of the logistics of all this, is it better if we just built an SDL (or more than one SDL) that deployed a Jupyter notebook with the correct Python kernel and PyTorch included? At least for the TensorFlow models, the approach I was thinking we could take would be to have people run https://github.com/akash-network/awesome-akash/tree/master/tensorflow-jupyter-mnist and then use that instance to run the models from the list in https://github.com/tensorflow/models/tree/master/official
@anilmurty, if you don't actively try to corral the providers into standardized CUDA versions, it would prevent people from running training jobs like foundation models across multiple providers, because the SDL includes one Docker container for the training job, and that container has a CUDA version dependency. My startup wants to train a foundation model with Akash (lmk if you want to discuss a formal partnership on this more), but would want to train across a huge cluster of GPUs, not just one provider. I think you can have GPU heterogeneity, but you want the GPUs to be on the same CUDA/cuDNN version and preferably to have a known minimum VRAM. I think in k8s you can set GPU requirements for VRAM with a helm plugin. Are VRAM resource requirements a setting in SDL right now? I'm not sure if I saw that in the docs.

I don't like the notebooks because they are prone to people executing cells out of order and ending up without functional code. You could do a notebook and then have people export it as a PDF after the benchmark runs. Usually the formatting of console-like output isn't that great though. I think we would probably be better off writing JSON output somewhere else, like an S3-compatible bucket, IPFS, or an internal database, for aggregation. It might be a lot of data to write on chain, but you could certainly write some of the summary data on chain easily. I don't know if there is an easy Cosmos Python client though, and you may have to use Rust through a Python-Rust bridge to do that easily.
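As a rough illustration of the JSON output idea in this comment, the sketch below condenses per-model timings into a small summary document that could then be uploaded to a bucket or other store. The field names (`gpu`, `cuda`, `mean_s`, and so on) and the `summarize` helper are assumptions made for this example; nothing here reflects an agreed schema or torchbench's own output format.

```python
import json
import statistics
from typing import Dict, List


def summarize(results: Dict[str, List[float]], gpu: str, cuda: str) -> str:
    """Condense raw per-model timings (seconds) into a compact JSON summary.

    The schema is hypothetical and only meant to show the shape of the idea.
    """
    summary = {
        "gpu": gpu,
        "cuda": cuda,
        "models": {
            name: {
                "runs": len(times),
                "mean_s": round(statistics.mean(times), 4),
                "min_s": round(min(times), 4),
            }
            for name, times in results.items()
            if times  # skip models that produced no timings
        },
    }
    return json.dumps(summary, sort_keys=True, indent=2)


if __name__ == "__main__":
    # Example timings are made up for demonstration.
    print(summarize({"resnet50": [1.0, 3.0]}, gpu="3060Ti", cuda="11.7"))
```

A summary this small would also be a more realistic candidate for writing on chain than the full raw output.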
hey @rakataprime - sorry for the late reply - somehow missed the notification for this. Would definitely be interested in discussing a partnership with you. I've reached out via Discord DM to coordinate. Re. notebooks - I was looking at them purely for the benchmarking exercise for the testnet and not really for use in production for training or inference. Do you feel like the PyTorch SDL is usable now? Asking because I was planning to update the instructions to tell people to use either PyTorch or TensorFlow for the testnet exercise, with a preference towards PyTorch. Thanks!
@anilmurty, I think someone should test the torchbench SDL on the GPU testnet before we say it's usable. I believe it is currently usable, but we should test that assumption since the GPU testnet is up now. If we want Jupyter notebook usage, we should package Jupyter in the Docker container or add a second container/SDL to make it as easy as possible for people, with clear instructions for those who may not have used Jupyter before. I would also clarify in those instructions how you want them to export the notebook, especially if you want to look at 20+ submissions.
Thanks @rakataprime - I'll test this out and confirm https://github.com/akash-network/awesome-akash/blob/e115932a1b8e0536649a2d88f3a614f097ad2c43/torchbench/torchbench_gpu_sdl.yaml (@chainzero - would be great if you did too). Is this usable for the jupyter notebook? https://github.com/akash-network/awesome-akash/tree/master/jupyter
hey @rakataprime - I just tested it and unfortunately it doesn't work because we have since added support for specifying some GPU attributes (vendor and model). Here are 3 examples of what the structure is like: https://docs.akash.network/testnet/example-gpu-sdls At a minimum, the SDL needs to be updated to include the "vendor" key as shown here: https://docs.akash.network/testnet/example-gpu-sdls/specific-gpu-vendor
With that change it still doesn't return bids (probably because there are no GPU providers on the network that meet the requirements yet), but at least the SDL is valid.
@anilmurty the latest commit adds Jupyter and an example notebook. It still needs to be tested on testnet. Also, the Jupyter notebook implementation requires users to paste in the auth token from the logs to get access.
Hello
The purpose of this issue is to add an SDL/container for running PyTorch benchmarks with torchbench on GPU providers.
I have written an SDL and Dockerfile. It is taking forever (>3 hrs to Docker Hub/ECR) to push the Docker image due to its size. I may change our approach slightly to decrease the size of the image, at the cost of a slight delay at runtime start. I wanted to get some feedback from the GPU team/community before proceeding further.
We need to set some requirements for the provider benchmarks:
Some added context:
The container image is around 20 GB uncompressed: 6 GB of this is the PyTorch runtime and the other 14 GB are the models and code used for benchmarking. We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add about a 5-10 minute start delay.
The actual benchmark itself would take about 5-8 hours if run sequentially on a MacBook Pro, skipping the GPU benchmarks; Meta currently runs the benchmark on a GPU cluster.
We don't have to run every benchmark though. The fastest approach would be to run a small subset of relevant benchmarks, with installation of the torchbench repo deferred until run time to decrease the container size.
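One way to think about choosing "a small subset of relevant benchmarks" is to pick models in priority order until an overall time budget is used up. The sketch below assumes that per-model runtime estimates are available; the `pick_within_budget` helper and all durations shown are hypothetical and are not torchbench measurements.

```python
from typing import List, Tuple


def pick_within_budget(
    candidates: List[Tuple[str, float]], budget_s: float
) -> List[str]:
    """Greedily keep models, in priority order, while the estimated total fits.

    `candidates` is a list of (model_name, estimated_seconds) pairs, ordered
    from most to least important. A model that would exceed the budget is
    skipped, and later (cheaper) models may still be selected.
    """
    chosen: List[str] = []
    used_s = 0.0
    for name, estimate_s in candidates:
        if used_s + estimate_s <= budget_s:
            chosen.append(name)
            used_s += estimate_s
    return chosen


if __name__ == "__main__":
    # Placeholder estimates in seconds, for illustration only.
    plan = [("resnet50", 60.0), ("yolov3", 120.0), ("longformer", 200.0)]
    print(pick_within_budget(plan, budget_s=300.0))
```

The same budget could include the install-at-runtime delay by subtracting it from `budget_s` before selecting models.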