Replace base image with CUDA image #39
Conversation
I've run this on a single node so far and the results are indistinguishable between the 6.1.1 image and this new, lighter image. I'm marking this as a draft until I can do some multi-node testing to verify functionality, though I expect that to yield the same results given that single-node runs also use mpirun.
My only concern is whether we lost some network functionality somehow, but I think DeepOps uses the base CUDA image for some multi-node testing, so I'm hopeful. As you say, let's hold this until we can do a multi-node test.
Force-pushed from c447d28 to 6fc44f0
Bobber currently uses a TensorFlow image from NGC as the base in order to use some of the TensorFlow functionality in the tests. While this is convenient, the Bobber image is quite large (12 GB+ at present). By moving to a CUDA base image and installing TensorFlow inside the Bobber image, it might be possible to reduce the overall image size by several gigabytes with no change in functionality or performance. This will require a thorough investigation of the potential impacts of such a change.

Signed-off-by: Robert Clark <roclark@nvidia.com>
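A minimal sketch of what the swap might look like, assuming a CUDA base image from NGC; the image tag, package names, and TensorFlow version below are illustrative assumptions, not the exact ones used in this PR:

```dockerfile
# Illustrative sketch only: the CUDA tag and TensorFlow version are assumptions.
FROM nvcr.io/nvidia/cuda:11.4.2-runtime-ubuntu20.04

# Install Python and pip so TensorFlow can be added directly,
# rather than inheriting it from the much larger NGC TensorFlow image.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Pull in TensorFlow itself, which is the only piece of the NGC
# TensorFlow image the tests actually depend on.
RUN pip3 install --no-cache-dir tensorflow
```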
Force-pushed from 28882a4 to 6b490a8
I was able to verify functionality on multi-node just now and it appears to work well (though I don't have the fastest storage for this cluster, so it isn't scaling, but that's expected). Given that this resolves issue #82, there doesn't appear to be a performance regression, and it works well for both single- and multi-node runs, I will go ahead and move this out of draft and merge.
The Bobber image requires compiling multiple binaries across many repositories, creating a lot of unnecessary files that bloat the image. By using multi-stage builds, much of the compilation and dependency installation can be done in a beefier build image, and only the necessary components are copied to the final runtime image. This reduces the final image by several gigabytes with no loss of performance or functionality.

Signed-off-by: Robert Clark <roclark@nvidia.com>
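A minimal sketch of the multi-stage layout, using the NCCL tests as one example of a compiled tool; the image tags, repository, and paths are assumptions for illustration rather than the PR's actual Dockerfile:

```dockerfile
# Illustrative sketch only: tags, repositories, and paths are assumptions.
FROM nvcr.io/nvidia/cuda:11.4.2-devel-ubuntu20.04 AS builder

# Build the test binaries in the heavier devel image, which carries
# the compilers and CUDA headers needed for compilation.
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential git && \
    rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests && \
    make -C /opt/nccl-tests

# The slimmer runtime image omits compilers, headers, and build trees.
FROM nvcr.io/nvidia/cuda:11.4.2-runtime-ubuntu20.04

# Copy only the compiled binaries out of the build stage.
COPY --from=builder /opt/nccl-tests/build/ /usr/local/bin/
```

The size savings come from the final stage starting from the runtime image and never containing the compilers, source trees, or intermediate build artifacts from the builder stage.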
Newer versions of FIO changed the way results are displayed, causing the parser to complain that the results are invalid. The newer versions contain an extra value in the results line which can safely be ignored.

Signed-off-by: Robert Clark <roclark@nvidia.com>
Force-pushed from 6b490a8 to e9d7d1a
LGTM.
The TensorFlow image on NGC is quite large (10+ GB) and contains a lot of unnecessary data. Using one of the CUDA images as a base cuts several gigabytes of data with no change in performance or functionality compared to the existing Bobber image. To further reduce the image size, multi-stage builds allow us to compile many of the testing tools, such as NCCL and mdtest, inside a beefier build image while copying only the necessary binaries to the final, slimmer runtime image.
Closes #2
Fixes #82
Signed-off-by: Robert Clark <roclark@nvidia.com>