Replace base image with CUDA image #39

Merged
merged 3 commits into from
Mar 21, 2022

roclark
Member

@roclark roclark commented Feb 23, 2021

The TensorFlow image on NGC is quite large (10+ GB) and contains a lot of unnecessary data. Using one of the CUDA images as a base removes several gigabytes of data with no change in performance or functionality compared to the existing Bobber image. To further reduce the image size, multi-stage builds allow us to compile many of the testing tools, such as NCCL and mdtest, inside a beefier build image while copying only the necessary binaries to the final, slimmer runtime image.

Closes #2
Fixes #82

Signed-Off-By: Robert Clark <roclark@nvidia.com>

@roclark roclark added the enhancement and docker labels Feb 23, 2021
@roclark roclark self-assigned this Feb 23, 2021
@roclark
Member Author

roclark commented Feb 23, 2021

I've run this on a single node so far, and the results are indistinguishable between the 6.1.1 image and this new, lighter image. I'm marking this as a draft until I can do some multi-node testing to verify functionality, though I expect that to yield the same results given that single-node runs use mpirun.

@joehandzik
Contributor

My only concern is whether we lost some network functionality somehow, but I think DeepOps uses the base CUDA image for some multinode testing, so I'm hopeful. As you say, let's hold this until we can do a multinode test.

Bobber currently uses a TensorFlow image from NGC as the base to use some
of the TensorFlow functionality in the tests. While this is efficient,
the Bobber image is quite large (12GB+ at present). By moving to a CUDA
base image and installing TensorFlow inside the Bobber image, it might be
possible to reduce the overall image size by several gigabytes with no
change in functionality or performance. This will require a thorough
investigation of the potential impacts of such a change.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
@roclark roclark force-pushed the update-base-image branch 3 times, most recently from 28882a4 to 6b490a8 March 15, 2022 17:59
@roclark
Member Author

roclark commented Mar 16, 2022

I was able to verify functionality on multi-node just now and it appears to work well (though I don't have the fastest storage for this cluster so it isn't scaling, but that's expected). Given this will resolve issue #82, there doesn't appear to be a performance regression, and it works well for single- and multi-node, I will go ahead and move this out of draft and merge.

@roclark roclark marked this pull request as ready for review March 16, 2022 20:08
The Bobber image requires compiling multiple binaries across many
repositories, creating a lot of unnecessary files which bloats the
image. By using multi-stage builds, much of the compilation and
dependency installation can be done in a beefier base image and only the
necessary components can be copied to the final runtime image. This
results in a reduction of several gigabytes in the final image for no
loss of performance or functionality.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Newer versions of FIO changed the way results are displayed, causing
the parser to complain that the results are invalid. The newer versions
contain an extra value in the results line which can safely be ignored.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
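The parser change itself is not shown in this conversation, so the following is only a minimal sketch of the general idea in the commit message above: match the fields the parser needs and leave the end of the line unanchored, so that an extra trailing value is ignored instead of being treated as invalid. The pattern, the `parse_result_line` helper, and the two sample lines are hypothetical illustrations, not Bobber's actual code or captured fio output.

```python
import re

# Hypothetical pattern for a fio-style results line. It matches only the
# leading fields and is not anchored at the end of the line, so anything
# a newer version appends after those fields is simply ignored.
RESULT_RE = re.compile(
    r"^\s*(?P<op>read|write):\s*"
    r"IOPS=(?P<iops>[\d.]+)(?P<iops_unit>[kKM]?),\s*"
    r"BW=(?P<bw>[\d.]+)(?P<bw_unit>[KMGT]?i?B/s)"
)

# Multipliers for the optional IOPS suffix.
IOPS_SCALE = {"": 1, "k": 1e3, "K": 1e3, "M": 1e6}


def parse_result_line(line):
    """Return (op, iops, bandwidth, bandwidth_unit), or None if no match."""
    match = RESULT_RE.match(line)
    if match is None:
        return None
    iops = float(match["iops"]) * IOPS_SCALE[match["iops_unit"]]
    return match["op"], iops, float(match["bw"]), match["bw_unit"]


# Illustrative sample lines (assumed layout): the second carries an extra
# trailing value that the first does not.
older = "  write: IOPS=316k, BW=1234MiB/s (1294MB/s)(72.3GiB/60001msec)"
newer = older + "; 0 zone resets"
```

With this approach both `parse_result_line(older)` and `parse_result_line(newer)` return the same tuple, while a line that lacks the expected fields still returns `None` and can be reported as invalid.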

@fredvx fredvx left a comment

LGTM.

@roclark roclark merged commit 0d2808e into main Mar 21, 2022
@roclark roclark deleted the update-base-image branch March 21, 2022 21:15
@roclark roclark added this to the Release 6.3.1 milestone Mar 22, 2022
Labels
docker: Any items related to the Dockerfile or running and building the image
enhancement: New feature or request
Development

Successfully merging this pull request may close these issues.

Error while building container
Investigate replacing base image with CUDA image
3 participants