Replace base image with CUDA image #39

roclark · 2021-02-23T22:59:09Z

The TensorFlow image on NGC is quite large (10+ GB) and contains a lot of unnecessary data. Using one of the CUDA images as a base removes several gigabytes of data with no change in performance or functionality compared with the existing Bobber image. To further reduce the image size, multi-stage builds let us compile many of the testing tools, such as NCCL and mdtest, inside a beefier build image while copying only the necessary binaries to the final, slimmer runtime image.

Closes #2
Fixes #82

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark · 2021-02-23T23:00:45Z

I've run this on a single node so far, and the results are indistinguishable between the 6.1.1 image and this new, lighter image. I'm marking this as a draft until I can do some multi-node testing to verify functionality, though I expect that to yield the same results given that single-node runs also use mpirun.

joehandzik · 2021-02-24T22:38:00Z

My only real concern is whether we lost some network functionality somehow, but I think DeepOps uses the base CUDA image for some multinode testing, so I'm hopeful. As you say, let's hold this until we can do a multinode test.

Bobber currently uses a TensorFlow image from NGC as the base to use some of the TensorFlow functionality in the tests. While this is efficient, the Bobber image is quite large (12GB+ at present). By moving to a CUDA base image and installing TensorFlow inside the Bobber image, it might be possible to reduce the overall image size by several gigabytes with no change in functionality or performance. This will require a thorough investigation of the potential impacts of such a change.

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark · 2022-03-16T20:07:47Z

I was able to verify multi-node functionality just now, and it appears to work well. (I don't have the fastest storage for this cluster, so it isn't scaling, but that's expected.) Given that this will resolve issue #82, there doesn't appear to be a performance regression, and it works well for both single- and multi-node runs, I will go ahead and move this out of draft and merge.

The Bobber image requires compiling multiple binaries across many repositories, which creates a lot of unnecessary files that bloat the image. By using multi-stage builds, much of the compilation and dependency installation can be done in a beefier base image, and only the necessary components are copied to the final runtime image. This reduces the final image by several gigabytes with no loss of performance or functionality.

Signed-Off-By: Robert Clark <roclark@nvidia.com>

Newer versions of FIO changed the way results are displayed, causing the parser to complain that the results are invalid. The newer versions contain an extra value in the results line which can safely be ignored.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
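To illustrate the idea of ignoring an extra value in a results line, here is a minimal sketch of a tolerant parser. It is not Bobber's actual parser, and the `key=value` line format, the field names (`iops`, `bandwidth`, `zones`), and the function name `parse_result_line` are all made up for the example; they do not reproduce real FIO output.

```python
def parse_result_line(line, expected_fields=("iops", "bandwidth")):
    """Parse comma-separated key=value pairs, keeping only expected fields.

    Hypothetical format for illustration only; real FIO output differs.
    """
    values = {}
    for token in line.split(","):
        key, sep, value = token.strip().partition("=")
        if not sep:
            continue  # skip tokens that are not key=value pairs
        key = key.strip().lower()
        if key in expected_fields:
            values[key] = float(value)
        # Any other key (for example, one added by a newer tool
        # version) is ignored rather than treated as an error.
    missing = [name for name in expected_fields if name not in values]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return values


# An "old-style" line and a "new-style" line with one extra value
# (both invented) parse to the same result.
old_style = parse_result_line("iops=1000, bandwidth=250.5")
new_style = parse_result_line("iops=1000, bandwidth=250.5, zones=0")
print(old_style == new_style)
```

The design choice in this sketch is to validate that the required fields are present rather than that the line has an exact number of values, so additional fields do not invalidate otherwise good results.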

fredvx

LGTM.

roclark added the enhancement and docker labels Feb 23, 2021

roclark requested review from joehandzik and fredvx February 23, 2021 22:59

roclark self-assigned this Feb 23, 2021

roclark force-pushed the update-base-image branch from c447d28 to 6fc44f0 Compare April 5, 2021 19:53

roclark force-pushed the update-base-image branch 3 times, most recently from 28882a4 to 6b490a8 Compare March 15, 2022 17:59

roclark marked this pull request as ready for review March 16, 2022 20:08

roclark added 2 commits March 16, 2022 15:10

roclark force-pushed the update-base-image branch from 6b490a8 to e9d7d1a Compare March 16, 2022 20:10

fredvx approved these changes Mar 17, 2022

joehandzik approved these changes Mar 21, 2022

roclark merged commit 0d2808e into main Mar 21, 2022

roclark deleted the update-base-image branch March 21, 2022 21:15

roclark added this to the Release 6.3.1 milestone Mar 22, 2022
