
[QST] Setting Exclusive_Mode on GPUs making it unavailable for ML workloads #5338

Answered by viadea
Niharikadutta asked this question in General

@Niharikadutta @tgravescs As discussed today, if the ML job (Horovod on Spark) needs to run after the ETL portion has finished, we may need to split the GPU memory into two parts: one for the ETL job (the Spark executor) and one for the Horovod job (a separate Python process that also uses the GPU).

For example, on a T4 with 16 GB of GPU memory, we could allocate 6 GB to the Spark executor and leave 10 GB for everything else (including the ML workload, the Horovod job in this case).
The 10 GB can be kept out of the Spark executor's pool by using the spark.rapids.memory.gpu.reserve parameter.
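
A minimal sketch of that split, assuming the session is started from PySpark and that spark.rapids.memory.gpu.reserve takes a byte count; the app name and exact values below are illustrative, not taken from this thread:

```python
from pyspark.sql import SparkSession

# Illustrative values for a 16 GB T4: keep ~10 GB out of the memory pool used
# by the RAPIDS Accelerator so a separate Horovod (Python) process can use it,
# leaving ~6 GB for the Spark executor's ETL work.
# Assumes the RAPIDS Accelerator jar is already on the classpath.
RESERVE_FOR_HOROVOD = 10 * 1024**3  # bytes; assumed unit for this config

spark = (
    SparkSession.builder
    .appName("etl-then-horovod")  # hypothetical app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.rapids.memory.gpu.reserve", str(RESERVE_FOR_HOROVOD))
    .getOrCreate()
)
```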

To check whether the Horovod issue is caused by insufficient GPU memory, we can run one test:

  1. Reserve most of the GPU memory for Horovod. (Say spark.rapids.memory.gpu.reserve set to…
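
The concrete value is cut off above. As a hedged illustration of that test, one might raise the reserve so that most of the 16 GB stays outside the executor's pool; the 14 GiB figure below is an assumption, not the value from the original reply:

```python
# Hypothetical test setting: reserve most of the T4's memory for Horovod.
# 14 GiB is illustrative; the exact value in the original answer is truncated.
# The setting is read at executor startup, so it must be applied before the
# Spark session (and its executors) are created.
TEST_RESERVE_BYTES = 14 * 1024**3
# .config("spark.rapids.memory.gpu.reserve", str(TEST_RESERVE_BYTES))
```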

Replies: 11 comments

Answer selected by sameerz
Labels: question (Further information is requested)
3 participants

This discussion was converted from issue #3694 on April 27, 2022 16:46.