Is it possible to use MODNet with GPU acceleration? #226

Open
naik-aakash opened this issue Oct 10, 2024 · 8 comments

Comments

@naik-aakash
Contributor

Hi @ml-evs, @ppdebreuck, I have been trying to use MODNet with a GPU but cannot get it to work. I had to use a different TensorFlow version than the one pinned by modnet; I am using tensorflow==2.15.0 (with 2.11.0, GPUs are not detected at all on my system).

The system has CUDA 12.4 installed.

It always fails with the following error:

"CUDA-capable device(s) is/are busy or unavailable", or it fails to set the CUDA device.

I ran the following to check whether TensorFlow is installed correctly, and it seems to be the case: TensorFlow itself works fine.

import os

# Restrict TensorFlow to the first GPU; this must be set before TF initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
tf.debugging.set_log_device_placement(True)  # log the device each op is placed on

# Place a small matrix multiplication explicitly on the GPU.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))  # expected output: [[22. 28.] [49. 64.]]

I'm not able to figure out what the problem could be here. Any help in this regard would be great!

@ml-evs
Collaborator

ml-evs commented Oct 10, 2024

Could you post a minimal failing MODNet script too? I've never had issues trying it out on my GPU with TF 2.11 (I doubt much of MODNet works with a more recent TF, though we don't really have the resources to update it).

@naik-aakash
Contributor Author

Hi @ml-evs, I will post it later today; I need to wait for the GPUs to be free again.

Could you also share your environment file? I can test with that too, if that helps.

@ml-evs
Collaborator

ml-evs commented Oct 10, 2024

I haven't tried for years, as I never saw any worthwhile speed-up for the size of networks I was creating, so unfortunately I don't have one to hand. But I can at least try to reproduce your issue locally.

@naik-aakash
Contributor Author

Hi @ml-evs, I realized the example script is almost the same as one of the example notebooks, just with additional features appended to the dataframe.

I just have these additional lines at the top to limit the run to one GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU
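
A quick sanity check for this restriction (a minimal sketch using the standard TF 2.x device-listing API):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before TensorFlow is imported

import tensorflow as tf

# With the restriction in place, exactly one GPU should be listed here.
print(tf.config.list_physical_devices("GPU"))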

https://github.com/ppdebreuck/modnet/blob/master/example_notebooks/training_multi_data.ipynb

On checking a bit further, I find that this error occurs especially when I try to use the GA. In the same notebook, everything seems to run fine up to the following line:

# fitting
model.fit(train_data)

But when I try to run the genetic-algorithm-based hyperparameter optimization, I get a "cudaSetDevice() on GPU:0 failed. Status: CUDA-capable device(s) is/are busy or unavailable" error. So it seems the GPU somehow gets registered once the model is initialized and is not released for the next iterations.

I also tried model.fit_preset(train_data); same error.

Also, I just want to mention that when I ran these tests, I ran the fit commands independently after restarting the kernel, so the GPU was free in every case. It worked only when not using presets or the genetic algorithm.
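
For reference, the GA run that fails is invoked roughly as below. This is a sketch assuming the FitGenetic interface from modnet.hyper_opt; the exact constructor and parameter names may differ between modnet versions:

from modnet.hyper_opt import FitGenetic  # assumed import path

# train_data is the same MODData used for model.fit above.
# n_jobs > 1 spawns worker processes, each of which tries to
# initialize the GPU that the parent process already holds.
ga = FitGenetic(train_data)                       # assumed constructor
model = ga.run(size_pop=20, num_generations=10,   # assumed parameter names
               n_jobs=4)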

@ml-evs
Collaborator

ml-evs commented Oct 10, 2024

Also, I just want to mention that when I ran these tests, I ran the fit commands independently after restarting the kernel, so the GPU was free in every case. It worked only when not using presets or the genetic algorithm.

I don't have time to look into this fully now, but my guess is that the first preset is running and the rest see that the device is busy: we use Python multiprocessing, and typically TF will allocate the entire GPU to the first process. You could try e.g. tf.debugging.set_log_device_placement(True) to confirm. If this is the case, you might be able to fiddle around with set_memory_growth to allow each process to use a small amount of memory initially. I'm not an expert on this, but in the past the gains we saw from using a GPU for small MODNet models were very small (or non-existent), which is to say I'd be interested in your results if you can get this working!
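
A minimal sketch of that memory-growth approach, using the standard TF 2.x API (whether it actually resolves the multiprocessing clash here is untested):

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving the whole
# device up front; this must run in each worker process before any op
# touches the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)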

@ppdebreuck
Owner

Hey all! Sorry for the late reply. I would not spend time on this, @naik-aakash: (i) MODNet uses small networks (intended for small datasets), with no benefit from GPU training (unless you force big architectures, but that would be a very specific use case); (ii) we are not fully happy with TF and would like to migrate to Torch or JAX. Matthew and I don't have time for this, but we might have a student doing it next semester (fingers crossed) ;p

@ml-evs
Collaborator

ml-evs commented Oct 21, 2024

Just to add that I'd still love to find time to update MODNet to the latest Keras core, which has Torch/JAX and TF backends (and hopefully should be quite an easy translation job!) -- see #158 for some details.
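
For context, a minimal sketch of how backend selection works in multi-backend Keras 3 (illustrative only; this is not MODNet code):

import os

# Keras 3 picks its backend from this environment variable at import time;
# valid values include "tensorflow", "torch", and "jax".
os.environ["KERAS_BACKEND"] = "torch"

import keras

# The same keras.layers code then runs unchanged on the chosen backend.
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])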

@ppdebreuck
Owner

Great idea, and probably easier than translating to PyTorch directly, given our frequent usage of tf.keras.
