
Including tfjs_graph_converter disables GPU #35

Closed
bhelm opened this issue Feb 24, 2022 · 5 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@bhelm

bhelm commented Feb 24, 2022

Demo Script:

import os
import tensorflow as tf
print("cuda devices before include", os.getenv('CUDA_VISIBLE_DEVICES'))

import tfjs_graph_converter.api as tfjs_api
import tfjs_graph_converter.util as tfjs_util

print("cuda devices after include", os.getenv('CUDA_VISIBLE_DEVICES'))
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

output:

cuda devices before include None
cuda devices after include -1
Num GPUs: 0

An os.unsetenv('CUDA_VISIBLE_DEVICES') directly after the import works around the problem.
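For reference, a minimal sketch of that workaround (note: deleting the key via os.environ.pop() also unsets the process-level variable, whereas os.unsetenv alone does not update os.environ; this has to run before TensorFlow first enumerates its devices, which happens lazily):

import os

import tensorflow as tf
import tfjs_graph_converter.api as tfjs_api  # sets CUDA_VISIBLE_DEVICES to "-1" on import

# Drop the variable again before TensorFlow first enumerates devices.
os.environ.pop('CUDA_VISIBLE_DEVICES', None)

print("Num GPUs:", len(tf.config.list_physical_devices('GPU')))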

My System: Debian 10, Python 3.8/3.9 with

tensorflow                   2.8.0
tensorflow-hub               0.12.0
tensorflow-io-gcs-filesystem 0.24.0
tensorflowjs                 3.13.0
tfjs-graph-converter         1.4.2

This is where it happens in the source, tfjs_graph_converter/__init__.py:

# disable CUDA devices - we only want the CPU do work with data
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications. In my case, this caused virtual-webcam not to use GPU acceleration even though it was available, leading to high load and high latency without any hint as to why. It took me 6 hours to figure this out.
It may be possible to scope the "disable GPU for conversion" behaviour to the conversion functions only, if it is really required, or to wrap it in a function so that the developer using the library can decide.
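A hypothetical sketch of what such a scoped helper could look like (the name cpu_only is made up for illustration; it is not part of the library):

import os
from contextlib import contextmanager

@contextmanager
def cpu_only():
    """Hypothetical helper: hide CUDA devices only for the enclosed block.

    Caveat: TensorFlow caches its device list on first use, so restoring
    the variable afterwards only has an effect if TensorFlow has not
    enumerated devices inside the block.
    """
    previous = os.environ.get('CUDA_VISIBLE_DEVICES')
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop('CUDA_VISIBLE_DEVICES', None)
        else:
            os.environ['CUDA_VISIBLE_DEVICES'] = previous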

Thank You.

@patlevin
Owner

> Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications.

Well, at the time this library was started, there was no other way to (reliably) control the device used by Tensorflow.
Since configuration happened at initialisation time (i.e. once tf was first imported), changing environment variables was the only option.

Since os.environ only affects the current process and this library wasn't meant to be used as a loader but as a converter, I didn't see a problem with this.

To my knowledge, using environment variables is still the only way to control the log level from Python, but I will take another look at limiting visible devices via API calls and whether that has changed in the newer versions.

@bhelm
Author

bhelm commented Feb 24, 2022

I understand. Maybe using with tf.device("/cpu:0") would work, but I'm just guessing.
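For illustration, that suggestion amounts to something like the following sketch (as the next comment explains, this only pins TF2 eager ops to a device and does not reach the converter's low-level code paths):

import tensorflow as tf

# Pin eager ops in this scope to the CPU.
with tf.device('/cpu:0'):
    x = tf.constant([1.0, 2.0])
    y = x * 2.0
print(y.device)  # e.g. /job:localhost/replica:0/task:0/device:CPU:0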

Thank you :)

@patlevin
Owner

patlevin commented Feb 24, 2022

After about an hour of testing and another hour or so digging through both the Python and the C++ code base of Tensorflow, I can now say with confidence that there's no way to reliably disable NVIDIA GPUs using the TF Python API.

The converter doesn't use the TF2 interface, because it converts graph models, which aren't part of the TF2 compute model.
It needs to access low-level APIs that aren't affected by the context manager used by tf.device or by TF1.x's ConfigProto.

The graph model and the internal optimizer ("grappler" in TF-lingo) don't know about the Python API and rely on the C++ platform manager instead. So while the Python API can be used to set the device for model training and inference, it doesn't affect any low-level processing like graph manipulation.

The environment variable (CUDA_VISIBLE_DEVICES) isn't even known by Tensorflow - it's an NVIDIA driver configuration.

The reason the converter needs to run on the CPU is memory and compute capability. The CUDA driver likes to lock up if models are converted on weaker (or just older) GPUs, and to error out if not enough memory is available. Running the converter on the CPU only (or on ROCm - I haven't tested on AMD or Intel GPUs using ROCm yet) ensures that converting doesn't randomly fail because the "wrong" NVIDIA GPU is installed on the system.

I will get back to this later tomorrow and test whether resetting the environment variable works for re-enabling CUDA after converting is finished.

@patlevin patlevin self-assigned this Feb 24, 2022
@patlevin patlevin added the documentation and enhancement labels Feb 24, 2022
@patlevin
Owner

patlevin commented Feb 25, 2022

I've added the option to enable CUDA. This will be available in the next release.

Basically, the default will still be to run any script that imports the converter in CPU-only mode, but CUDA can optionally be enabled like so:

from typing import List
import sys
import tfjs_graph_converter as tfjs


def main(args: List[str]) -> None:
    if '--enable-cuda' in args:
        tfjs.api.enable_cuda()
    graph = tfjs.api.load_graph_model('models/some_tfjs_graph_model')
    model = tfjs.api.graph_to_function_v2(graph)
    inputs = ...
    # inference will run on CUDA-device if available
    result = model(inputs)
    # CUDA-capable GPU will be available for use with other libraries and tf functions, too


if __name__ == '__main__':
    main(sys.argv)

It doesn't really matter when and where enable_cuda() is called, as long as it happens before any Tensorflow or graph converter function is called.

I'll document the change and package a release in a bit.

@bhelm
Author

bhelm commented Mar 7, 2022

OK, thank you for taking care of this, I think that should do the trick. 👍
