
Including tfjs_graph_converter disables GPU #35

Closed
bhelm opened this issue Feb 24, 2022 · 5 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@bhelm

bhelm commented Feb 24, 2022

Demo Script:

import os
import tensorflow as tf
print("cuda devices before include", os.getenv('CUDA_VISIBLE_DEVICES'))

import tfjs_graph_converter.api as tfjs_api
import tfjs_graph_converter.util as tfjs_util

print("cuda devices after include", os.getenv('CUDA_VISIBLE_DEVICES'))
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

output:

cuda devices before include None
cuda devices after include -1
Num GPUs: 0

An os.unsetenv('CUDA_VISIBLE_DEVICES') directly after the import works around the problem.
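For reference, a minimal sketch of that workaround (note: deleting the key via os.environ.pop() also unsets the process-level variable, whereas os.unsetenv alone does not update os.environ; this has to run before TensorFlow first enumerates its devices, which happens lazily):

import os

import tensorflow as tf
import tfjs_graph_converter.api as tfjs_api  # sets CUDA_VISIBLE_DEVICES to "-1" on import

# Drop the variable again before TensorFlow first enumerates devices.
os.environ.pop('CUDA_VISIBLE_DEVICES', None)

print("Num GPUs:", len(tf.config.list_physical_devices('GPU')))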

My System: Debian 10, Python 3.8/3.9 with

tensorflow                   2.8.0
tensorflow-hub               0.12.0
tensorflow-io-gcs-filesystem 0.24.0
tensorflowjs                 3.13.0
tfjs-graph-converter         1.4.2

This is where it happens in the source, tfjs_graph_converter/__init__.py:

# disable CUDA devices - we only want the CPU do work with data
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications. In my case, this caused virtual-webcam not to use GPU acceleration even though it was available, leading to high load and high latency without any hint as to why. It took me 6 hours to figure this out.
It may be possible to scope the "disable GPU for conversion" behaviour to the conversion functions only, if it is really required, or to wrap it in a function so that the developer using the library can decide.
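A hypothetical sketch of what such a scoped helper could look like (the name cpu_only is made up for illustration; it is not part of the library):

import os
from contextlib import contextmanager

@contextmanager
def cpu_only():
    """Hypothetical helper: hide CUDA devices only for the enclosed block.

    Caveat: TensorFlow caches its device list on first use, so restoring
    the variable afterwards only has an effect if TensorFlow has not
    enumerated devices inside the block.
    """
    previous = os.environ.get('CUDA_VISIBLE_DEVICES')
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop('CUDA_VISIBLE_DEVICES', None)
        else:
            os.environ['CUDA_VISIBLE_DEVICES'] = previous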

Thank You.

@patlevin
Owner

> Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications.

Well, at the time this library was started, there was no other way to (reliably) control the device used by Tensorflow.
Since configuration happened at initialisation time (i.e. once tf was first imported), changing environment variables was the only option.

Since os.environ only affects the current process and this library wasn't meant to be used as a loader but as a converter, I didn't see a problem with this.

To my knowledge, using environment variables is still the only way to control the log level from Python, but I will take another look at limiting visible devices via API calls and whether that has changed in the newer versions.

@bhelm
Author

bhelm commented Feb 24, 2022

I understand. Maybe using with tf.device("/cpu:0") would work, but I'm just guessing.
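For illustration, that suggestion amounts to something like the following sketch (as the next comment explains, this only pins TF2 eager ops to a device and does not reach the converter's low-level code paths):

import tensorflow as tf

# Pin eager ops in this scope to the CPU.
with tf.device('/cpu:0'):
    x = tf.constant([1.0, 2.0])
    y = x * 2.0
print(y.device)  # e.g. /job:localhost/replica:0/task:0/device:CPU:0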

Thank you :)

@patlevin
Owner

patlevin commented Feb 24, 2022

After about an hour of testing and another hour or so digging through both the Python and the C++ code base of Tensorflow, I can now say with confidence that there's no way to reliably disable NVIDIA GPUs using the TF Python API.

The converter doesn't use the TF2 interface, because it converts graph models, which aren't part of the TF2 compute model.
It needs to access low-level APIs that aren't affected by the context manager used by tf.device or by TF1.x's ConfigProto.

The graph model and the internal optimizer ("grappler" in TF-lingo) don't know about the Python API and rely on the C++ platform manager instead. So while the Python API can be used to set the device for model training and inference, it doesn't affect any low-level processing like graph manipulation.

The environment variable (CUDA_VISIBLE_DEVICES) isn't even known by Tensorflow - it's an NVIDIA driver configuration.

The reason the converter needs to run on the CPU is memory and compute capability. The CUDA driver likes to lock up if models are converted on weaker (or just older) GPUs, and to error out if not enough memory is available. Running the converter on the CPU only (or on ROCm - I haven't tested on AMD or Intel GPUs using ROCm yet) ensures that converting doesn't randomly fail because the "wrong" NVIDIA GPU is installed on the system.

I will get back to this later tomorrow and test whether resetting the environment variable works for re-enabling CUDA after converting is finished.

@patlevin patlevin self-assigned this Feb 24, 2022
@patlevin patlevin added the documentation and enhancement labels Feb 24, 2022
@patlevin
Owner

patlevin commented Feb 25, 2022

I've added the option to enable CUDA. This will be available in the next release.

Basically, the default will still be to run any script that imports the converter in CPU-only mode, but CUDA can optionally be enabled like so:

from typing import List
import sys
import tfjs_graph_converter as tfjs


def main(args: List[str]) -> None:
    if '--enable-cuda' in args:
        tfjs.api.enable_cuda()
    graph = tfjs.api.load_graph_model('models/some_tfjs_graph_model')
    model = tfjs.api.graph_to_function_v2(graph)
    inputs = ...
    # inference will run on CUDA-device if available
    result = model(inputs)
    # CUDA-capable GPU will be available for use with other libraries and tf functions, too


if __name__ == '__main__':
    main(sys.argv)

It doesn't really matter when and where enable_cuda() is called, as long as it happens before any Tensorflow or graph converter function is called.

I'll document the change and package a release in a bit.

@bhelm
Author

bhelm commented Mar 7, 2022

OK, thank you for taking care of this, I think that should do the trick. 👍
