Some of you might run into issues when using TensorFlow on the GPU node. Take this example (thanks to @kdlamb):
# Simple example code from https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/
# mlp for binary classification
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# determine the number of input features
n_features = X_train.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)
Running this gives an error on the last line of the code:
2023-04-24 21:37:07.897895: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
/srv/conda/envs/notebook
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-04-24 21:37:07.899614: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.899891: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.919090: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.919300: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.938775: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.939026: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.959816: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.960046: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.979382: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.979626: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.998499: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.998721: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
Cell In[5], line 35
32 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
34 # fit the model
---> 35 model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:52, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_4' defined at (most recent call last):

When running this on a CPU node, the problem does not occur.
Replies: 1 comment
The cause of this error is a missing cuda library. Apparently, due to licensing issues, this cannot be included with the current pangeo-docker-image.
The solution (as suggested here https://github.com/pangeo-data/pangeo-docker-images/blob/614419aa55eea9200876357626eb498b17a27755/README.md?plain=1#L173) is to manually install cuda-nvvm, with
mamba install -c nvidia cuda-nvcc
You need to additionally set the cuda directory as an environment variable in your notebook:
import os
os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/srv/conda/envs/notebook'
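If you want to confirm that the install actually put libdevice where XLA looks for it, something like the following might help. It is only a sketch: the helper names `find_libdevice` and `point_xla_at` are made up for illustration, the `nvvm/libdevice` layout and the `libdevice*.bc` file name are taken from the warning messages above, and `/srv/conda/envs/notebook` is the environment path from this thread (adjust it if yours differs). To be safe, run it before importing tensorflow.

```python
import os
from pathlib import Path
from typing import Optional


def find_libdevice(cuda_dir: str) -> Optional[Path]:
    """Return the first libdevice*.bc file under cuda_dir/nvvm/libdevice, or None."""
    libdevice_dir = Path(cuda_dir) / "nvvm" / "libdevice"
    if not libdevice_dir.is_dir():
        return None
    matches = sorted(libdevice_dir.glob("libdevice*.bc"))
    return matches[0] if matches else None


def point_xla_at(cuda_dir: str) -> bool:
    """Add the cuda data dir to XLA_FLAGS if libdevice is present there.

    Returns True if libdevice was found (and the flag is set), False otherwise.
    Any flags already in XLA_FLAGS are kept.
    """
    if find_libdevice(cuda_dir) is None:
        return False
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    existing = os.environ.get("XLA_FLAGS", "")
    if flag not in existing.split():
        os.environ["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return True


# Assumed environment path, as used in this thread.
CUDA_DIR = "/srv/conda/envs/notebook"
if point_xla_at(CUDA_DIR):
    print("libdevice found, XLA_FLAGS =", os.environ["XLA_FLAGS"])
else:
    print("libdevice not found under", CUDA_DIR, "(has cuda-nvcc been installed?)")
```

If the second message is printed, the mamba install step above has probably not been run in this environment yet.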