`libdevice not found at ./libdevice.10.bc` error with tensorflow on GPU node [SOLVED] #61

jbusecke (Maintainer) · 2023-04-24T21:37:28Z

Some of you might run into issues when using TensorFlow on the GPU node.

Take this example (thanks to @kdlamb):

# Simple example code from https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/
# mlp for binary classification
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)

# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# determine the number of input features
n_features = X_train.shape[1]

# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# fit the model
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)

Running this raises an error on the last line of the code (the `model.fit` call):

2023-04-24 21:37:07.897895: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  /srv/conda/envs/notebook
  /usr/local/cuda-11.2
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-04-24 21:37:07.899614: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.899891: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.919090: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.919300: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.938775: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.939026: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.959816: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.960046: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.979382: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.979626: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-24 21:37:07.998499: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-24 21:37:07.998721: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
Cell In[5], line 35
32 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
34 # fit the model
---> 35 model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # tf.debugging.disable_traceback_filtering()
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb

File /srv/conda/envs/notebook/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:52, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
50 try:
51 ctx.ensure_initialized()
---> 52 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
53 inputs, attrs, num_outputs)
54 except core._NotOkStatusException as e:
55 if name is not None:

InternalError: Graph execution error:

Detected at node 'StatefulPartitionedCall_4' defined at (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
app.launch_new_instance()
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in launch_instance
app.start()
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 725, in start
self.io_loop.start()
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 215, in start
self.asyncio_loop.run_forever()
File "/srv/conda/envs/notebook/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/srv/conda/envs/notebook/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
handle._run()
File "/srv/conda/envs/notebook/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
await self.process_one()
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 502, in process_one
await dispatch(*args)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
await result
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
reply_content = await reply_content
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
res = shell.run_cell(
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
return super().run_cell(*args, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
result = self._run_cell(
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
result = runner(coro)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in pseudo_sync_runner
coro.send(None)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
if await self.run_code(code, result, async_=asy):
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_648/2483036698.py", line 35, in <module>
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.internal.distribute.interim.maybe_merge_call(
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_4'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_4}}]] [Op:__inference_train_function_1009]

The problem does not occur when running the same code on a CPU node.
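To see which of the directories listed in the warning actually contains libdevice, a small check along these lines may help. It is only a sketch: the helper name `find_libdevice` is made up for illustration, and the directory list is copied from the "Searched for CUDA in the following directories" message above. It looks for `libdevice*.bc` files under `nvvm/libdevice` in each candidate directory, which is the layout the warning refers to as `${CUDA_DIR}/nvvm/libdevice`.

```python
from pathlib import Path

# Directories copied from the warning in the log above.
SEARCHED_DIRS = [
    "/srv/conda/envs/notebook",
    "/usr/local/cuda-11.2",
    "/usr/local/cuda",
    ".",
]


def find_libdevice(cuda_dirs):
    """Map each candidate CUDA root to the libdevice bitcode files found in it.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    found = {}
    for cuda_dir in cuda_dirs:
        libdevice_dir = Path(cuda_dir) / "nvvm" / "libdevice"
        if libdevice_dir.is_dir():
            files = sorted(str(p) for p in libdevice_dir.glob("libdevice*.bc"))
        else:
            # Directory is missing entirely, so nothing can be found here.
            files = []
        found[str(cuda_dir)] = files
    return found


if __name__ == "__main__":
    for cuda_dir, files in find_libdevice(SEARCHED_DIRS).items():
        print(cuda_dir, "->", files if files else "no libdevice found")
```

If every directory reports no libdevice file, that matches the failure shown in the traceback.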


jbusecke (Maintainer, Author) · 2023-04-24T21:44:54Z

This error is caused by a missing CUDA library. Apparently, due to licensing issues, it cannot be included in the current pangeo-docker-image.

The solution (as suggested here https://github.com/pangeo-data/pangeo-docker-images/blob/614419aa55eea9200876357626eb498b17a27755/README.md?plain=1#L173) is to manually install the `cuda-nvcc` package, which provides the missing `nvvm/libdevice` files:

mamba install -c nvidia cuda-nvcc

You additionally need to set the CUDA directory as an environment variable in your notebook:

import os
os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/srv/conda/envs/notebook'
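If `XLA_FLAGS` might already hold other flags, overwriting it as above would discard them. The sketch below shows one way to add or replace only the `--xla_gpu_cuda_data_dir` entry. The helper name `with_cuda_data_dir` is hypothetical, and the path is the conda prefix used in this answer; other images may need a different one. I am not certain at which point XLA parses the variable, so running this before TensorFlow is imported is the cautious choice.

```python
import os


def with_cuda_data_dir(xla_flags, cuda_dir):
    """Return an XLA_FLAGS string with --xla_gpu_cuda_data_dir set to cuda_dir.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    prefix = "--xla_gpu_cuda_data_dir="
    # Keep unrelated flags, drop any earlier setting of the same flag.
    kept = [flag for flag in xla_flags.split() if not flag.startswith(prefix)]
    kept.append(prefix + cuda_dir)
    return " ".join(kept)


# Assumption: the conda prefix from this answer; adjust for other environments.
os.environ["XLA_FLAGS"] = with_cuda_data_dir(
    os.environ.get("XLA_FLAGS", ""), "/srv/conda/envs/notebook"
)

# Import TensorFlow only after this point.
```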
