Can not build Dockerfile.example.tensorflow-mpi #1336
Comments
By the way, is "OMPI_SKIP_MPICXX=1" suitable? It means skipping the MPI C++ bindings when building.
I will submit a PR after I have tested it.
OK, after fixing the Docker image, I'll run the MPI TensorFlow example again.
Is it possible to solve the problem by upgrading TensorFlow to a higher version?
I tried that when I tested the TensorFlow benchmark on the cifar-10 example. With TensorFlow versions above 1.4 (1.5, 1.7, 1.8), even though I used cuda9.0+cudnn7, there were still errors due to the NVIDIA driver version.
2018-09-14 10:36:49.520632: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
Now I can build the Docker image, but it seems that Open MPI was not installed correctly.
By the way, I can run the same code with the same data using TensorFlow 1.4 without MPI.
Maybe there is an error in the MPI base image. Is there a way to test it?
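One rough way to test the base image is to check, from inside the container, that the Open MPI tools are on PATH and whether Open MPI reports CUDA support. The sketch below is only an illustration and assumes Python 3.7+ is available in the image. The function names are hypothetical. The key `mpi_built_with_cuda_support` is the one Open MPI prints in `ompi_info --parsable --all`.

```python
import shutil
import subprocess

# Tools that an Open MPI installation normally puts on PATH.
MPI_TOOLS = ("mpirun", "mpicc", "ompi_info")


def find_mpi_tools(names=MPI_TOOLS):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def cuda_support_from_ompi_info(text):
    """Read the CUDA-support flag from `ompi_info --parsable --all` output.

    Returns True or False, or None if the output has no such line.
    """
    for line in text.splitlines():
        if "mpi_built_with_cuda_support:value" in line:
            return line.rsplit(":", 1)[-1].strip().lower() == "true"
    return None


def report():
    """Print which tools were found and, if possible, the CUDA-support flag."""
    tools = find_mpi_tools()
    for name, path in tools.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    if tools["ompi_info"]:
        try:
            out = subprocess.run(
                [tools["ompi_info"], "--parsable", "--all"],
                capture_output=True, text=True, timeout=60,
            ).stdout
        except (subprocess.SubprocessError, OSError) as exc:
            print("could not run ompi_info:", exc)
            return
        print("built with CUDA support:", cuda_support_from_ompi_info(out))


if __name__ == "__main__":
    report()
```

If `mpirun` or `ompi_info` is reported as NOT FOUND, the problem is in the base image itself rather than in the TensorFlow build.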
Hi @qyyy, did you submit the job to PAI or run the Docker container directly on the host?
It seems the driver path wasn't mounted correctly in
It seems the GPU devices weren't mounted correctly in
Because Open MPI is built with CUDA, GPU problems will also cause mpi_init to fail.
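A quick way to see what the container can actually reach is to list the NVIDIA device nodes and look for the CUDA driver library. This is a minimal sketch, assuming a Linux container where device nodes appear as /dev/nvidia*. The function name is hypothetical.

```python
import ctypes.util
import glob


def gpu_visibility():
    """Collect basic evidence that GPU devices and the driver library are visible."""
    return {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if mounted.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Name of the CUDA driver library if the loader can find it, else None.
        "libcuda": ctypes.util.find_library("cuda"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility().items():
        print(f"{key}: {value}")
```

An empty device list or a missing libcuda inside the job container would be consistent with the "failed call to cuInit" error above.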
I ran it on PAI. I tried running it again with 16 GB of memory, but neither the server nor the worker ran correctly. The error message is the same as above. I have tested the cntk_mpi example, and it works without problems. So I think the error is still in the TensorFlow build process.
@qyyy Could you please try this simpler MPI example code? Here's an example of its JSON file:
{
  "jobName": "tensorflow-mpi",
  "image": "openpai/pai.example.tensorflow-mpi",
  "taskRoles": [
    {
      "name": "mpi",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 0,
      "command": "mpirun --allow-run-as-root -np 2 --host tf-0,tf-1 python tf-mpi.py",
      "minSucceededTaskCount": 1
    },
    {
      "name": "tf",
      "taskNumber": 2,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "/bin/bash"
    }
  ]
}
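In this config the mpirun command has to agree with the worker role: `-np 2` and `--host tf-0,tf-1` match the "tf" role's taskNumber of 2. The sketch below checks that consistency before submitting. It assumes that worker hosts are named "<role name>-<index>", which is inferred from this example only, and the function names are hypothetical.

```python
import json
import shlex

# Trimmed copy of the job config above (only the fields the check uses).
JOB = """
{
  "jobName": "tensorflow-mpi",
  "taskRoles": [
    {"name": "mpi", "taskNumber": 1,
     "command": "mpirun --allow-run-as-root -np 2 --host tf-0,tf-1 python tf-mpi.py"},
    {"name": "tf", "taskNumber": 2, "command": "/bin/bash"}
  ]
}
"""


def mpirun_hosts(command):
    """Return (process count, host list) parsed from an mpirun command line."""
    args = shlex.split(command)
    np_value, hosts = None, []
    for i, arg in enumerate(args[:-1]):
        if arg in ("-np", "-n"):
            np_value = int(args[i + 1])
        elif arg in ("--host", "-host", "-H"):
            hosts = args[i + 1].split(",")
    return np_value, hosts


def check_job(config_text, launcher="mpi", workers="tf"):
    """Return a list of mismatches between the mpirun command and the worker role."""
    roles = {role["name"]: role for role in json.loads(config_text)["taskRoles"]}
    np_value, hosts = mpirun_hosts(roles[launcher]["command"])
    # Assumption: workers are reachable as "<role>-0", "<role>-1", ...
    expected = {f"{workers}-{i}" for i in range(roles[workers]["taskNumber"])}
    problems = []
    if np_value != len(hosts):
        problems.append(f"-np is {np_value} but {len(hosts)} hosts are listed")
    if set(hosts) != expected:
        problems.append(f"hosts {sorted(hosts)} do not match {sorted(expected)}")
    return problems


if __name__ == "__main__":
    print(check_job(JOB) or "config looks consistent")
```

An empty list means the process count, the host list, and the worker taskNumber agree; it says nothing about whether MPI itself works in the image.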
Where is the tf-mpi.py file?
Here's tf-mpi.py
Traceback (most recent call last):
I have tried this, but it doesn't work.
I have filed an issue with TensorFlow. Here is the issue.
It's caused by the TensorFlow version; you can use
I have tried those two TensorFlow MPI examples. The benchmark example succeeds with
There are a few docs on how to build TensorFlow with Open MPI, including the official README. The only switch I can be sure of is
I will figure out how to compile and use TensorFlow with Open MPI, which may take a while.
By the way, Dockerfile.example.cntk-mpi cannot be built correctly, either.
It seems that the hdfs-mount installation process is problematic, both in Dockerfile.example.cntk-mpi and in Dockerfile.example.cntk.
@abuccts - per discussion, the team won't support this Docker image anymore; please clean up the Dockerfile and related samples.
Remove TensorFlow mpi example which cannot be run currently. Closes #1336.
I'd like to build the Dockerfile.example.tensorflow-mpi image myself. I followed these steps:
1. Build the Dockerfile.build.mpi image.
2. Build the Dockerfile.example.tensorflow-mpi image.
But I ran into errors during step 2. I cannot execute the command
bazel build -c opt --config=cuda \
    --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
    tensorflow/tools/pip_package:build_pip_package
There were several errors, and I fixed some of them:
1. apt-get update && apt-get -y upgrade && apt-get install -y git
2. pip install numpy
However, there is still one error that I cannot solve: "openmpi/ompi/mpi/cxx/mpicxx.h: No such file or directory"
I have tried several methods to solve it according to this page:
1. Update the bazel version to 0.5.4
2. export OMPI_SKIP_MPICXX=1
3. export CC_OPT_FLAGS="-DOMPI_SKIP_MPICXX=1 -march=native"
But none of them work.
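Since the error names a header the compiler cannot find, a first check is whether mpicxx.h exists anywhere in the image at all. The sketch below walks a directory tree and lists every match. The default root /usr/local is an assumption about where Open MPI was installed, and the function name is hypothetical.

```python
import os


def find_header(root, name="mpicxx.h"):
    """Return the sorted paths of every file called `name` under `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Assumed install prefix; change it to wherever Open MPI lives in the image.
    for path in find_header("/usr/local"):
        print(path)
```

An empty result would suggest that the header was never installed, rather than merely missing from the include path.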