This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Can not build Dockerfile.example.tensorflow-mpi #1336

Closed
qyyy opened this issue Sep 12, 2018 · 21 comments

qyyy (Contributor) commented Sep 12, 2018

I'd like to build the Dockerfile.example.tensorflow-mpi image myself, following these steps:
1. Build the Dockerfile.build.mpi image.
2. Build the Dockerfile.example.tensorflow-mpi image.
But I ran into errors during step 2: I cannot execute the command bazel build -c opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" tensorflow/tools/pip_package:build_pip_package
There were several errors, and I fixed some of them by running:
1. apt-get update && apt-get -y upgrade && apt-get install -y git
2. pip install numpy
However, there is still one error I cannot solve: "openmpi/ompi/mpi/cxx/mpicxx.h: No such file or directory"
I have tried several methods to solve it according to this page:
1. Change the Bazel version to 0.5.4.
2. export OMPI_SKIP_MPICXX=1
3. export CC_OPT_FLAGS="-DOMPI_SKIP_MPICXX=1 -march=native"
But none of them work.
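For reference, the attempted environment-variable workarounds (items 2 and 3 above) amount to telling Open MPI's headers to skip the deprecated C++ bindings before the build; a minimal sketch, assuming the TensorFlow source tree and Bazel are already set up:

```shell
# Skip Open MPI's C++ bindings so mpicxx.h is never pulled in during compilation.
export OMPI_SKIP_MPICXX=1
export CC_OPT_FLAGS="-DOMPI_SKIP_MPICXX=1 -march=native"
echo "OMPI_SKIP_MPICXX=$OMPI_SKIP_MPICXX"

# The build itself (requires Bazel and the TensorFlow source tree):
# bazel build -c opt --config=cuda \
#     --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
#     tensorflow/tools/pip_package:build_pip_package
```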

qyyy (Author) commented Sep 12, 2018

By the way, is "OMPI_SKIP_MPICXX=1" appropriate? It means the C++ bindings are skipped when building.

abuccts (Member) commented Sep 13, 2018

  1. Building TensorFlow r1.4 with Bazel 0.11.0 doesn't work. We can cherry-pick tensorflow/tensorflow@3f57956 to fix the Bazel version-check bug, or downgrade Bazel to 0.5.4.
  2. Adding OMPI_SKIP_MPICXX is OK; please refer to tensorflow/tensorflow@f73d7c9#diff-7f5f80d91bf584c6c77b2d1bf874ee9b. TensorFlow r1.4 is missing that change.

I will submit a PR after I have tested it.
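The two options above can be sketched as follows; the Bazel installer filename follows Bazel's release naming and is an assumption here, not something from this thread:

```shell
# Option 1: in the tensorflow source tree, cherry-pick the version-check fix:
#   git cherry-pick 3f57956
# Option 2: pin Bazel to 0.5.4 instead:
BAZEL_VERSION=0.5.4
# wget "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
# chmod +x "bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
# "./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
echo "pinned Bazel version: ${BAZEL_VERSION}"
```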

qyyy (Author) commented Sep 13, 2018

OK. After the Docker image is fixed, I'll run the MPI TensorFlow example again.

fanyangCS (Contributor) commented

Is it possible to solve the problem by upgrading TensorFlow to a higher version?

qyyy (Author) commented Sep 13, 2018

I tried that when I tested the TensorFlow benchmark on the CIFAR-10 example. With TensorFlow versions above 1.4 (1.5, 1.7, 1.8), even using CUDA 9.0 + cuDNN 7, there were still errors due to the NVIDIA driver version.
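As a hedged diagnostic for this kind of mismatch: TensorFlow builds against CUDA 9.0 require NVIDIA driver >= 384.81, and inside a container nvidia-smi is only available when the host driver is mounted in. A quick check:

```shell
# Print the driver version the container actually sees; nvidia-smi is absent
# when the host driver path was not mounted into the container.
command -v nvidia-smi >/dev/null \
  && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
  || echo "nvidia-smi not found: no driver visible in this container"
```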

qyyy (Author) commented Sep 14, 2018

2018-09-14 10:36:49.520632: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-09-14 10:36:49.520675: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_posix: /usr/local/mpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_mmap: /usr/local/mpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_sysv: /usr/local/mpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)

It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS


It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[paigcr-a-gpu-1010:954] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

qyyy (Author) commented Sep 14, 2018

Now I can build the Docker image, but it seems that Open MPI was not installed correctly.

qyyy (Author) commented Sep 14, 2018

By the way, I can run the same code with the same data using TensorFlow 1.4 without MPI.

qyyy (Author) commented Sep 15, 2018

Maybe there is an error in the MPI base image; is there a way to test it?
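One minimal smoke test for the MPI base image, independent of TensorFlow, would be running a trivial non-MPI program under mpirun; the opal/orte init errors from the log above would surface here too. A sketch, assuming mpirun lives under the image's /usr/local/mpi prefix:

```shell
# Launch two trivial processes under mpirun; failures in opal_init/orte_init
# would abort this even though "hostname" itself needs no MPI.
command -v mpirun >/dev/null \
  && mpirun --allow-run-as-root -np 2 hostname \
  || echo "mpirun not on PATH (try /usr/local/mpi/bin/mpirun)"
```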

abuccts (Member) commented Sep 16, 2018

Hi @qyyy, did you submit the job to PAI or run the Docker image on the host?

tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

It seems the driver path wasn't mounted correctly in docker run.

tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist

It seems the GPU devices weren't mounted correctly in docker run.
Please also check that there is enough shared memory in Docker.

Because OpenMPI is built with CUDA, GPU problems will also cause MPI_Init to fail.
The MPI base image follows the OpenMPI installation part of the official CNTK image, and Dockerfile.example.cntk-mpi, which is built on top of the MPI base image, also works well with OpenMPI.
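The two symptoms above (missing /dev/nvidia0 and insufficient shared memory) can be checked directly inside the container; a hedged sketch:

```shell
# Check whether the GPU device nodes were mounted into the container.
ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes visible"
# Check the size of the shared-memory mount Open MPI's shmem components rely on.
df -h /dev/shm 2>/dev/null | tail -n 1
```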

qyyy (Author) commented Sep 17, 2018

I ran it on PAI. I tried running it again with 16 GB of memory, but neither the server nor the worker ran correctly; the error message is the same as above. I have tested the cntk-mpi example and it works fine, so I think there is still an error in the TensorFlow build process.

abuccts (Member) commented Sep 17, 2018

@qyyy Could you please try this simpler MPI example code? Here's an example of its JSON file:

{
  "jobName": "tensorflow-mpi",
  "image": "openpai/pai.example.tensorflow-mpi",
  "taskRoles": [
    {
      "name": "mpi",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 0,
      "command": "mpirun --allow-run-as-root -np 2 --host tf-0,tf-1 python tf-mpi.py",
      "minSucceededTaskCount": 1
    },
    {
      "name": "tf",
      "taskNumber": 2,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "/bin/bash"
    }
  ]
}

qyyy (Author) commented Sep 17, 2018

Where is the tf-mpi.py file?

abuccts (Member) commented Sep 17, 2018

Could you please try this simpler mpi example code ...

here's tf-mpi.py

qyyy (Author) commented Sep 18, 2018

Traceback (most recent call last):
File "tf-mpi.py", line 5, in
import tensorflow.contrib.mpi as mpi
ImportError: No module named mpi

qyyy (Author) commented Sep 18, 2018

I have tried this, but it doesn't work.

qyyy (Author) commented Sep 18, 2018

I have opened an issue on TensorFlow. Here is the issue.

abuccts (Member) commented Sep 18, 2018

import tensorflow.contrib.mpi as mpi
ImportError: No module named mpi

It's caused by the TensorFlow version; you can use tensorflow.contrib.mpi_collectives instead.
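A quick hedged check for that import fix: on TF 1.4-era builds the module is tensorflow.contrib.mpi_collectives rather than tensorflow.contrib.mpi, so the replacement import can be probed before editing the script.

```shell
# Probe whether the replacement module imports in the current TensorFlow build.
python3 - <<'EOF'
try:
    import tensorflow.contrib.mpi_collectives as mpi
    print("mpi_collectives: available")
except ImportError:
    print("mpi_collectives: not available in this TensorFlow build")
EOF
```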

I have tried those two TensorFlow MPI examples. The benchmark example succeeded with the grpc protocol but failed with the grpc+mpi protocol. However, OpenMPI itself worked fine in that Docker image. The problem may lie in the MPI extension in the TensorFlow compilation, which causes the "An error occurred in MPI_Init" issue. It might be an MPI version mismatch, or bugs in TensorFlow's MPI code.

There exist a few docs on how to build TensorFlow with OpenMPI, including the official readme. The only switch I can confirm is TF_NEED_MPI=1 (or answering "Do you wish to build TensorFlow with MPI support [y/N]") in the TensorFlow configuration. At present we can only build TensorFlow r1.4 with OpenMPI successfully.

I will figure out how to compile and use TensorFlow with OpenMPI, which may take some time.

qyyy (Author) commented Sep 19, 2018

By the way, Dockerfile.example.cntk-mpi cannot be built correctly either.
The error message is as follows:

# _/hdfs-mount
import cycle not allowed in test
package _/hdfs-mount (test)
        imports _/hdfs-mount

FAIL    _/hdfs-mount [setup failed]
make: *** [test] Error 1
Makefile:52: recipe for target 'test' failed

qyyy (Author) commented Sep 19, 2018

It seems that the hdfs-mount installation process is problematic in both Dockerfile.example.cntk-mpi and Dockerfile.example.cntk.
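One hedged reading of the error above: the "_/hdfs-mount" package path is what Go assigns to sources built outside GOPATH, and that synthetic path can confuse the test's import resolution. A sketch of the workaround, where the github.com/Microsoft import path is an assumption for illustration:

```shell
# Build the project from a proper import path under GOPATH instead of an
# arbitrary directory, so the package is not named "_/hdfs-mount".
export GOPATH="${GOPATH:-$HOME/go}"
# mkdir -p "$GOPATH/src/github.com/Microsoft"
# git clone https://github.com/Microsoft/hdfs-mount "$GOPATH/src/github.com/Microsoft/hdfs-mount"
# cd "$GOPATH/src/github.com/Microsoft/hdfs-mount" && make test
echo "GOPATH=$GOPATH"
```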

scarlett2018 (Member) commented
@abuccts - per discussion, the team won't support this Docker image anymore; please clean up the Docker image and related samples.

abuccts added a commit that referenced this issue Mar 18, 2019
Remove TensorFlow mpi example which cannot be run currently.
Closes #1336.
abuccts added a commit that referenced this issue Mar 19, 2019
Remove TensorFlow mpi example which cannot be run currently.
Closes #1336.
@abuccts abuccts closed this as completed Mar 19, 2019