This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Can not build Dockerfile.example.tensorflow-mpi #1336

Closed
qyyy opened this issue Sep 12, 2018 · 21 comments

qyyy (Contributor) commented Sep 12, 2018

I'd like to build the Dockerfile.example.tensorflow-mpi image myself, following these steps:
1. Build the Dockerfile.build.mpi image.
2. Build the Dockerfile.example.tensorflow-mpi image.
But I ran into errors during step 2: I cannot execute the command bazel build -c opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" tensorflow/tools/pip_package:build_pip_package
There were several errors, and I fixed some of them by running:
1. apt-get update && apt-get -y upgrade && apt-get install -y git
2. pip install numpy
However, there is still one error I cannot solve: "openmpi/ompi/mpi/cxx/mpicxx.h: No such file or directory"
I have tried several methods to solve it according to this page:
1. Change the Bazel version to 0.5.4.
2. export OMPI_SKIP_MPICXX=1
3. export CC_OPT_FLAGS="-DOMPI_SKIP_MPICXX=1 -march=native"
But none of them work.
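For reference, the attempted environment-variable workarounds (items 2 and 3 above) amount to telling Open MPI's headers to skip the deprecated C++ bindings before the build; a minimal sketch, assuming the TensorFlow source tree and Bazel are already set up:

```shell
# Skip Open MPI's C++ bindings so mpicxx.h is never pulled in during compilation.
export OMPI_SKIP_MPICXX=1
export CC_OPT_FLAGS="-DOMPI_SKIP_MPICXX=1 -march=native"
echo "OMPI_SKIP_MPICXX=$OMPI_SKIP_MPICXX"

# The build itself (requires Bazel and the TensorFlow source tree):
# bazel build -c opt --config=cuda \
#     --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
#     tensorflow/tools/pip_package:build_pip_package
```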

qyyy (Author) commented Sep 12, 2018

By the way, is "OMPI_SKIP_MPICXX=1" appropriate? It means the C++ bindings are skipped when building.

abuccts (Member) commented Sep 13, 2018

  1. Building TensorFlow r1.4 with Bazel 0.11.0 doesn't work. We can cherry-pick tensorflow/tensorflow@3f57956 to fix the Bazel version-check bug, or downgrade Bazel to 0.5.4.
  2. Adding OMPI_SKIP_MPICXX is OK; please refer to tensorflow/tensorflow@f73d7c9#diff-7f5f80d91bf584c6c77b2d1bf874ee9b. TensorFlow r1.4 is missing that change.

I will submit a PR after I have tested it.
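The two options above can be sketched as follows; the Bazel installer filename follows Bazel's release naming and is an assumption here, not something from this thread:

```shell
# Option 1: in the tensorflow source tree, cherry-pick the version-check fix:
#   git cherry-pick 3f57956
# Option 2: pin Bazel to 0.5.4 instead:
BAZEL_VERSION=0.5.4
# wget "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
# chmod +x "bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
# "./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh"
echo "pinned Bazel version: ${BAZEL_VERSION}"
```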

qyyy (Author) commented Sep 13, 2018

OK. After the Docker image is fixed, I'll run the MPI TensorFlow example again.

fanyangCS (Contributor) commented

Is it possible to solve the problem by upgrading TensorFlow to a higher version?

qyyy (Author) commented Sep 13, 2018

I tried that when I tested the TensorFlow benchmark on the CIFAR-10 example. With TensorFlow versions above 1.4 (1.5, 1.7, 1.8), even using CUDA 9.0 + cuDNN 7, there were still errors due to the NVIDIA driver version.
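As a hedged diagnostic for this kind of mismatch: TensorFlow builds against CUDA 9.0 require NVIDIA driver >= 384.81, and inside a container nvidia-smi is only available when the host driver is mounted in. A quick check:

```shell
# Print the driver version the container actually sees; nvidia-smi is absent
# when the host driver path was not mounted into the container.
command -v nvidia-smi >/dev/null \
  && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
  || echo "nvidia-smi not found: no driver visible in this container"
```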

qyyy (Author) commented Sep 14, 2018

2018-09-14 10:36:49.520632: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-09-14 10:36:49.520675: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_posix: /usr/local/mpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_mmap: /usr/local/mpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[paigcr-a-gpu-1010:00954] mca: base: component_find: unable to open /usr/local/mpi/lib/openmpi/mca_shmem_sysv: /usr/local/mpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)

It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS


It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[paigcr-a-gpu-1010:954] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

qyyy (Author) commented Sep 14, 2018

Now I can build the Docker image, but it seems that Open MPI was not installed correctly.

qyyy (Author) commented Sep 14, 2018

By the way, I can run the same code with the same data using TensorFlow 1.4 without MPI.

qyyy (Author) commented Sep 15, 2018

Maybe there is an error in the MPI base image; is there a way to test it?
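One minimal smoke test for the MPI base image, independent of TensorFlow, would be running a trivial non-MPI program under mpirun; the opal/orte init errors from the log above would surface here too. A sketch, assuming mpirun lives under the image's /usr/local/mpi prefix:

```shell
# Launch two trivial processes under mpirun; failures in opal_init/orte_init
# would abort this even though "hostname" itself needs no MPI.
command -v mpirun >/dev/null \
  && mpirun --allow-run-as-root -np 2 hostname \
  || echo "mpirun not on PATH (try /usr/local/mpi/bin/mpirun)"
```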

abuccts (Member) commented Sep 16, 2018

Hi @qyyy, did you submit the job to PAI or run the Docker image on the host?

tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

It seems the driver path wasn't mounted correctly in docker run.

tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist

It seems the GPU devices weren't mounted correctly in docker run.
Please also check that there is enough shared memory in Docker.

Because OpenMPI is built with CUDA, GPU problems will also cause MPI_Init to fail.
The MPI base image follows the OpenMPI installation part of the official CNTK image, and Dockerfile.example.cntk-mpi, which is built on top of the MPI base image, also works well with OpenMPI.
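The two symptoms above (missing /dev/nvidia0 and insufficient shared memory) can be checked directly inside the container; a hedged sketch:

```shell
# Check whether the GPU device nodes were mounted into the container.
ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes visible"
# Check the size of the shared-memory mount Open MPI's shmem components rely on.
df -h /dev/shm 2>/dev/null | tail -n 1
```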

qyyy (Author) commented Sep 17, 2018

I ran it on PAI. I tried running it again with 16 GB of memory, but neither the server nor the worker ran correctly; the error message is the same as above. I have tested the cntk-mpi example and it works fine, so I think there is still an error in the TensorFlow build process.

abuccts (Member) commented Sep 17, 2018

@qyyy Could you please try this simpler MPI example code? Here's an example of its JSON file:

{
  "jobName": "tensorflow-mpi",
  "image": "openpai/pai.example.tensorflow-mpi",
  "taskRoles": [
    {
      "name": "mpi",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 0,
      "command": "mpirun --allow-run-as-root -np 2 --host tf-0,tf-1 python tf-mpi.py",
      "minSucceededTaskCount": 1
    },
    {
      "name": "tf",
      "taskNumber": 2,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "/bin/bash"
    }
  ]
}

qyyy (Author) commented Sep 17, 2018

Where is the tf-mpi.py file?

abuccts (Member) commented Sep 17, 2018

Could you please try this simpler mpi example code ...

here's tf-mpi.py

qyyy (Author) commented Sep 18, 2018

Traceback (most recent call last):
File "tf-mpi.py", line 5, in
import tensorflow.contrib.mpi as mpi
ImportError: No module named mpi

qyyy (Author) commented Sep 18, 2018

I have tried this, but it doesn't work.

qyyy (Author) commented Sep 18, 2018

I have opened an issue on TensorFlow. Here is the issue.

abuccts (Member) commented Sep 18, 2018

import tensorflow.contrib.mpi as mpi
ImportError: No module named mpi

It's caused by the TensorFlow version; you can use tensorflow.contrib.mpi_collectives instead.
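A quick hedged check for that import fix: on TF 1.4-era builds the module is tensorflow.contrib.mpi_collectives rather than tensorflow.contrib.mpi, so the replacement import can be probed before editing the script.

```shell
# Probe whether the replacement module imports in the current TensorFlow build.
python3 - <<'EOF'
try:
    import tensorflow.contrib.mpi_collectives as mpi
    print("mpi_collectives: available")
except ImportError:
    print("mpi_collectives: not available in this TensorFlow build")
EOF
```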

I have tried those two TensorFlow MPI examples. The benchmark example succeeded with the grpc protocol but failed with the grpc+mpi protocol. However, OpenMPI itself worked fine in that Docker image. The problem may lie in the MPI extension in the TensorFlow compilation, which causes the "An error occurred in MPI_Init" issue. It might be an MPI version mismatch, or bugs in TensorFlow's MPI code.

There exist a few docs on how to build TensorFlow with OpenMPI, including the official readme. The only switch I can confirm is TF_NEED_MPI=1 (or answering "Do you wish to build TensorFlow with MPI support [y/N]") in the TensorFlow configuration. At present we can only build TensorFlow r1.4 with OpenMPI successfully.

I will figure out how to compile and use TensorFlow with OpenMPI, which may take some time.

qyyy (Author) commented Sep 19, 2018

By the way, Dockerfile.example.cntk-mpi cannot be built correctly either.
The error message is as follows:

# _/hdfs-mount
import cycle not allowed in test
package _/hdfs-mount (test)
        imports _/hdfs-mount

FAIL    _/hdfs-mount [setup failed]
make: *** [test] Error 1
Makefile:52: recipe for target 'test' failed

qyyy (Author) commented Sep 19, 2018

It seems that the hdfs-mount installation process is problematic in both Dockerfile.example.cntk-mpi and Dockerfile.example.cntk.
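One hedged reading of the error above: the "_/hdfs-mount" package path is what Go assigns to sources built outside GOPATH, and that synthetic path can confuse the test's import resolution. A sketch of the workaround, where the github.com/Microsoft import path is an assumption for illustration:

```shell
# Build the project from a proper import path under GOPATH instead of an
# arbitrary directory, so the package is not named "_/hdfs-mount".
export GOPATH="${GOPATH:-$HOME/go}"
# mkdir -p "$GOPATH/src/github.com/Microsoft"
# git clone https://github.com/Microsoft/hdfs-mount "$GOPATH/src/github.com/Microsoft/hdfs-mount"
# cd "$GOPATH/src/github.com/Microsoft/hdfs-mount" && make test
echo "GOPATH=$GOPATH"
```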

scarlett2018 (Member) commented
@abuccts - per discussion, the team won't support this Docker image anymore; please clean up the Docker image and related samples.

abuccts added a commit that referenced this issue Mar 18, 2019
Remove TensorFlow mpi example which cannot be run currently.
Closes #1336.
abuccts added a commit that referenced this issue Mar 19, 2019
Remove TensorFlow mpi example which cannot be run currently.
Closes #1336.
@abuccts abuccts closed this as completed Mar 19, 2019