Is this the GPU-cluster version of Caffe? #9
Comments
Yes, it can.
Thanks
Can you provide installation instructions for OpenMPI?
Can you give a code snippet in Python showing how to set multiple GPU devices? I find that caffe::SetDevice accepts only a single integer.
Multi-GPU configuration is done through the command line. The Python interface cannot set multiple GPU devices.
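For illustration, a minimal sketch of the command-line launch this reply refers to; the binary path and solver filename are placeholders, and, per the discussion later in this thread, the per-process GPU IDs are listed in the solver settings rather than set from Python:

```sh
# Hedged sketch: start two MPI processes, one per GPU; each process reads its
# device ID from the solver settings. Paths and filenames are assumptions.
mpirun -np 2 ./build/install/bin/caffe train --solver=solver.prototxt
```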
Hi @zimenglan-sysu-512, to install OpenMPI, please see https://www.open-mpi.org/faq/?category=building#easy-build. In the configure step, add the --enable-mpi-thread-multiple option (the flag referenced later in this thread) for optimal performance.
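A minimal sketch of that build, following the linked "easy build" FAQ; the version placeholder and install prefix are assumptions:

```sh
# Unpack an OpenMPI source release, then configure with the thread-multiple
# option mentioned above and install.
tar xf openmpi-<version>.tar.gz
cd openmpi-<version>
./configure --prefix=/usr/local/openmpi --enable-mpi-thread-multiple
make -j4
sudo make install
```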
Excellent work!
May I ask how this data-parallel approach differs, in implementation and efficiency, from the one in Caffe master? (http://caffe.berkeleyvision.org/tutorial/interfaces.html: "Parallelism: the -gpu flag to the caffe tool can take a comma separated list of IDs to run on multiple GPUs. A solver and net will be instantiated for each GPU so the batch size is effectively multiplied by the number of GPUs. To reproduce single GPU training, reduce the batch size in the network definition accordingly.")
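For reference, the stock usage the quoted documentation describes (the solver filename is a placeholder):

```sh
# BVLC Caffe master: one solver and net is instantiated per listed GPU, so the
# effective batch size is the prototxt batch size times the number of GPUs.
./build/tools/caffe train --solver=solver.prototxt --gpu 0,1
```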
@sunnyxiaohu
@yjxiong I have encountered "unrecognized options: --enable-mpi-thread-multiple". How can I solve it?
@pkuCactus
May I ask: if the cluster uses Intel MPI, can this be used? I would like to switch to OpenMPI, but I do not know exactly how, because CMake always auto-detects Intel MPI, and my edits in ccmake may be wrong.
@zzy123abc Intel MPI is not tested. You can manually modify the cache variables (search for the MPI entries in ccmake) to point the build at OpenMPI.
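A hedged sketch of setting those cache variables from the command line instead of ccmake; the OpenMPI install prefix is an assumption, while MPI_C_COMPILER and MPI_CXX_COMPILER are the standard FindMPI cache entries:

```sh
# Point CMake's MPI detection at the OpenMPI wrappers instead of Intel MPI.
cmake .. \
  -DMPI_C_COMPILER=/usr/local/openmpi/bin/mpicc \
  -DMPI_CXX_COMPILER=/usr/local/openmpi/bin/mpicxx
```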
Thanks. May I ask whether you tested single-node multi-GPU or multi-node multi-GPU? As someone asked above: how can gpu0 and gpu1 on node gpu02 work together with gpu0 and gpu1 on node gpu03? Writing 0,1,2,3 in the solver settings does not seem to work; can I change it to 0,1,0,1?
Yes. Just as you said, [0, 1, 0, 1] works.
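A hedged sketch of such a two-node launch; the hostnames gpu02 and gpu03 come from the question above, while the binary path, solver filename, and host list syntax are assumptions:

```sh
# Four MPI ranks, two per node; each rank picks its GPU from the solver's
# [0, 1, 0, 1] device list.
mpirun -np 4 -host gpu02,gpu02,gpu03,gpu03 \
  ./build/install/bin/caffe train --solver=solver.prototxt
```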
Hi @yjxiong, I ran into a problem where MPI mode is disabled. I have one PC with multiple GPUs. As you can see in the log above, the program runs in non-parallel mode. The problem should be that ...
I fixed the problem.
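Since the fix itself is not spelled out above, here is a hedged guess, assuming this fork exposes a USE_MPI switch in its CMake configuration (an assumption, not confirmed in this thread): rebuild with MPI support enabled so the tool stops falling back to non-parallel mode.

```sh
# Reconfigure and rebuild with MPI enabled (USE_MPI is an assumed option name).
cmake .. -DUSE_MPI=ON
make -j8
```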
Can this synchronized batch norm be used in the one-device-multi-GPU setting?
Can this version of Caffe run on a GPU cluster? If so, are there any requirements for the cluster? Thanks.