Advice on Multi-GPU support? #121
Comments
Thanks for the effort! You will first need to dump the dataset to some TFRecord file; tf.slim has great support for multi-GPU training. I have been trying to do this for a long time but haven't really gotten to it yet.
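For reference, the TFRecord-dumping step mentioned above usually looks something like this minimal TF 1.x sketch; the `samples` list and the feature keys are placeholders, not this repository's actual format:

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# `samples` stands in for whatever loader yields (encoded_image, label) pairs
samples = [(b'\x00fake_jpeg_bytes', 1), (b'\x00fake_jpeg_bytes', 0)]

with tf.python_io.TFRecordWriter('train.tfrecord') as writer:
    for image_bytes, label in samples:
        # serialize each sample as a tf.train.Example protobuf record
        example = tf.train.Example(features=tf.train.Features(feature={
            'image/encoded': _bytes_feature(image_bytes),
            'image/label': _int64_feature(label),
        }))
        writer.write(example.SerializeToString())
```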
I see.
Can you give some suggestions on how to use tf.slim to implement a multi-GPU version based on this branch? It seems tricky because your network is defined in a class...
It seems that ...
I think tf.py_func may be a bottleneck. But I am not sure whether it supports multi-GPU.
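For context, tf.py_func splices a plain Python function into the TF 1.x graph; a minimal sketch of the pattern follows, with `py_nms` as a stand-in for whatever CPU-side routine the code actually wraps. The wrapped function runs in the Python interpreter, serialized by the GIL and pinned to the host, which is why it can become a bottleneck when several GPU towers all call it:

```python
import numpy as np
import tensorflow as tf

def py_nms(boxes, scores, thresh):
    # placeholder body: keep every box; a real NMS would filter by IoU
    return np.arange(boxes.shape[0], dtype=np.int32)

boxes = tf.placeholder(tf.float32, [None, 4])
scores = tf.placeholder(tf.float32, [None])
# py_nms receives NumPy arrays and must return the declared dtype (int32)
keep = tf.py_func(py_nms, [boxes, scores, tf.constant(0.7)], tf.int32)
```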
So has anyone implemented a version that supports multi-GPU?
...so why is py_func a bottleneck? What is the problem?
Are your GPUs the same type?
I recently wrote one with multi-GPU support.
Wow, thanks so much @ppwwyyxx! This looks amazing! Closing this.
It seems like the errors are caused by the nms() used in tf.py_func. When I changed it to py_nms, the errors were resolved. However, the running time increased a lot.
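The py_nms referred to here is presumably a pure-NumPy NMS along the lines of the classic py_cpu_nms from the Fast R-CNN codebase; a sketch of that algorithm is below. It is quadratic in the number of boxes in the worst case, which is consistent with the slowdown reported versus a compiled NMS:

```python
import numpy as np

def py_nms(dets, thresh):
    # dets: (N, 5) array of [x1, y1, x2, y2, score]
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # discard boxes that overlap box i above the threshold
        order = order[np.where(iou <= thresh)[0] + 1]
    return keep
```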
Hi Ender, thanks for your work!
There have been some requests for multi-GPU support (e.g. #51). I am now trying to write a multi-GPU version based on your code.
However, after looking into the code, it seems that the current structure does not support multi-GPU well. For example, if I modify train_val.py in this way:
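A rough sketch of what such a per-GPU tower loop might look like in plain TF 1.x; the `Network` class here is a minimal hypothetical stand-in for the repo's class-based network, not its actual API. The sketch also shows the usual workaround for the clash described below: build one instance per tower with shared variables, rather than reusing a single instance whose `self.image` would be overwritten on every tower:

```python
import tensorflow as tf

class Network(object):
    """Minimal stand-in for the repo's class-based network (hypothetical)."""
    def create_architecture(self, mode):
        # each instance owns its own input placeholder; a single shared
        # instance would overwrite self.image on every tower
        self.image = tf.placeholder(tf.float32, [1, None, None, 3])
        conv = tf.layers.conv2d(self.image, 8, 3, name='conv1')
        return tf.reduce_mean(conv)  # dummy loss for the sketch

num_gpus = 2  # assumed
optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
tower_grads = []
for gpu_id in range(num_gpus):
    with tf.device('/gpu:%d' % gpu_id):
        # share weights across towers, but build a fresh tower per GPU
        with tf.variable_scope(tf.get_variable_scope(), reuse=(gpu_id > 0)):
            loss = Network().create_architecture('TRAIN')
            tower_grads.append(optimizer.compute_gradients(loss))
# the per-tower gradients would then be averaged and applied once
```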
It cannot work, because the network class has only one "self.image", so an error will be thrown.
Can you give any advice on how to implement a multi-GPU version of this code?
Many thanks.