
Advice on multi-GPU support? #121

Closed
RichardKov opened this issue Jun 7, 2017 · 10 comments

@RichardKov

Hi Ender, thanks for your work!

There have been some requests for multi-GPU support (e.g. #51). I am now trying to write a multi-GPU version based on your code.

However, after looking into the code, it seems that the current structure does not support multi-GPU well. For example, if I modify train_val.py in this way:

      losses, tower_grads, scopes = [], [], []
      with tf.variable_scope(tf.get_variable_scope()):
        for i in range(2):
            with tf.device("/gpu:" + str(i)):
                with tf.name_scope("tower_" + str(i)) as scope:
                    # Build the main computation graph
                    layers = self.net.create_architecture(sess, 'TRAIN', self.num_classes, tag='default',
                                                          anchor_scales=cfg.ANCHOR_SCALES,
                                                          anchor_ratios=cfg.ANCHOR_RATIOS)
                    # Define the loss
                    loss = layers['total_loss']
                    losses.append(loss)

                    # Reuse variables for the towers after the first one
                    tf.get_variable_scope().reuse_variables()

                    grads = self.optimizer.compute_gradients(loss)

                    tower_grads.append(grads)
                    scopes.append(scope)
      # Average the per-tower gradients wrt the loss
      gvs = self.average_gradients(tower_grads)
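
(For reference, average_gradients here presumably follows the standard multi-tower helper from TensorFlow's CIFAR-10 example; a minimal sketch, assuming no tower ever returns a None gradient:)

      import tensorflow as tf

      def average_gradients(tower_grads):
          # tower_grads: one [(grad, var), ...] list per tower; the vars are shared.
          average_grads = []
          for grad_and_vars in zip(*tower_grads):
              # grad_and_vars pairs the same variable's gradient across all towers.
              grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
              grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
              average_grads.append((grad, grad_and_vars[0][1]))
          return average_grads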

This cannot work, because the network class has only one "self.image", so an error of

      InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'tower_0/Placeholder' with dtype float

will be thrown.
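
(A minimal sketch of one possible workaround, assuming one network instance per tower in a hypothetical self.nets list, each exposing its own _image/_im_info/_gt_boxes placeholders; these names are assumptions, not this repo's API:)

      feed_dict = {}
      for net, blobs in zip(self.nets, per_gpu_blobs):  # per_gpu_blobs: one minibatch per tower (assumed)
          feed_dict[net._image] = blobs['data']
          feed_dict[net._im_info] = blobs['im_info']
          feed_dict[net._gt_boxes] = blobs['gt_boxes']
      # Feed every tower at once so no tower's placeholder is left unfed
      loss_val, _ = sess.run([total_loss, train_op], feed_dict=feed_dict)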

Can you give any advice on how to implement a multi-GPU version of this code?

Many thanks.

@endernewton
Owner

Thanks for the effort! You will first need to dump the dataset to a tfrecord; tf-slim has great support for multi-GPU training. I have been meaning to do this for a long time but haven't really gotten to it yet.
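
(A minimal sketch of dumping detection examples to a TFRecord with the TF1 API; roidb_iterator and the feature keys below are assumptions for illustration, not this repo's format:)

      import tensorflow as tf

      def _bytes(v):
          return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))

      def _floats(v):
          return tf.train.Feature(float_list=tf.train.FloatList(value=v))

      writer = tf.python_io.TFRecordWriter('train.tfrecord')
      for img_bytes, gt_boxes in roidb_iterator():  # hypothetical iterator over the imdb
          example = tf.train.Example(features=tf.train.Features(feature={
              'image/encoded': _bytes(img_bytes),                     # JPEG-encoded bytes
              'image/gt_boxes': _floats(gt_boxes.ravel().tolist()),   # flattened [x1, y1, x2, y2, cls] rows
          }))
          writer.write(example.SerializeToString())
      writer.close()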

@RichardKov
Author

I see.
There seems to be a branch that supports tfrecord here: philokey@3297a46
But we can't get summaries on the validation set if we build the network in this way:

      layers = self.net.create_architecture(sess, 'TRAIN', self.imdb.num_classes,
                                            image=image,
                                            im_info=tf.expand_dims(im_shape[1:], dim=0),
                                            gt_boxes=gt_boxes, tag='default',
                                            anchor_scales=cfg.ANCHOR_SCALES,
                                            anchor_ratios=cfg.ANCHOR_RATIOS)

Can you give some suggestions on how to use tf-slim to implement a multi-GPU version based on this branch? It seems tricky because your network is defined in a class...
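
(For context, the image and gt_boxes tensors passed to create_architecture above would come from a queue-based TF1 input pipeline, roughly like this sketch; the filename and feature keys are assumed to match the writer sketch above:)

      filename_queue = tf.train.string_input_producer(['train.tfrecord'])
      reader = tf.TFRecordReader()
      _, serialized = reader.read(filename_queue)
      features = tf.parse_single_example(serialized, features={
          'image/encoded': tf.FixedLenFeature([], tf.string),
          'image/gt_boxes': tf.VarLenFeature(tf.float32),
      })
      image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
      gt_boxes = tf.reshape(
          tf.sparse_tensor_to_dense(features['image/gt_boxes']), [-1, 5])
      # Remember tf.train.start_queue_runners(sess) before running these tensors.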

@philokey
Contributor

philokey commented Jun 12, 2017

It seems that py_func doesn't support multiple GPUs yet. I tried to use multiple GPUs with slim but failed.

@RichardKov
Author

I think py_func may be the bottleneck, but I am not sure whether it supports multiple GPUs.
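
(For context: tf.py_func wraps a plain Python function as a graph op, and such ops always run on the CPU inside the driver's Python process, under the GIL, so py_func calls from different towers cannot be placed on the GPUs. A minimal illustration with a toy scoring function:)

      import numpy as np
      import tensorflow as tf

      def toy_rank(scores):
          # Plain numpy work; executes in the Python process, never on a GPU.
          return np.argsort(scores)[::-1].astype(np.int64)

      scores = tf.constant([0.9, 0.1, 0.5])
      keep = tf.py_func(toy_rank, [scores], tf.int64)  # always pinned to CPU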

@crh19970307

So has anyone implemented a version that supports multiple GPUs?

@Caprimulgusindicus

...so why is py_func a bottleneck? What is the problem?

@engineer1109

Are your GPUs the same type?

@ppwwyyxx
Contributor

I recently wrote one with multi-GPU support.
https://github.com/ppwwyyxx/tensorpack/tree/master/examples/FasterRCNN

@endernewton
Owner

Wow, thanks so much @ppwwyyxx! This looks amazing! Closing this.

@Atmegal

Atmegal commented Sep 14, 2019

It seems the errors are caused by the nms() used inside tf.py_func. When I changed it to py_nms, the errors were resolved. However, the running time increased a lot.
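
(For reference, py_nms here presumably refers to the pure-numpy CPU NMS from the py-faster-rcnn codebase; a sketch of that algorithm over [x1, y1, x2, y2, score] rows:)

      import numpy as np

      def py_nms(dets, thresh):
          x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
          scores = dets[:, 4]
          areas = (x2 - x1 + 1) * (y2 - y1 + 1)
          order = scores.argsort()[::-1]       # highest score first
          keep = []
          while order.size > 0:
              i = order[0]
              keep.append(i)
              # Intersection of the top box with every remaining box
              xx1 = np.maximum(x1[i], x1[order[1:]])
              yy1 = np.maximum(y1[i], y1[order[1:]])
              xx2 = np.minimum(x2[i], x2[order[1:]])
              yy2 = np.minimum(y2[i], y2[order[1:]])
              w = np.maximum(0.0, xx2 - xx1 + 1)
              h = np.maximum(0.0, yy2 - yy1 + 1)
              inter = w * h
              ovr = inter / (areas[i] + areas[order[1:]] - inter)
              # Keep only boxes overlapping the top box by at most thresh
              order = order[np.where(ovr <= thresh)[0] + 1]
          return keep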
