
Training model on Quora dataset #8

Open · TheMnBN opened this issue Jul 28, 2020 · 4 comments

TheMnBN commented Jul 28, 2020

Hi all,

I ran into an OOM error while trying to train the model on a single-GPU workstation (GTX 1080 Ti). A snippet of the error log is shown below. Has anyone been successful in training this model with similar hardware?
Thanks in advance!

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[512,98,200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[Node: block-1/align/sub_1/mul_5 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](block-1/align/sub_1/mul_2, block-1/align/sub_1/add_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Node: stack_8/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6954_stack_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Edit: Decreasing the batch size to 64 (the default was 512) fixed this issue. I will try to find the maximum batch size the 1080 Ti can handle.
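
For reference, the hint in the log refers to TensorFlow 1.x RunOptions. A minimal, self-contained sketch of enabling that flag; the placeholder graph here is purely hypothetical and just stands in for the repo's actual model and training op:

    import tensorflow as tf

    # Toy graph standing in for the real model (hypothetical).
    x = tf.placeholder(tf.float32, shape=[None, 200])
    loss = tf.reduce_sum(tf.layers.dense(x, 64))

    # Ask TF to dump the list of live tensors if an allocation fails with OOM.
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Pass the options into the session.run call that triggers the OOM.
        sess.run(loss, feed_dict={x: [[0.0] * 200]}, options=run_options)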

@hitvoice (Collaborator)

A single V100 (32 GB) can run this experiment. For a 16 GB GPU, setting the batch size to 488 led to similar results in my previous runs.

Decreasing the batch size to 64 may result in worse performance.
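
If memory forces a small per-step batch, gradient accumulation is one common way to keep the effective batch size close to the default. This is not part of this repo; the following is a generic PyTorch-style sketch with a hypothetical toy model and random stand-in data:

    import torch

    # Hypothetical toy model, optimizer, and loss; the real ones would go here.
    model = torch.nn.Linear(200, 2)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()

    accumulation_steps = 8  # 8 x 64 = effective batch size of 512
    optimizer.zero_grad()
    for step in range(accumulation_steps):
        inputs = torch.randn(64, 200)           # stand-in mini-batch
        targets = torch.randint(0, 2, (64,))    # stand-in labels
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()                         # gradients accumulate across steps
    optimizer.step()                            # one update for the "virtual" large batch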

@hitvoice (Collaborator)

Sorry for the late reply....

@Jch520

Jch520 commented Sep 10, 2021

Hello, sorry to bother you. Would it be convenient for you to provide the code for counting the network parameters? Thank you!

@hitvoice (Collaborator)

Something like sum(p.numel() for p in model.parameters() if p.requires_grad)
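
To make that one-liner concrete, here is a self-contained sketch with a hypothetical toy model standing in for the trained model:

    import torch

    # Hypothetical stand-in for the actual model.
    model = torch.nn.Sequential(
        torch.nn.Linear(200, 150),
        torch.nn.ReLU(),
        torch.nn.Linear(150, 2),
    )

    # Count only trainable parameters (those updated by the optimizer).
    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {num_params}")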
