This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU #8

Open
josai opened this issue Jun 7, 2019 · 12 comments

Comments

@josai

josai commented Jun 7, 2019

Caused by op 'model/h3/attn/truediv_1', defined at:
File "train.py", line 293, in
main()
File "train.py", line 138, in main
opt_grads = memory_saving_gradients.gradients(loss, train_vars)
File "C:\Users\The Atomizer\Desktop\text\gpt2\memory_saving_gradients.py", line 250, in gradients
copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in call
self.copy_ops(info)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in copy_ops
op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
[], input_types_, None, op_def_)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in init
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h3/attn/truediv_1 (defined at C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py:177) = RealDiv[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h3/attn/Exp_1, model/h3/attn/Sum_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
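
Following the hint in the error message, one way to get that allocation report in TensorFlow 1.x is to pass a RunOptions proto with report_tensor_allocations_upon_oom set to the session call. This is only a minimal sketch; the actual fetches and feed_dict used by train.py will differ, so the commented call below is a placeholder, not the repo's real training step:

import tensorflow as tf

# Ask TensorFlow 1.x to list the live tensor allocations when an OOM occurs,
# which shows which ops are holding the most GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Placeholder training step; substitute train.py's real fetches and feeds.
# _, loss_value = sess.run((opt_apply, loss),
#                          feed_dict={context: batch},
#                          options=run_options)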

@josai
Author

josai commented Jun 7, 2019

What's going on here? Am I out of memory? I can't get it to train.

@josai
Author

josai commented Jun 7, 2019

conda install -c anaconda cudnn==6.0.0 --yes

seemed to fix the problem...

@ghost

ghost commented Jun 10, 2019

@josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

@josai
Author

josai commented Jun 11, 2019

> @josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

Googling similar errors and trying their solutions until one worked.

@nshepperd
Owner

Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

@josai
Author

josai commented Jun 11, 2019

> Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

No, neither model was working until I conda-installed cuDNN. I am currently retraining the 345M model on a GTX 970 with several applications, including Chrome, in the background with no problems.

@iacoposk8

iacoposk8 commented Jul 27, 2019

I have a GTX 1060 6 GB and I also have this problem.
Searching on Google, I read that the batch size should be reduced, so I launched
PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt
but I got the same error.
I then changed this line in train.py:
return [data_sampler.sample(1024) for _ in range(args.batch_size)]
to:
return [data_sampler.sample(512) for _ in range(args.batch_size)]
But I don't know what this line does. How will the training change?
If this change is not good, how can I fix it?

@dji-transpire

Same problem. Are there any diagnostics we can run?

@nshepperd
Owner

iacoposk8, that change is one way to reduce the memory usage. You are basically shortening the model's memory there, by allowing it to remember only the last 512 words instead of the full 1024 during training. I'm not sure how much of an effect that would have on output quality.
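
For reference, the line quoted above sits in train.py's batch-sampling helper. Here is a rough sketch of the edit in context; the enclosing function is paraphrased rather than copied, so it may not match your copy of train.py exactly:

# In train.py: each training example is a window of tokens sampled from the
# dataset. Shrinking the window from 1024 to 512 roughly quarters the memory
# of each attention matrix, since those grow with the square of the sequence
# length (compare the [1,12,1024,1024] tensor in the OOM message above).
def sample_batch():
    # was: data_sampler.sample(1024)
    return [data_sampler.sample(512) for _ in range(args.batch_size)]

Any window length up to the model's context size of 1024 should be accepted by the sampler, so there is nothing special about powers of two here.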

@iacoposk8

Thanks for the answer. So can I use, for example, 895 as the value, or is a number like 128, 512, 1024, etc. better?

Another question: I am training a model for my language. In your opinion, how low should the loss be to get a good model without overfitting?

Last question: how can I generate texts that talk about a certain topic?

Thank you

nshepperd pushed a commit that referenced this issue Aug 27, 2019
@ProtoxiDe22

I'm having the same problem trying to train the 355M model on an RTX 2070 8 GB. Even with both --memory_saving_gradients and --optimizer sgd I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1024,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node model/h23/attn/MatMul_1_1}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h23/attn/truediv_1, model/h23/attn/transpose_2_1)]]

I didn't use conda, but I have cuDNN installed manually (cuDNN v7.6.2.24 on CUDA 9.0).
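
One more knob, not mentioned in this thread, that sometimes buys headroom on 8 GB cards is letting TensorFlow 1.x grow its GPU memory pool on demand instead of reserving nearly the whole card at startup. A minimal sketch, independent of train.py (whether your version of train.py already sets this depends on the copy you have):

import tensorflow as tf

# Allocate GPU memory incrementally instead of reserving almost all VRAM up
# front, which leaves room for cuDNN workspaces and other processes.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build the graph and run the training loop here

This does not shrink the model itself, so it only helps when the failure is marginal; if the 355M graph genuinely needs more than 8 GB at that sequence length, reducing the sample length as discussed above is the more reliable fix.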

@schematical

> I have a GTX 1060 6 GB and I also have this problem.
> Searching on Google, I read that the batch size should be reduced, so I launched
> PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt
> but I got the same error.
> I then changed this line in train.py:
> return [data_sampler.sample(1024) for _ in range(args.batch_size)]
> to:
> return [data_sampler.sample(512) for _ in range(args.batch_size)]
> But I don't know what this line does. How will the training change?
> If this change is not good, how can I fix it?

This fix worked for me.
