This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU #8

Open
josai opened this issue Jun 7, 2019 · 12 comments

Comments

@josai

josai commented Jun 7, 2019

Caused by op 'model/h3/attn/truediv_1', defined at:
File "train.py", line 293, in
main()
File "train.py", line 138, in main
opt_grads = memory_saving_gradients.gradients(loss, train_vars)
File "C:\Users\The Atomizer\Desktop\text\gpt2\memory_saving_gradients.py", line 250, in gradients
copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in call
self.copy_ops(info)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in copy_ops
op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
[], input_types_, None, op_def_)
File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in init
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h3/attn/truediv_1 (defined at C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py:177) = RealDiv[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h3/attn/Exp_1, model/h3/attn/Sum_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
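
Following the hint in the error message, one way to get that allocation report in TensorFlow 1.x is to pass a RunOptions proto with report_tensor_allocations_upon_oom set to the session call. This is only a minimal sketch; the actual fetches and feed_dict used by train.py will differ, so the commented call below is a placeholder, not the repo's real training step:

import tensorflow as tf

# Ask TensorFlow 1.x to list the live tensor allocations when an OOM occurs,
# which shows which ops are holding the most GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Placeholder training step; substitute train.py's real fetches and feeds.
# _, loss_value = sess.run((opt_apply, loss),
#                          feed_dict={context: batch},
#                          options=run_options)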

@josai
Author

josai commented Jun 7, 2019

What's going on here? Am I out of memory? I can't get it to train.

@josai
Author

josai commented Jun 7, 2019

conda install -c anaconda cudnn==6.0.0 --yes

seemed to fix the problem...

@ghost

ghost commented Jun 10, 2019

@josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

@josai
Author

josai commented Jun 11, 2019

> @josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

Googling similar errors and trying their solutions until one worked.

@nshepperd
Owner

Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

@josai
Author

josai commented Jun 11, 2019

> Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

No, neither model was working until I conda-installed cuDNN. I am currently retraining the 345M model on a GTX 970 with several applications, including Chrome, in the background with no problems.

@iacoposk8

iacoposk8 commented Jul 27, 2019

I have a GTX 1060 6 GB and I also have this problem.
Searching on Google, I read that the batch size should be reduced, so I launched
PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt
but I got the same error.
I then changed this line in train.py:
return [data_sampler.sample(1024) for _ in range(args.batch_size)]
to:
return [data_sampler.sample(512) for _ in range(args.batch_size)]
But I don't know what this line does. How will the training change?
If this change is not good, how can I fix it?

@dji-transpire

Same problem. Are there any diagnostics we can run?

@nshepperd
Owner

iacoposk8, that change is one way to reduce the memory usage. You are basically shortening the model's memory there, by allowing it to remember only the last 512 words instead of the full 1024 during training. I'm not sure how much of an effect that would have on output quality.
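
For reference, the line quoted above sits in train.py's batch-sampling helper. Here is a rough sketch of the edit in context; the enclosing function is paraphrased rather than copied, so it may not match your copy of train.py exactly:

# In train.py: each training example is a window of tokens sampled from the
# dataset. Shrinking the window from 1024 to 512 roughly quarters the memory
# of each attention matrix, since those grow with the square of the sequence
# length (compare the [1,12,1024,1024] tensor in the OOM message above).
def sample_batch():
    # was: data_sampler.sample(1024)
    return [data_sampler.sample(512) for _ in range(args.batch_size)]

Any window length up to the model's context size of 1024 should be accepted by the sampler, so there is nothing special about powers of two here.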

@iacoposk8

Thanks for the answer. So can I use, for example, 895 as the value, or is a number like 128, 512, 1024, etc. better?

Another question: I am training a model for my language. In your opinion, how low should the loss be to get a good model without overfitting?

Last question: how can I generate texts that talk about a certain topic?

Thank you

nshepperd pushed a commit that referenced this issue Aug 27, 2019
@ProtoxiDe22

I'm having the same problem trying to train the 355M model on an RTX 2070 8 GB. Even with both --memory_saving_gradients and --optimizer sgd I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1024,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node model/h23/attn/MatMul_1_1}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h23/attn/truediv_1, model/h23/attn/transpose_2_1)]]

I didn't use conda, but I have cuDNN installed manually (cuDNN v7.6.2.24 on CUDA 9.0).
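
One more knob, not mentioned in this thread, that sometimes buys headroom on 8 GB cards is letting TensorFlow 1.x grow its GPU memory pool on demand instead of reserving nearly the whole card at startup. A minimal sketch, independent of train.py (whether your version of train.py already sets this depends on the copy you have):

import tensorflow as tf

# Allocate GPU memory incrementally instead of reserving almost all VRAM up
# front, which leaves room for cuDNN workspaces and other processes.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build the graph and run the training loop here

This does not shrink the model itself, so it only helps when the failure is marginal; if the 355M graph genuinely needs more than 8 GB at that sequence length, reducing the sample length as discussed above is the more reliable fix.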

@schematical

> I have a GTX 1060 6 GB and I also have this problem.
> Searching on Google, I read that the batch size should be reduced, so I launched
> PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt
> but I got the same error.
> I then changed this line in train.py:
> return [data_sampler.sample(1024) for _ in range(args.batch_size)]
> to:
> return [data_sampler.sample(512) for _ in range(args.batch_size)]
> But I don't know what this line does. How will the training change?
> If this change is not good, how can I fix it?

This fix worked for me.
