Apex issue #14

InaamHassan · 2021-09-03T09:05:51Z

When I run `!bash run_training.sh` after `%cd scripts`, I get the following error.

`09/03/2021 08:57:05 - INFO - main - Saving features into cached file ../data/AG-news/cached_train_gpt2-medium_192_sst-2
Traceback (most recent call last):
File "../train_GeDi.py", line 193, in train
from apex import amp
ModuleNotFoundError: No module named 'apex'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "../train_GeDi.py", line 1103, in <module>
main()
File "../train_GeDi.py", line 1052, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "../train_GeDi.py", line 195, in train
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
ImportError: Please install apex from https://www.github.com/nvidia/apex to use fp16 training.`

This happens even though apex is installed. How can I resolve this issue?

akhileshgotmare · 2021-09-07T07:02:49Z

Hi! Did you follow the commands in this script for setting up apex? https://github.com/salesforce/GeDi/blob/master/scripts/setup.sh
Reference: https://github.com/NVIDIA/apex#linux

InaamHassan · 2021-09-17T07:44:42Z

Yes, I did follow those commands, but they did not help. I identified the issue and worked around it by commenting out the raised exception. It installs without any errors after that. But the next problem I face is that, during training, it raises another exception:

Epoch: 0% 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File "../train_GeDi.py", line 1103, in <module>
main()
File "../train_GeDi.py", line 1052, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "../train_GeDi.py", line 355, in train
loss_a*=loss_mask
File "/usr/local/lib/python3.7/dist-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/wrap.py", line 53, in wrapper
return orig_fn(*args, **kwargs)
RuntimeError: Output 0 of SplitBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.

I cannot find any clue on how to solve this. I found no resources online, and I have tried altering as much of the code as I can, but to no avail.

InaamHassan · 2021-09-17T08:12:50Z

I was able to resolve this error.
You just have to change

loss_a*=loss_mask
loss_b*=loss_mask

to

loss_a = loss_a * loss_mask
loss_b = loss_b * loss_mask

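in train_GeDi.py at line 355. The error occurs because `*=` is an in-place operation, and here it modifies a tensor that is a view produced by a split, which PyTorch's autograd forbids. Writing the multiplication out-of-place creates a new tensor instead.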
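
For anyone hitting the same error, here is a minimal sketch of the difference, outside of GeDi. The tensor names `scores` and `loss_mask` and their shapes are made up for illustration and are not taken from train_GeDi.py; the only assumption is that `loss_a` and `loss_b` come out of a split, as the `SplitBackward` in the message suggests.

```python
import torch

# Hypothetical stand-ins for the real tensors; shapes are illustrative only.
scores = torch.randn(4, 3, requires_grad=True)
loss_mask = torch.tensor([1.0, 1.0, 0.0])

# torch.split returns several views of one underlying tensor.
loss_a, loss_b = torch.split(scores * 2.0, 2, dim=0)

# In-place multiplication on such a view is what triggered the RuntimeError
# in the traceback above. The attempt is wrapped so the script keeps running
# whichever way the installed PyTorch version behaves.
try:
    loss_a *= loss_mask
    in_place_raised = False
except RuntimeError:
    in_place_raised = True

# Out-of-place version: each product is a new tensor, so no view is modified.
loss_a, loss_b = torch.split(scores * 2.0, 2, dim=0)
loss_a = loss_a * loss_mask
loss_b = loss_b * loss_mask

# Gradients still flow back to the original tensor.
total = loss_a.sum() + loss_b.sum()
total.backward()
print("in-place raised:", in_place_raised)
print(scores.grad)
```

The masked column gets a zero gradient, which is the same effect the in-place version was meant to have.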

InaamHassan · 2021-09-21T10:08:04Z

I am running my code on Google Colab with 12 GB of RAM and on CUDA, but it gives me this error:

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; 1.81 MiB free; 10.66 GiB reserved in total by PyTorch)

The failure occurs on a request for just 12 MiB. How can I free up GPU memory, given that PyTorch has reserved most of it? What I have tried on my end:

Cache Cleaning
Runtime Restart
Reduced dataset

But to no avail.
