Apex issue #14

Open
InaamHassan opened this issue Sep 3, 2021 · 4 comments

@InaamHassan

When I run "!bash run_training.sh" after "%cd scripts", I get the following error:

`09/03/2021 08:57:05 - INFO - main - Saving features into cached file ../data/AG-news/cached_train_gpt2-medium_192_sst-2
Traceback (most recent call last):
File "../train_GeDi.py", line 193, in train
from apex import amp
ModuleNotFoundError: No module named 'apex'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "../train_GeDi.py", line 1103, in <module>
main()
File "../train_GeDi.py", line 1052, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "../train_GeDi.py", line 195, in train
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
ImportError: Please install apex from https://www.github.com/nvidia/apex to use fp16 training.`

However, apex is installed. How can I resolve this issue?
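One thing worth ruling out, as an assumption rather than something confirmed in this thread, is that apex was installed for a different Python interpreter than the one that runs train_GeDi.py. The minimal sketch below uses only the standard library; `module_available` is a hypothetical helper name, not part of GeDi or apex.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running and whether it can see apex.
print("Interpreter:", sys.executable)
print("apex importable:", module_available("apex"))
```

To be meaningful, this has to be run with the same interpreter that the training script uses, which is not shown in the thread.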

@akhileshgotmare
Collaborator

Hi! Did you follow the commands in this script for setting up apex? https://github.com/salesforce/GeDi/blob/master/scripts/setup.sh
Reference: https://github.com/NVIDIA/apex#linux

@InaamHassan
Author

InaamHassan commented Sep 17, 2021

Yes, I did follow those commands, but they did not help. I worked around the issue by commenting out the raised exception; after that, it installs without any errors. The next problem I face is that training raises another exception:

Epoch: 0% 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File "../train_GeDi.py", line 1103, in <module>
main()
File "../train_GeDi.py", line 1052, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "../train_GeDi.py", line 355, in train
loss_a*=loss_mask
File "/usr/local/lib/python3.7/dist-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/wrap.py", line 53, in wrapper
return orig_fn(*args, **kwargs)
RuntimeError: Output 0 of SplitBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.

I cannot find any clue on how to solve this. I found no resources online, and I have tried altering as much of the code as I can, but to no avail.

@InaamHassan
Author

InaamHassan commented Sep 17, 2021

I was able to resolve this error.
You just have to change

loss_a*=loss_mask
loss_b*=loss_mask

to

loss_a = loss_a * loss_mask
loss_b = loss_b * loss_mask

in train_GeDi.py at line 355. The error occurs because the `*=` operator modifies the tensor in place, and, according to the error message, the tensor is a view returned by a split, on which in-place operations are forbidden.
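The difference between the two forms can be sketched in isolation. This is a minimal illustration, not the code from train_GeDi.py: the tensor shapes, the values, and the names `mask_losses` and `inplace_raises` are hypothetical, and it assumes only that the losses come from `torch.split`, as the `SplitBackward` in the error message suggests.

```python
import torch


def mask_losses(per_token_loss, loss_mask):
    """Split the loss in two halves and mask each half out of place."""
    half = per_token_loss.shape[0] // 2
    # torch.split returns several views of the same underlying tensor.
    loss_a, loss_b = torch.split(per_token_loss, half, dim=0)
    # Out-of-place multiplication creates new tensors instead of
    # writing into the views, so autograd accepts it.
    loss_a = loss_a * loss_mask
    loss_b = loss_b * loss_mask
    return loss_a, loss_b


def inplace_raises(per_token_loss, loss_mask):
    """Return True if the in-place form is rejected with a RuntimeError."""
    half = per_token_loss.shape[0] // 2
    loss_a, _ = torch.split(per_token_loss, half, dim=0)
    try:
        loss_a *= loss_mask  # the form used in the original line
    except RuntimeError:
        return True
    return False


# Hypothetical data: 4 sequences, 3 positions, last position masked out.
logits = torch.ones(4, 3, requires_grad=True)
loss_mask = torch.tensor([1.0, 1.0, 0.0])

loss_a, loss_b = mask_losses(logits * 2.0, loss_mask)
(loss_a.sum() + loss_b.sum()).backward()
print("gradient:", logits.grad)
print("in-place form rejected:", inplace_raises(logits * 2.0, loss_mask))
```

The out-of-place form costs one extra temporary tensor per multiplication, which is usually negligible compared with the model activations.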

@InaamHassan
Author

I am running my code on Google Colab with 12 GB of RAM and on CUDA, but it gives me this error:

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; 1.81 MiB free; 10.66 GiB reserved in total by PyTorch)

Allocating just 12 MiB makes CUDA run out of memory. How can I free up memory from PyTorch, given that it has reserved most of it? This is what I have tried on my end:

  1. Clearing the cache
  2. Restarting the runtime
  3. Reducing the dataset

None of these helped.
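The figures in the error message can be read a little more closely. The sketch below is plain arithmetic on the quoted message using only the standard library; `parse_oom` is a hypothetical helper, and the regular expressions assume the message wording shown above.

```python
import re

OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB "
    "(GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; "
    "1.81 MiB free; 10.66 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in a CUDA OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory that PyTorch holds in its cache but that no live tensor occupies.
cached_unused = sizes["reserved"] - sizes["allocated"]
allocated_share = sizes["allocated"] / sizes["total"]

print(f"reserved but unallocated: {cached_unused:.2f} MiB")
print(f"share of the GPU held by live tensors: {allocated_share:.1%}")
```

Taken at face value, these numbers suggest that most of the reserved memory is occupied by live tensors rather than by idle cache, so clearing the cache could return only the small reserved-but-unallocated remainder. That would be consistent with cache clearing and a smaller dataset not helping, since neither changes how much memory a single training step needs.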
