
Code implementation - Training #10

Open
selfcontrol7 opened this issue Nov 2, 2020 · 9 comments

selfcontrol7 commented Nov 2, 2020

Hello,

I am trying to run the training code, but I run into this error:

2020-11-02 07:36:08.658676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 232k/232k [00:00<00:00, 318kB/s]
Downloading: 100% 442/442 [00:00<00:00, 284kB/s]
Downloading: 100% 268M/268M [00:04<00:00, 62.7MB/s]
Traceback (most recent call last):
  File "train_ditto.py", line 103, in <module>
    run_tag)
  File "Snippext_public/snippext/mixda.py", line 253, in initialize_and_train
    alpha_aug=hp.alpha_aug)
  File "Snippext_public/snippext/mixda.py", line 152, in train
    with amp.scale_loss(loss, optimizer) as scaled_loss:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 82, in scale_loss
    raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  "
RuntimeError: Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before `with amp.scale_loss`.

Could you please guide me in solving this issue?

oi02lyl (Contributor) commented Nov 2, 2020

  1. Can you share the command that you used to run the code?
  2. Are you using GPU or CPU?

Looking at the error message, it seems to me that fp16 is on but the GPU is not in use.
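A hypothetical sketch of that diagnosis (not Ditto's actual code): apex's mixed-precision path is only valid on a CUDA device, so an fp16 request should effectively be gated on GPU availability:

```python
def resolve_fp16(fp16_requested: bool, cuda_available: bool) -> bool:
    """Mixed precision via apex amp needs a CUDA device; without one the
    amp state is never initialized, so fall back to plain fp32 training."""
    return fp16_requested and cuda_available

# --fp16 passed but no GPU visible (e.g. a CPU-only Colab runtime):
print(resolve_fp16(True, False))  # False -> train in fp32 instead
```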


selfcontrol7 commented Nov 3, 2020

Hi, thank you for your reply.

  1. Here is the code I am running after installing all the requirements.
!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
  --task Structured/Beer \
  --batch_size 64 \
  --max_len 64 \
  --lr 3e-5 \
  --n_epochs 40 \
  --finetuning \
  --lm distilbert \
  --fp16 \
  --da del \
  --dk product \
  --summarize
  2. I am running the notebook on Colab. The runtime type was set to None, so I don't think the GPU was used.
     Does that mean I must use the GPU setting instead?

I just tried the code again using the GPU runtime setting, added the lines `import nltk` and `nltk.download('stopwords')` before the training part, and that solved the issue.

Thank you @oi02lyl .

If needed I can share the

selfcontrol7 commented

Now that the training part is done, I tried to run the matching code as follows:

!CUDA_VISIBLE_DEVICES=0 python matcher.py \
  --task wdc_all_small \
  --input_path input/input_small.jsonl \
  --output_path output/output_small.jsonl \
  --lm distilbert \
  --use_gpu \
  --fp16 \
  --checkpoint_path checkpoints/

but it seems that the model cannot be found:

2020-11-03 05:51:50.890227: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "matcher.py", line 212, in <module>
    hp.lm, hp.use_gpu, hp.fp16)
  File "matcher.py", line 170, in load_model
    raise ModelNotFoundError(checkpoint)
ditto.exceptions.ModelNotFoundError: Model checkpoints/wdc_all_small.pt was not found
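Judging by the error text alone (a sketch; the real matcher.py may differ), the checkpoint lookup resolves `<checkpoint_path>/<task>.pt`, so the task name used at training time and the `--task` passed to the matcher must agree:

```python
import os

def find_checkpoint(checkpoint_path: str, task: str) -> str:
    # Naming convention inferred from the error message:
    # checkpoints/wdc_all_small.pt for --task wdc_all_small
    ckpt = os.path.join(checkpoint_path, task + ".pt")
    if not os.path.exists(ckpt):
        raise FileNotFoundError("Model %s was not found" % ckpt)
    return ckpt
```

Here the model was trained with `--task Structured/Beer`, so no `wdc_all_small.pt` checkpoint would exist under `checkpoints/`.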

I also tried the notebook mentioned in #9, but the same error appears.

Any help would be appreciated.

oi02lyl (Contributor) commented Nov 4, 2020

You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

selfcontrol7 commented

Hi, thank you again for your reply.
I will try the updated notebook and come back to you ASAP.

Thanks.

selfcontrol7 commented

> You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

Hello,

I tried the updated notebook, but the warning below is shown when running the matcher.
Could you please guide me in resolving it?

Thank you.

Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py:25: UserWarning: An input tensor was not cuda.
  warnings.warn("An input tensor was not cuda.")
4398it [00:07, 573.10it/s]

oi02lyl (Contributor) commented Nov 10, 2020

I see. This is because we install only the Python-only build of apex; more details here: https://github.com/NVIDIA/apex#linux. I think the warning is safe to ignore in this case. You can also install the build with the CUDA and C++ extensions by following their instructions.
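A quick way to check which apex build is installed (the warning above fires precisely because the compiled `amp_C` module is missing):

```python
# 'amp_C' is the fused-kernel extension apex compiles only when installed
# with --cuda_ext --cpp_ext; a plain Python-only install will lack it.
try:
    import amp_C  # noqa: F401
    build = "apex with CUDA/C++ extensions"
except ImportError:
    build = "Python-only apex build (fused kernels unavailable)"
print(build)
```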


braswent commented Apr 29, 2021

> You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

@oi02lyl I am having similar issues with the checkpoint not being found. I tried the link you posted, but it says I do not have the correct credentials to see the file. Would you mind opening the notebook to public viewing? Thanks.


saharyi commented Aug 18, 2021

> https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

This link doesn't work for me. I get this:

There was an error loading this notebook. Ensure that the file is accessible and try again.
Invalid Credentials
https://drive.google.com/drive/?action=locate&id=1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2&authuser=3

Please help me.
