
confidence_loss becomes nan #1

Open
kosuke55 opened this issue Aug 13, 2020 · 6 comments

@kosuke55 (Owner)

confidence_loss 0.056517358869314194
depth_loss 0.004850194789469242
rotation_loss 0.3008980453014374
epoch 4, 334/1266,train loss is confidence:21.170931436121464 rotation:91.45359233021736 depth:10.529736892785877
train epoch=4:  26%|███████████████▌                                           | 335/1266 [28:35<1:31:47,  5.92s/it]/home/kosuke55/catkin_ws/src/hanging_points_cnn/hanging_points_cnn/utils/rois_tools.py:176: RuntimeWarning: invalid value encountered in greater
  confidence_mask[confidence_mask > 1] = 1
/home/kosuke55/catkin_ws/src/hanging_points_cnn/hanging_points_cnn/utils/rois_tools.py:177: RuntimeWarning: invalid value encountered in less
  confidence_mask[confidence_mask < 0] = 0
train_hpnet.py:159: RuntimeWarning: invalid value encountered in greater_equal
  confidence_np[confidence_np >= 1] = 1.
train_hpnet.py:160: RuntimeWarning: invalid value encountered in less_equal
  confidence_np[confidence_np <= 0] = 0.
confidence_loss nan
depth_loss 0.0
rotation_loss 0.0
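
The RuntimeWarnings above come from nan values reaching the element-wise comparisons used for clipping. A quick check confirms the source (a sketch only; confidence_np here is a hypothetical array containing nan, mimicking the variable in train_hpnet.py):

import numpy as np

confidence_np = np.array([0.3, np.nan, 1.2])  # hypothetical prediction containing nan

# In the environment from the log above, comparing nan against a number is what
# emits "invalid value encountered in greater_equal" / "less_equal".
print(np.isnan(confidence_np).any())  # True -> the predictions already contain nan

confidence_np[confidence_np >= 1] = 1.
confidence_np[confidence_np <= 0] = 0.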
@kosuke55 (Owner)

/media/kosuke/SANDISK/hanging_points_net/checkpoints/gray/hpnet_latestmodel_20200812_2224.pt

@kosuke55 (Owner)

The network output itself is nan.

ipdb> hp_data
tensor([[[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          ...,
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

         [[0.3686, 0.3686, 0.3765,  ..., 0.4275, 0.4275, 0.4275],
          [0.3686, 0.3686, 0.3765,  ..., 0.4314, 0.4314, 0.4314],
          [0.3725, 0.3725, 0.3804,  ..., 0.4392, 0.4392, 0.4392],
          ...,
          [0.4000, 0.4000, 0.3961,  ..., 0.4196, 0.4196, 0.4196],
          [0.4039, 0.4039, 0.4000,  ..., 0.4196, 0.4196, 0.4196],
          [0.4039, 0.4039, 0.4000,  ..., 0.4196, 0.4196, 0.4196]]]],
       device='cuda:0')
ipdb> self.model(hp_data)
(tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], device='cuda:0',
       grad_fn=<ReluBackward1>), tensor([[nan, nan, nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward>))
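
To narrow down where the nan first appears inside the network (not something the thread does; self.model mirrors the trainer attribute used in the other snippets), PyTorch's anomaly detection plus a forward hook is one possible sketch:

import torch

# Makes backward() raise at the op that produced nan/inf; slow, so enable it
# only for a few debugging iterations.
torch.autograd.set_detect_anomaly(True)

# Report the first module whose forward output contains nan.
def report_nan(module, inputs, output):
    outputs = output if isinstance(output, (tuple, list)) else (output,)
    for o in outputs:
        if torch.is_tensor(o) and torch.isnan(o).any():
            print('nan in output of', module.__class__.__name__)

for m in self.model.modules():
    m.register_forward_hook(report_nan)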

@kosuke55 (Owner)

Save the previous model and optimizer??
https://qiita.com/syoamakase/items/a9b3146e09f9fcafbb66

            if torch.isnan(loss):
                # Roll back to the last model that produced a finite loss and
                # rebuild the optimizer around its parameters.
                print('loss is nan!!')
                self.model = self.prev_model
                self.optimizer = torch.optim.Adam(
                    self.prev_model.parameters(), lr=args.lr, betas=(0.9, 0.999),
                    eps=1e-10, weight_decay=0, amsgrad=False)
                self.optimizer.load_state_dict(
                    self.prev_optimizer.state_dict())
                continue
            else:
                # Keep a snapshot of the last healthy model and optimizer.
                self.prev_model = copy.deepcopy(self.model)
                self.prev_optimizer = copy.deepcopy(self.optimizer)
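
For this rollback to work, self.prev_model and self.prev_optimizer have to exist before the first iteration. A minimal sketch of that setup, placed in the trainer before the epoch loop (attribute names just mirror the snippet above):

import copy

# Seed the rollback snapshots with the initial (or resumed) model and optimizer
# so the first nan check has something to fall back to.
self.prev_model = copy.deepcopy(self.model)
self.prev_optimizer = copy.deepcopy(self.optimizer)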

@kosuke55 (Owner) commented Aug 13, 2020

A gradient somewhere is blowing up. Clip the large values with this:
https://pytorch.org/docs/master/generated/torch.nn.utils.clip_grad_norm_.html

torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2)
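
As a rough sketch of where that call sits in the training loop (not from the thread; model, optimizer, criterion, inputs and targets are placeholders, and max_norm=10.0 is only an example value to be tuned as discussed below):

# Inside the training loop: clip after backward(), before step().
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()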

From https://discuss.pytorch.org/t/how-to-check-norm-of-gradients/13795/2:

Q: How do we choose the hyperparameter c?
A: We can train our neural networks for some epochs and look at the statistics of the gradient norms. The average value of gradient norms is a good initial trial.

So a good clip value is the average gradient norm measured over the first few epochs.

How to compute it:
https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48

# Total L2 norm of all parameter gradients (total_norm must start at 0).
total_norm = 0.0
for p in self.model.parameters():
    if p.grad is None:
        continue
    param_norm = p.grad.data.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** (1. / 2)
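
Aside (not in the original comment): torch.nn.utils.clip_grad_norm_ itself returns the total pre-clipping norm, so the same statistic can also be collected from its return value while training:

# The return value is the total norm before clipping; appending it to a list
# (grad_norm_history is a placeholder) gives the statistics used to pick max_norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
grad_norm_history.append(float(total_norm))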

@kosuke55 (Owner) commented Aug 17, 2020

Even with clip_grad_norm the loss still ends up as nan...
As in #1 (comment), lowering the batch size (64 -> 16) at least keeps training from stopping at nan for now.

@kosuke55 (Owner)

Is v_pred getting too small?

In [122]: v_pred = torch.Tensor([1e-30, 0, 0])
     ...: print(torch.norm(v_pred))
     ...: v_pred_n = v_pred / torch.norm(v_pred )
     ...: print(v_pred_n)
     ...:
tensor(0.)
tensor([inf, nan, nan])
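
A common guard for this (not from the thread itself, just a sketch) is to keep the denominator away from zero, either manually or with torch.nn.functional.normalize, which divides by max(norm, eps):

import torch
import torch.nn.functional as F

v_pred = torch.Tensor([1e-30, 0, 0])

# Manual guard: the epsilon keeps the division finite even when the norm underflows to 0.
v_pred_n = v_pred / (torch.norm(v_pred) + 1e-8)

# Built-in equivalent: F.normalize divides by max(norm, eps) along dim.
v_pred_n = F.normalize(v_pred, dim=0, eps=1e-8)
print(v_pred_n)  # finite values instead of inf/nan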
