Getting ValueError randomly during training #7

Open
chaitrasj opened this issue Jan 13, 2022 · 2 comments

@chaitrasj

Thank you for making the code available.
I am running the repo as is; the only change is that I reduced the batch size from 64 to 32 because of memory constraints.
I am running the code on two Nvidia 1080 Ti GPUs, each with 12 GB of memory.

However, after a few epochs I randomly get a ValueError:
ValueError: Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')

It shows up randomly after 10 epochs. The full stack trace is below.
Could you please help me get your code running?

Epoch: [25][160/200] Time 2.152 (2.173) Total loss 6.960 (7.223) Loss 3.233(3.638) LossMeta 3.728(3.585)
Epoch: [25][165/200] Time 2.192 (2.173) Total loss 7.966 (7.204) Loss 4.791(3.644) LossMeta 3.174(3.560)
Traceback (most recent call last):
File "main.py", line 286, in <module>
main()
File "main.py", line 108, in main
main_worker(args)
File "main.py", line 202, in main_worker
print_freq=args.print_freq, train_iters=args.iters)
File "/home/sarosij/M3L/reid/trainers.py", line 89, in train
f_test, mte_tri = self.newMeta(testInputs, MTE=self.args.BNtype)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/M3L/reid/models/resMeta.py", line 180, in forward
bn_x = self.feat_bn(x, MTE, save_index)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/M3L/reid/models/MetaModules.py", line 362, in forward
Distri1 = Normal(self.meta_mean1, self.meta_var1)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/distributions/normal.py", line 50, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/distributions/distribution.py", line 56, in __init__
f"Expected parameter {param} "
ValueError: Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')
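
A note on reading this error: the printed tensor is truncated with `...`, so the entries that actually fail the check may not be among the values shown. `Normal` validates that `scale > 0`, and a NaN entry fails that comparison just like a zero or negative one. Below is a minimal diagnostic sketch, not the repo's code. `describe_invalid_scale` and `safe_normal` are hypothetical helper names, and the `eps` floor is an arbitrary assumption. Replacing NaNs only hides the symptom, so the counts are mainly useful for finding out where the bad values come from.

```python
import torch
from torch.distributions import Normal


def describe_invalid_scale(scale: torch.Tensor) -> dict:
    """Count the entries that would fail Normal's `scale > 0` validation."""
    nan_mask = torch.isnan(scale)
    # NaN <= 0 evaluates to False, so NaNs are counted separately above.
    non_positive_mask = scale <= 0
    return {
        "nan": int(nan_mask.sum()),
        "non_positive": int(non_positive_mask.sum()),
    }


def safe_normal(loc: torch.Tensor, scale: torch.Tensor, eps: float = 1e-6) -> Normal:
    """Hypothetical guard: floor the scale at `eps` before building Normal.

    Assumption: `eps` is an illustrative value, not one taken from the repo.
    """
    # clamp() propagates NaN, so NaNs have to be replaced first.
    scale = torch.nan_to_num(scale, nan=eps)
    scale = scale.clamp(min=eps)
    return Normal(loc, scale)


if __name__ == "__main__":
    scale = torch.tensor([1.2e-4, float("nan"), 0.0, 3.9e-5])
    print(describe_invalid_scale(scale))
    dist = safe_normal(torch.zeros(4), scale)
    print(dist.scale)
```

In the traceback above, the call to inspect would be `Normal(self.meta_mean1, self.meta_var1)` in `MetaModules.py`; logging `describe_invalid_scale(self.meta_var1)` just before it should show whether NaNs or zeros are present.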

@chaitrasj
Author

Hello,
I managed to get access to a GPU with more memory and ran the same code, unchanged, with the original batch size of 64. On a single Quadro RTX 8000 with 48 GB of memory, it runs without the ValueError (it uses only 23.7 GB of memory).

Src: MSMT17_V2+Duke+Cuhk03, Tgt: Market
At the end of 60 epochs I get the following mAP and CMC scores:
Mean AP: 41.1%
CMC Scores:
top-1 67.8%
top-5 82.7%
top-10 87.5%
Total running time: 1 day, 17:39:26.795512

These numbers seem roughly 8% lower than the ones reported in the paper.
Could you please suggest what might be causing this, or whether the meta-learning pipeline has hardware dependencies during optimization?

Thanks in advance!
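
For comparing runs across different GPUs, one thing that can help is fixing the random seeds first, so that run-to-run variance is not mistaken for a hardware effect. This is a generic sketch and not something from the repo. `seed_everything` is a hypothetical helper and the seed value is arbitrary. Even with it, PyTorch does not guarantee identical results across different GPU models or library versions.

```python
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed Python and PyTorch RNGs for repeatable runs."""
    random.seed(seed)
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels over auto-tuned ones (can be slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    seed_everything(123)
    first = torch.rand(4)
    seed_everything(123)
    second = torch.rand(4)
    print(torch.equal(first, second))
```

Calling this once at the start of training, before the model and data loaders are built, would be the usual placement.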

@Eurus2073

I'm trying to run the repo and am running into the same problem. On RTX 2080 Ti GPUs I get:
Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')
When I switch to an RTX A5000, the mAP degrades severely instead.

Could you please tell me how you resolved this problem? Thank you so much.
