Getting ValueError randomly during training #7

chaitrasj · 2022-01-13T12:30:51Z

Thank you for making the code available.
I am trying to run the repo as is; the only change is that I reduced the batch size from 64 to 32 due to memory constraints.
I am running the code on 2 Nvidia 1080Ti GPUs, each with 12 GB of memory.

However, after a few epochs I randomly keep getting a ValueError:
ValueError: Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')

The error appears at a random point after 10 epochs. The full stack trace is below.
Could you please help me get your code running?

Epoch: [25][160/200] Time 2.152 (2.173) Total loss 6.960 (7.223) Loss 3.233(3.638) LossMeta 3.728(3.585)
Epoch: [25][165/200] Time 2.192 (2.173) Total loss 7.966 (7.204) Loss 4.791(3.644) LossMeta 3.174(3.560)
Traceback (most recent call last):
File "main.py", line 286, in <module>
main()
File "main.py", line 108, in main
main_worker(args)
File "main.py", line 202, in main_worker
print_freq=args.print_freq, train_iters=args.iters)
File "/home/sarosij/M3L/reid/trainers.py", line 89, in train
f_test, mte_tri = self.newMeta(testInputs, MTE=self.args.BNtype)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/M3L/reid/models/resMeta.py", line 180, in forward
bn_x = self.feat_bn(x, MTE, save_index)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sarosij/M3L/reid/models/MetaModules.py", line 362, in forward
Distri1 = Normal(self.meta_mean1, self.meta_var1)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/distributions/normal.py", line 50, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/home/sarosij/anaconda3/envs/reid/lib/python3.6/site-packages/torch/distributions/distribution.py", line 56, in __init__
f"Expected parameter {param} "
ValueError: Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')
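
A note on reading this error: every value printed above is positive, so the offending entries are presumably among the ones elided by `...`. The constraint `GreaterThan(lower_bound=0.0)` rejects any entry for which `value > 0` is false, which covers zeros, negatives, and NaN. The following is a minimal, framework-free sketch of that check and of one possible workaround (flooring the scale at a small positive value). The helper names `find_invalid_scales` and `clamp_scales`, the floor `eps`, and the sample list are all hypothetical and not taken from the repository. On tensors, the analogous operations would be a mask such as `~(scale > 0)` and `torch.clamp`.

```python
def find_invalid_scales(scale):
    """Return (index, value) pairs that fail the positivity check `value > 0`.

    NaN compares False against everything, so NaN entries are reported too.
    """
    return [(i, v) for i, v in enumerate(scale) if not (v > 0.0)]


def clamp_scales(scale, eps=1e-6):
    """Raise every entry that is not above `eps` (NaN, zero, negatives,
    and tiny positives) to `eps`.

    `eps` is an illustrative choice, not a value from the repository.
    """
    return [v if v > eps else eps for v in scale]


# Hypothetical stand-in for the 2048-element scale tensor in the error:
# the positive values are copied from the report, the invalid ones are
# invented here purely to show what the check catches.
scale = [1.2194e-04, 1.5050e-04, 0.0, float("nan"), -3.0e-07, 1.1311e-05]

bad = find_invalid_scales(scale)
print("invalid entries:", bad)

fixed = clamp_scales(scale)
print("invalid entries after clamping:", find_invalid_scales(fixed))
```

Flooring the scale only hides the symptom. If the invalid entries turn out to be NaN, the cause is more likely upstream (for example, a diverging loss), so it is worth locating the bad indices before deciding on a fix.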

chaitrasj · 2022-01-17T07:59:07Z

Hello,
I managed to get a GPU with more memory and ran the same code without any changes, using the original batch size of 64. On a single Quadro RTX 8000 with 48 GB of memory, it runs without the ValueError (and uses only 23.7 GB of memory).

Src: MSMT17_V2+Duke+Cuhk03, Tgt: Market
At the end of 60 epochs I get the following mAP and CMC scores:
Mean AP: 41.1%
CMC Scores:
top-1 67.8%
top-5 82.7%
top-10 87.5%
Total running time: 1 day, 17:39:26.795512

These numbers seem about 8% lower than the ones reported in the paper.
Could you suggest what might be causing this? Could there be hardware dependencies when optimising the meta-learning pipeline?

Thanks in advance!

Eurus2073 · 2023-05-27T13:54:37Z

I'm trying to run the repo, and I'm running into the same problem. When I run it on RTX 2080 Ti GPUs, I get this error:
Expected parameter scale (Tensor of shape (2048,)) of distribution Normal(loc: torch.Size([2048]), scale: torch.Size([2048])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([1.2194e-04, 1.5050e-04, 2.8594e-03, ..., 3.8839e-05, 1.8705e-05,
1.1311e-05], device='cuda:0')
When I switch to an RTX A5000, a severe mAP degradation occurs.

Can you please tell me how you solved this problem? Thank you so much.
