Why is the training result NaN? #26

yyyyet · 2022-11-01T08:04:47Z

I have not changed any code, and the library versions match the requirements.
This is the result of my training:

Only the first picture was correct when I tested 10 pictures.

Thanks in advance!

alexanderkroner · 2022-11-01T20:32:27Z

Hi, I tested the code again but wasn't able to reproduce this issue. Are all the package requirements satisfied?

yyyyet · 2022-11-02T02:06:21Z

Everything is installed according to the requirements you provided.
I used to have a computer that ran the program successfully, but after switching to a new computer it no longer works. Could it be because the 3070 can't use CUDA 10.0? There are no errors or warnings while it runs, though.
Right now it only succeeds when testing with a single image, for example: python main.py test -d salicon -p salicon.jpg
These are all the libraries I use:

alexanderkroner · 2022-11-02T08:15:34Z

Indeed, the package versions are the same as the ones I tested the code with. Could it be that this issue is relevant for you?
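
The CUDA 10.0 question raised above can be checked against NVIDIA's published compute capabilities. The sketch below is not part of this repository: the table is deliberately partial, and the function name `cuda_supports_gpu` is a hypothetical helper introduced for illustration. The RTX 3070 is an Ampere card (compute capability 8.6), while CUDA 10.0 targets architectures only up to Turing (7.5), so a framework built against CUDA 10.0 has no native kernels for it. That is one plausible, unconfirmed explanation for NaN results appearing without any error or warning.

```python
# Highest GPU compute capability each CUDA toolkit release can target natively.
# Partial table for illustration only; later releases are not listed.
MAX_COMPUTE_CAPABILITY = {
    (10, 0): (7, 5),  # up to Turing
    (11, 0): (8, 0),  # adds Ampere (A100)
    (11, 1): (8, 6),  # adds Ampere (RTX 30 series)
    (11, 8): (9, 0),  # adds Ada (8.9) and Hopper (9.0)
}


def cuda_supports_gpu(cuda_version, compute_capability):
    """Return True if the given CUDA release natively targets the GPU.

    Uses the newest table entry that is not newer than `cuda_version`.
    """
    known = [v for v in sorted(MAX_COMPUTE_CAPABILITY) if v <= cuda_version]
    if not known:
        raise ValueError(f"no table entry at or below CUDA {cuda_version}")
    return compute_capability <= MAX_COMPUTE_CAPABILITY[known[-1]]


RTX_3070 = (8, 6)
RTX_4090 = (8, 9)

print(cuda_supports_gpu((10, 0), RTX_3070))  # False
print(cuda_supports_gpu((11, 1), RTX_3070))  # True
print(cuda_supports_gpu((10, 0), RTX_4090))  # False
```

If the check returns False for your card, a build of the framework compiled against a newer CUDA release is probably needed.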

yyyyet · 2022-11-03T07:29:01Z

Well, I don't know. I'll try again. Thank you.

isksjsksk · 2023-12-08T19:36:04Z

Indeed, the package versions are the same as the ones I tested the code with. Could it be that this issue is relevant for you?

I had the same problem with an RTX 4090, and this helped a lot!

Vaishnavi-Na · 2024-06-06T16:46:32Z

Hi there!
I'm having a similar issue with the training result being NaN and only the first picture generating a result. I'm currently using a Windows computer, and I believe Nvidia-TensorFlow is for Linux machines. Do you have any suggestions for this? Thank you so much!

achilatiao · 2024-06-07T14:35:04Z

Same NaN here. Running on the CPU is just as fast as running on the GPU.

achilatiao · 2024-06-07T15:01:25Z

Same NaN problem. Just delete the model you trained first, and it will be solved.

Vaishnavi-Na · 2024-06-07T15:12:19Z

Same NaN problem. Just delete the model you trained first, and it will be solved.

What do you mean by deleting the model I trained first?
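
One way to read the suggestion above, sketched in Python: clear the checkpoints and result files left over from an earlier run before training again, in case training picks up an old or already-diverged model. The helper name `clear_training_outputs` and the directory names `results` and `weights` are assumptions made for illustration, not paths confirmed in this thread. Check where your checkout stores its outputs, and keep `dry_run=True` until the listed paths look right.

```python
import shutil
from pathlib import Path


def clear_training_outputs(directories, dry_run=True):
    """Delete the contents of each directory, keeping the directory itself.

    Returns the list of paths that were (or, in a dry run, would be) removed.
    """
    removed = []
    for directory in map(Path, directories):
        if not directory.is_dir():
            continue  # nothing to clean here
        for entry in sorted(directory.iterdir()):
            removed.append(entry)
            if dry_run:
                continue  # only report, do not delete
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return removed


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever your checkout stores them.
    for path in clear_training_outputs(["results", "weights"], dry_run=True):
        print("would remove:", path)
```

Note that the log later in this thread still shows NaN from epoch 11 after such a cleanup, so this step may not be sufficient on its own.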

achilatiao · 2024-06-07T15:14:12Z

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1.415068
1.325778
1.373514
nan

achilatiao · 2024-06-07T15:14:28Z

Seems like there are still bugs.

achilatiao · 2024-06-07T15:21:19Z

I used only 100 pictures for these three:
1.415068
1.325778
1.373514
The others used 10000 pictures and returned NaN.

Then I deleted all the results and weights and started the training again. Here is the output:

Epoch 01/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.283717 (0:00:12)
Valid loss: 1.149827 (0:00:01)
Best model!
Epoch 02/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.346064 (0:00:15)
Valid loss: 1.299754 (0:00:01)
Epoch 03/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.339179 (0:00:05)
Valid loss: 2.202006 (0:00:01)
Epoch 04/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.382328 (0:00:05)
Valid loss: 2.739692 (0:00:01)
Epoch 05/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.381477 (0:00:05)
Valid loss: 1.158773 (0:00:01)
Epoch 06/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.343635 (0:00:05)
Valid loss: 1.213211 (0:00:01)
Epoch 07/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.419941 (0:00:05)
Valid loss: 1.211412 (0:00:01)
Epoch 08/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.429422 (0:00:05)
Valid loss: 1.240724 (0:00:01)
Epoch 09/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.385073 (0:00:05)
Valid loss: 1.238499 (0:00:01)
Epoch 10/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.423644 (0:00:05)
Valid loss: 1.060307 (0:00:01)
Best model!
Epoch 11/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:18)
Valid loss: nan (0:00:01)
Epoch 12/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 13/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 14/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 15/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 16/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 17/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 18/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 19/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 20/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:05)
Valid loss: nan (0:00:01)
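
To pin down where a run like the one above diverges, the log can be scanned for the first epoch with a NaN loss. This is a minimal sketch that assumes the exact line format shown in the log. The names `parse_losses` and `first_nan_epoch` are hypothetical helpers, not part of the repository.

```python
import math
import re

EPOCH_RE = re.compile(r"^Epoch (\d+)/(\d+)")
LOSS_RE = re.compile(r"^(Train|Valid) loss: (\S+)")


def parse_losses(log_text):
    """Return a list of (epoch, train_loss, valid_loss) tuples from the log."""
    rows, epoch, losses = [], None, {}
    for line in log_text.splitlines():
        line = line.strip()
        match = EPOCH_RE.match(line)
        if match:
            epoch, losses = int(match.group(1)), {}
            continue
        match = LOSS_RE.match(line)
        if match and epoch is not None:
            losses[match.group(1)] = float(match.group(2))  # float("nan") is valid
            if len(losses) == 2:
                rows.append((epoch, losses["Train"], losses["Valid"]))
    return rows


def first_nan_epoch(rows):
    """Return the first epoch whose train or valid loss is NaN, else None."""
    for epoch, train, valid in rows:
        if math.isnan(train) or math.isnan(valid):
            return epoch
    return None


sample = """\
Epoch 10/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.423644 (0:00:05)
Valid loss: 1.060307 (0:00:01)
Best model!
Epoch 11/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:18)
Valid loss: nan (0:00:01)
"""

rows = parse_losses(sample)
print(first_nan_epoch(rows))  # 11
```

In the run above, the loss turns NaN at epoch 11, directly after the lowest validation loss at epoch 10, and never recovers.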

yyyyet closed this as completed Jun 7, 2024
