I would like to ask why the training result is NaN #26

Closed
yyyyet opened this issue Nov 1, 2022 · 12 comments

Comments

@yyyyet

yyyyet commented Nov 1, 2022

I have not changed the code, and the library versions match the requirements.
This is the result of my training:
[image]
When I tested 10 pictures, only the first one produced a correct result:
[image]
Thanks in advance!

@alexanderkroner
Owner

Hi, I tested the code again but wasn't able to reproduce this issue. Are all the package requirements satisfied?

@yyyyet
Author

yyyyet commented Nov 2, 2022

Everything is installed according to the requirements you provided.
I used to have a computer that ran the program successfully, but after switching to a new computer it no longer works. Is it because the RTX 3070 can't use CUDA 10.0? There are no errors or warnings while it runs.
Testing currently succeeds only with a single image, for example: python main.py test -d salicon -p salicon.jpg
These are all the libraries I use:
[image]
[image]
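
The question above about the RTX 3070 and CUDA 10.0 can be framed as a compute capability check. The following is a minimal sketch, not part of the repository: the table lists the highest compute capability that a few CUDA toolkit releases target natively, and the helper name `natively_supported` is made up for illustration. Whether such a mismatch is what actually produces the NaN values reported here is not confirmed in this thread.

```python
# Highest GPU compute capability that each CUDA toolkit release targets natively.
# Only the releases relevant to this thread are listed.
MAX_CC_BY_CUDA = {
    (10, 0): (7, 5),  # up to Turing
    (11, 0): (8, 0),  # adds Ampere (A100)
    (11, 1): (8, 6),  # adds Ampere GeForce RTX 30 series
    (11, 8): (9, 0),  # adds Ada (8.9) and Hopper (9.0)
}

# Compute capability of the GPUs mentioned in this thread.
GPU_CC = {
    "RTX 3070": (8, 6),
    "RTX 4090": (8, 9),
}


def natively_supported(gpu, cuda_version):
    """Return True if the toolkit release can target the GPU's architecture."""
    return GPU_CC[gpu] <= MAX_CC_BY_CUDA[cuda_version]


for gpu in GPU_CC:
    for cuda in sorted(MAX_CC_BY_CUDA):
        print(gpu, "CUDA %d.%d" % cuda, natively_supported(gpu, cuda))
```

Under this simplified view, an RTX 3070 is outside what CUDA 10.0 targets, which is consistent with the code running on the older machine but not on the new one.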

@alexanderkroner
Owner

Indeed, the package versions are the same as the ones I tested the code with. Could it be that this issue is relevant for you?

@yyyyet
Author

yyyyet commented Nov 3, 2022

Well, I don't know. I'll try again. Thank you.

@isksjsksk

> Indeed, the package versions are the same as the ones I tested the code with. Could it be that this issue is relevant for you?

I had the same problem with an RTX 4090, and this helped a lot!

@Vaishnavi-Na

Hi there!
I'm having a similar issue: the training result is NaN, and only the first picture generates a result. I'm currently using a Windows computer, and I believe Nvidia-TensorFlow is for Linux machines. Do you have any suggestions for this? Thank you so much!

@achilatiao

Same NaN here. Training on the CPU runs at the same speed as on the GPU.

@achilatiao

Same NaN problem. Just delete the model you trained earlier, and it will be solved.

@Vaishnavi-Na

> Same NaN problem. Just delete the model you trained earlier, and it will be solved.

What do you mean by deleting the model I trained first?

@achilatiao

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1.415068
1.325778
1.373514
nan

@achilatiao

Seems like there are still bugs.

@achilatiao

I used only 100 pictures for these three results:
1.415068
1.325778
1.373514
The other runs used 10000 pictures and returned nan.

Then I deleted all the results and weights and started the training again. Here is the output:

Epoch 01/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.283717 (0:00:12)
Valid loss: 1.149827 (0:00:01)
Best model!
Epoch 02/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.346064 (0:00:15)
Valid loss: 1.299754 (0:00:01)
Epoch 03/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.339179 (0:00:05)
Valid loss: 2.202006 (0:00:01)
Epoch 04/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.382328 (0:00:05)
Valid loss: 2.739692 (0:00:01)
Epoch 05/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.381477 (0:00:05)
Valid loss: 1.158773 (0:00:01)
Epoch 06/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.343635 (0:00:05)
Valid loss: 1.213211 (0:00:01)
Epoch 07/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.419941 (0:00:05)
Valid loss: 1.211412 (0:00:01)
Epoch 08/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.429422 (0:00:05)
Valid loss: 1.240724 (0:00:01)
Epoch 09/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.385073 (0:00:05)
Valid loss: 1.238499 (0:00:01)
Epoch 10/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.423644 (0:00:05)
Valid loss: 1.060307 (0:00:01)
Best model!
Epoch 11/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:18)
Valid loss: nan (0:00:01)
Epoch 12/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 13/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 14/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 15/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 16/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 17/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 18/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 19/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:04)
Valid loss: nan (0:00:01)
Epoch 20/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:05)
Valid loss: nan (0:00:01)
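
To see where a run like the one above diverges without reading the whole log, the output can be scanned for the first non-finite loss. This is a minimal sketch that assumes the log format shown in this comment ("Epoch NN/MM ..." followed by "Train loss:" and "Valid loss:" lines); the function name `first_nonfinite_epoch` is made up for illustration and is not part of the repository.

```python
import math
import re

# Matches lines such as "Epoch 11/20 [====...] 100/100 (ETA: 0:00:00)".
EPOCH_LINE = re.compile(r"^Epoch\s+(\d+)/(\d+)")
# Matches lines such as "Train loss: 1.283717 (0:00:12)" or "Valid loss: nan (0:00:01)".
LOSS_LINE = re.compile(r"^(Train|Valid) loss:\s+(\S+)")


def first_nonfinite_epoch(log_text):
    """Return the first epoch whose train or valid loss is NaN or inf, else None."""
    epoch = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        epoch_match = EPOCH_LINE.match(line)
        if epoch_match:
            epoch = int(epoch_match.group(1))
            continue
        loss_match = LOSS_LINE.match(line)
        if loss_match and epoch is not None:
            try:
                value = float(loss_match.group(2))  # float("nan") parses to NaN
            except ValueError:
                continue  # skip lines whose loss field is not a number
            if not math.isfinite(value):
                return epoch
    return None


sample = """\
Epoch 10/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: 1.423644 (0:00:05)
Valid loss: 1.060307 (0:00:01)
Best model!
Epoch 11/20 [====================] 100/100 (ETA: 0:00:00)
Train loss: nan (0:00:18)
Valid loss: nan (0:00:01)
"""
print(first_nonfinite_epoch(sample))  # 11
```

For the log above, the first non-finite loss appears in epoch 11, directly after the "Best model!" checkpoint of epoch 10.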

@yyyyet yyyyet closed this as completed Jun 7, 2024