training error? odd DIST plot #408

Open
happyqiu opened this issue Oct 29, 2022 · 6 comments
Comments

@happyqiu

Hi,
I'm using the APT-develop branch. During training, the DIST panel didn't look right, although the Loss panel seemed okay. After training, I tried to track another video but didn't get any predicted labels. Do you have any idea what's happening here?

Thanks!
[Attached screenshots: DIST, track_img]

@allenleetc
Collaborator

Hi @happyqiu,

Can you please share your project (.lbl) file and the movie you are trying to track? You will probably need to share a link to a cloud service (e.g., Google Drive) because these files will be too large to attach here directly.

When you track the video, have you looked at the log in the Tracking Monitor? There might be warnings or other messages printed there. If this log is available and you can upload it here that might also be useful.

Thanks!

@happyqiu
Author

happyqiu commented Nov 1, 2022 via email

@allenleetc
Collaborator

@happyqiu

Strange, if I try tracking with your trained tracker, I get predictions but they are 'garbage' (all in the upper-left corner). Maybe that is why you don't see them?

However, if I retrain, my loss/dist plots look normal and the tracking looks good.

Nothing jumps out yet -- maybe if it's not difficult, try doing a fresh retrain to see if anything changes? (Please save the training log just in case.) So far your training data looks normal so I wonder if it could be something in your environment/platform.

@happyqiu
Author

happyqiu commented Nov 1, 2022 via email

@allenleetc
Collaborator

This may be a compatibility issue with the A5000. In develop we are on tf1.15; see e.g.:

https://discuss.tensorflow.org/t/tensorflow-and-cuda-support-for-latest-nvida-a5000-ampere-gpu/3886
https://embea.de/blog/?p=114

@mkabra could @happyqiu have Ampere compatibility issues even if they switch to the multianimal branch? One of these links seems to suggest that tf2.4 is required.

In general, the specific GPU can matter, as in e.g. #365.
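
For readers who want to sanity-check their own setup against the Ampere concern raised above, here is a minimal sketch. It is not part of APT, and the function names are hypothetical. It assumes that Ampere GPUs such as the A5000 report compute capability 8.x (8.6 for the A5000), that Ampere support was introduced in CUDA 11.0, and that the official TensorFlow 1.15 and 2.4 builds use CUDA 10.0 and 11.0 respectively. It only flags a possible mismatch; it does not prove that one is the cause of bad training.

```python
def parse_version(text):
    """Turn a string such as '8.6' or '10.0' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def possible_ampere_mismatch(compute_cap, cuda_version):
    """Return True when an Ampere-class GPU (compute capability 8.x) is paired
    with a framework build that uses a CUDA toolkit older than 11.0.

    Assumption: CUDA 11.0 is the first toolkit release with Ampere support.
    """
    cc = parse_version(compute_cap)
    cuda = parse_version(cuda_version)
    return cc[0] == 8 and cuda < (11, 0)


# Compute capability 8.6 (e.g. A5000) with a CUDA 10.0 build (e.g. stock tf1.15)
print(possible_ampere_mismatch("8.6", "10.0"))  # True
# Same GPU with a CUDA 11.0 build (e.g. stock tf2.4)
print(possible_ampere_mismatch("8.6", "11.0"))  # False
# A pre-Ampere GPU (compute capability 7.5) with a CUDA 10.0 build
print(possible_ampere_mismatch("7.5", "10.0"))  # False
```

The compute capability and CUDA version are passed in as plain strings so the check stays independent of any particular TensorFlow version or driver tool.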

@mkabra
Collaborator

mkabra commented Nov 4, 2022 via email
