
W&B ID reset on training completion #1852

Merged: glenn-jocher merged 4 commits into ultralytics:master on Jan 7, 2021

Conversation

@TommyZihao (Contributor) commented Jan 6, 2021

Fixes the bug where the W&B ID is always the same and new runs continue to overwrite the old logging.
Bug report: #1851

The new code has been tested on my server.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved optimizer stripping function for finalizing model training.

📊 Key Changes

  • Enhanced the strip_optimizer() function to remove additional data from the model file.
  • Now also strips 'wandb_id' along with 'optimizer' and 'training_results' (see the sketch after this summary).

🎯 Purpose & Impact

  • The purpose of this change is to reduce the file size and remove unnecessary information from model weights files after training, making the files cleaner for deployment.
  • Potential impact includes faster model loading and simpler model files, which can be beneficial for users deploying models in production environments. 🚀
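
For illustration, here is a minimal sketch of the change described in the Key Changes above: strip_optimizer() sets 'wandb_id' to None along with 'optimizer' and 'training_results' when finalizing a checkpoint. The body below is a simplified assumption (it presumes the checkpoint stores the model under a 'model' key, as YOLOv5 checkpoints do); the actual implementation lives in the repository's utils/general.py and may differ in its exact signature and extra steps.

    import torch

    def strip_optimizer(f='best.pt', s=''):
        # Finalize a training checkpoint: drop training-only data, optionally save a copy to 's'
        x = torch.load(f, map_location='cpu')
        for k in ('optimizer', 'training_results', 'wandb_id'):  # 'wandb_id' is the key this PR adds to the strip list
            x[k] = None  # set to None rather than delete, so the checkpoint schema stays intact
        for p in x['model'].parameters():
            p.requires_grad = False  # freeze weights; the stripped file is meant for inference or transfer learning
        torch.save(x, s or f)
        print(f"Optimizer stripped from {f}{', saved as ' + s if s else ''}")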

Commit: Update train.py
Fix the bug where the W&B ID is always the same and new runs continue to overwrite the old logging.
Bug report: ultralytics#1851
@github-actions bot left a comment

👋 Hello @TommyZihao, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, update it by running the following, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@TommyZihao (Contributor, Author)

Fixes the bug where the W&B ID is always the same and new runs continue to overwrite the old logging.
Bug report: #1851

The new code has been tested on my server.

@glenn-jocher (Member)

@AyushExel could you take a look at this W&B ID update PR please? Thanks!

@AyushExel (Contributor)

@TommyZihao Thanks for bringing this up. About the error that you're seeing, does that happen only when resuming the old runs? If that's the case, then it's the intended use case. The logger is designed in a way that it'll append the visualizations to the old run if resumed.
But if that happens on running new training runs, it's definitely a bug. Can you please confirm this?

@TommyZihao (Contributor, Author)

> @TommyZihao Thanks for bringing this up. About the error that you're seeing, does that happen only when resuming the old runs? If that's the case, then it's the intended use case. The logger is designed in a way that it'll append the visualizations to the old run if resumed.
> But if that happens on running new training runs, it's definitely a bug. Can you please confirm this?

It indeed happens on new training runs. I tried all kinds of methods to fix this bug, including changing the wandb account, the server, the dataset, and the client computer, and even re-downloading the whole repository and weights, but it happens every time I run train.py. The ID is always the same, and the wandb visualization only has one curve in one color: each new curve is appended to the old one instead of being drawn as a separate curve in another color.
You can see the bug report at #1851.

@TommyZihao (Contributor, Author)

I think we can add another arg, --resume, to let the user choose whether to get a new W&B ID and a new visualization with a new color, or to continue the old run and resume training.

@TommyZihao (Contributor, Author)

My PR works fine in my situation.
I also plan to make a video tutorial series to introduce this awesome tool to Chinese AI developers.
My video channel: https://space.bilibili.com/1900783

@AyushExel (Contributor)

> My PR works fine in my situation.
> I also plan to make a video tutorial series to introduce this awesome tool to Chinese AI developers.
> My video channel: https://space.bilibili.com/1900783

Your channel looks great. Let us know when the video is out.

> I think we can add another arg, --resume, to let the user choose whether to get a new W&B ID and a new visualization with a new color, or to continue the old run and resume training.

We already have a --resume argument that continues the old run. Does this solution work with that? If it does, we can merge this.

@AyushExel (Contributor)

@glenn-jocher This is very strange. This problem doesn't occur in any other case. I tested this manually by initializing multiple runs in a colab and all of them had unique IDs. Can you think of any recent changes that you made regarding the logging feature that might have caused this?

@AyushExel (Contributor)

@glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:

id=ckpt.get('wandb_id') if 'ckpt' in locals() else None

So every time transfer learning is done from the yolov5s.pt model, it picks up the same ID. This problem doesn't occur when training from scratch.
One solution I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.
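
For reference, a quick way to confirm this locally (assuming the downloaded weights are a standard torch checkpoint dict carrying a 'wandb_id' key as described above, and that the script runs from the yolov5 repo root so the pickled model class can be resolved):

    import torch

    # Inspect the pretrained checkpoint that gets downloaded before training.
    # Per the comment above, the v4.0 yolov5s.pt release weights carried a stale
    # W&B run id ('3hdht16b') under the 'wandb_id' key.
    ckpt = torch.load('yolov5s.pt', map_location='cpu')
    print(ckpt.get('wandb_id'))  # any non-None value here is reused by every new training run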

@TommyZihao (Contributor, Author)

> @glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:
>
> id=ckpt.get('wandb_id') if 'ckpt' in locals() else None
>
> So every time transfer learning is done from the yolov5s.pt model, it picks up the same ID. This problem doesn't occur when training from scratch.
> One solution I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.

Exactly, thank you.
I think one graceful solution would be my PR: generating a unique wandb_id every time train.py is run.

@AyushExel (Contributor)

> Exactly, thank you.
> I think one graceful solution would be my PR: generating a unique wandb_id every time train.py is run.

Yes, your PR works for training, but it doesn't take into account the --resume functionality. Currently, if you resume a run, the metrics and visualizations are logged to the run being resumed, but your PR will generate a new ID in every case, even when resuming a previous run. If you can make some changes to incorporate the resume feature, that'd be great.

Commit: Fix the bug of duplicate W&B ID
Fixes the bug of ultralytics#1851: if training starts from yolov5s.pt, the program generates a new unique W&B ID; otherwise the old code path is kept, and the --resume arg can still be used.
@TommyZihao (Contributor, Author)

Fix the bug of duplicate W&B ID:
If training starts from yolov5s.pt, the program will generate a new unique W&B ID.
If it doesn't, the program keeps the old code, and we can still use the --resume arg.

@AyushExel (Contributor)

@TommyZihao this solution will work only for this particular model (yolov5s) and not for others, and any update to those models will cause the code to break, since the ID might be different.

I was thinking that logic which sets wandb_id from the checkpoint only if --resume is set should work.
Something like this:

        wandb_run = wandb.init(config=opt, resume="allow",
                               project='YOLOv5' if opt.project == 'runs/train' else Path(opt.project).stem,
                               name=save_dir.stem,
                               id=ckpt.get('wandb_id') if 'ckpt' in locals() and opt.resume else None)

I have checked that this works for all cases, and there's no need for an extra wandb_id variable.

@glenn-jocher what do you think about this solution?

@glenn-jocher (Member)

> @glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:
>
> id=ckpt.get('wandb_id') if 'ckpt' in locals() else None
>
> So every time transfer learning is done from the yolov5s.pt model, it picks up the same ID. This problem doesn't occur when training from scratch.
> One solution I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.

Oh! This is probably due to the recent v4.0 update, which includes new models that may be the first official models logged in W&B. I wonder if this is also occurring in ultralytics/yolov3 then. The proper fix would be to strip the W&B ID after training fully completes. I can add this here, where the optimizers are similarly stripped from the fully trained checkpoints:

yolov5/train.py

Lines 397 to 398 in 69be8e7

if f.exists():
    strip_optimizer(f)  # strip optimizers

@glenn-jocher changed the title from "Update train.py" to "W&B ID reset on training completion" on Jan 7, 2021
@glenn-jocher (Member)

@TommyZihao @AyushExel ok I think this is all set. The problem was that the new models in yesterday's v4.0 release contained wandb_ids from their training. I've updated this PR to leave train.py alone and to now strip wandb_ids from fully trained models (a fully trained model is not meant to be --resumed, but it can be used as a pretrained model to transfer learn or train another model, in which case a new W&B ID should be generated).

Not included in this PR: I will need to manually strip the W&B IDs from the 4 pretrained models hosted at https://github.com/ultralytics/yolov5/releases/tag/v4.0. I will do this shortly, and then the problem will be solved for all future users.

@TommyZihao to fix your specific model you simply need to set the wandb_id to None, or you can delete your local models and let a fixed model auto-download.
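
For anyone with an already-downloaded model, a minimal sketch of that manual fix, assuming a standard torch checkpoint dict and running from the yolov5 repo root (back up the file first):

    import torch

    # One-off manual fix: clear the stale W&B run id from a local checkpoint
    # so that new trainings start a fresh W&B run.
    ckpt = torch.load('yolov5s.pt', map_location='cpu')
    ckpt['wandb_id'] = None
    torch.save(ckpt, 'yolov5s.pt')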

@glenn-jocher (Member)

@TommyZihao there's actually a built-in function to reset problematic official models. If you git pull after this PR is merged, the following command will do this for all four models.

python detect.py --update

@glenn-jocher linked an issue (#1851) on Jan 7, 2021 that may be closed by this pull request.
@glenn-jocher merged commit 135ec5c into ultralytics:master on Jan 7, 2021.
@glenn-jocher (Member)

Models have been updated, so auto-downloaded models should now show wandb_id=None.

@TommyZihao @AyushExel thank you for spotting this issue and for your contributions! Let us know if you spot any other issues.

KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request on May 12, 2021

* Update train.py

Fix the bug where the W&B ID is always the same and new runs continue to overwrite the old logging.
Bug report: ultralytics#1851

* Fix the bug of duplicate W&B ID

Fixes the bug of ultralytics#1851: if training starts from yolov5s.pt, the program generates a new unique W&B ID; otherwise the old code path is kept, and the --resume arg can still be used.

* Update general.py

* revert train.py changes

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
taicaile pushed a commit to taicaile/yolov5 that referenced this pull request on Oct 12, 2021 (same commits as above).
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request on Aug 26, 2022 (same commits as above).
Successfully merging this pull request may close the issue "W&B id is always the same and continue with the old logging" (#1851).

3 participants