W&B ID reset on training completion #1852

TommyZihao · 2021-01-06T07:08:28Z

This PR fixes a bug where the W&B run ID is always the same, so every new run keeps overwriting the logging of the old one.
Bug report: #1851

The new code has been tested on my server.

🛠️ PR Summary

🌟 Summary

Improved optimizer stripping function for finalizing model training.

📊 Key Changes

Enhanced the strip_optimizer() function to remove additional data from the model file.
Now also strips 'wandb_id' along with 'optimizer' and 'training_results'.

🎯 Purpose & Impact

The purpose of this change is to reduce the file size and remove unnecessary information from model weights files after training, making the files cleaner for deployment.
Potential impact includes faster model loading and simpler model files, which can be beneficial for users deploying models in production environments. 🚀

github-actions

👋 Hello @TommyZihao, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master update by running the following, replacing 'feature' with the name of your local branch:

git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f

✅ Verify all Continuous Integration (CI) checks are passing.
✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

TommyZihao · 2021-01-06T07:10:31Z

This PR fixes a bug where the W&B run ID is always the same, so every new run keeps overwriting the logging of the old one.
Bug report: #1851

The new code has been tested on my server.

glenn-jocher · 2021-01-06T07:59:10Z

@AyushExel could you take a look at this W&B ID update PR please? Thanks!

AyushExel · 2021-01-06T08:15:09Z

@TommyZihao Thanks for bringing this up. About the error that you're seeing, does that happen only when resuming the old runs? If that's the case, then it's the intended use case. The logger is designed in a way that it'll append the visualizations to the old run if resumed.
But if that happens on running new training runs, it's definitely a bug. Can you please confirm this?

TommyZihao · 2021-01-06T10:04:36Z

> @TommyZihao Thanks for bringing this up. About the error that you're seeing, does that happen only when resuming the old runs? If that's the case, then it's the intended use case. The logger is designed in a way that it'll append the visualizations to the old run if resumed.
> But if that happens on running new training runs, it's definitely a bug. Can you please confirm this?

It does happen on new training runs. I tried many ways to fix it, including changing the wandb account, the server, the dataset, and the client computer, and even re-downloading the whole repository and weights. It still happens every time I run train.py. The ID is always the same, and the wandb visualization shows only one curve in one color: each new curve is appended to the old one instead of being drawn as a separate curve in another color.
See bug report #1851.

TommyZihao · 2021-01-06T10:06:37Z

I think we could add another argument, --resume, to let the user choose whether to get a new W&B ID (a new visualization in a new color) or to continue the old run and resume training.

TommyZihao · 2021-01-06T10:08:48Z

My PR works fine in my situation.
I also plan to make a video tutorial series to introduce this awesome tool to Chinese AI developers.
My video channel: https://space.bilibili.com/1900783

AyushExel · 2021-01-06T14:07:56Z

> My PR works fine in my situation.
> I also plan to make a video tutorial series to introduce this awesome tool to Chinese AI developers.
> My video channel: https://space.bilibili.com/1900783

Your channel looks great. Let us know when the video is out.

> I think we can add another aug --resume, let the user choose whether get a new W&B ID to make a new visualization with a new color, or continue with the old mission and resume training.

We already have a --resume argument that continues the old run. Does this solution work with that? If it does we can merge this.

AyushExel · 2021-01-06T14:20:08Z

@glenn-jocher This is very strange. This problem doesn't occur in any other case. I tested this manually by initializing multiple runs in a colab and all of them had unique IDs. Can you think of any recent changes that you made regarding the logging feature that might have caused this?

AyushExel · 2021-01-06T14:42:46Z

@glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:

id=ckpt.get('wandb_id') if 'ckpt' in locals() else None

So every time transfer learning is done on yolov5s.pt model, it'll detect the same id. This problem doesn't occur when training from scratch.
One solution that I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.
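
To illustrate the behaviour described above, here is a minimal sketch that needs neither wandb nor torch. The dictionary stands in for a loaded checkpoint, and `pick_run_id` is a hypothetical helper that mirrors the quoted `id=` expression; only the `wandb_id` key and the value 3hdht16b come from this thread.

```python
from typing import Optional


def pick_run_id(ckpt: Optional[dict]) -> Optional[str]:
    """Hypothetical mirror of the quoted expression: reuse the checkpoint's ID whenever one exists."""
    return ckpt.get('wandb_id') if ckpt is not None else None


# Placeholder for the downloaded pretrained checkpoint (real checkpoints hold tensors, not strings)
pretrained = {'model': 'yolov5s weights (placeholder)', 'wandb_id': '3hdht16b'}

# Two unrelated transfer-learning runs that both start from the same pretrained file
first_run = pick_run_id(pretrained)
second_run = pick_run_id(pretrained)

# Training from scratch: no checkpoint is loaded, so no ID is inherited
scratch_run = pick_run_id(None)

print(first_run, second_run, scratch_run)
```

Both transfer-learning runs end up with the ID stored in the checkpoint, which is why they are logged into the same W&B run, while the from-scratch run gets no inherited ID.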

TommyZihao · 2021-01-06T15:16:22Z

> @glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:
> id=ckpt.get('wandb_id') if 'ckpt' in locals() else None
> So every time transfer learning is done on yolov5s.pt model, it'll detect the same id. This problem doesn't occur when training from scratch.
> One solution that I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.

Exactly, thank you.
I think my PR would be one graceful solution: generate a unique wandb_id every time train.py is run.

AyushExel · 2021-01-06T15:31:17Z

> Exactly, thank you.
> I think one graceful solution would be my PR. Generating unique wand_id every time when run train.py.

Yes, your PR works for training, but it doesn't take the --resume functionality into account. Currently, if you resume a run, the metrics and visualizations are logged to the run being resumed, but your PR will generate a new ID in every case, even when resuming a previous run. If you can make some changes to incorporate the resume feature, that'd be great.

TommyZihao · 2021-01-06T16:26:23Z

Fix the bug of duplicate W&B IDs.
If training starts from yolov5s.pt, the program generates a new unique W&B ID.
Otherwise the program keeps the old behavior, so the --resume argument can still be used.

AyushExel · 2021-01-06T18:02:42Z

@TommyZihao this solution will work for this particular model only (yolov5s) and not for the others. Any update to those models would also break the code, as the ID might be different.

I was thinking that logic which sets wandb_id from the checkpoint only if --resume is set should work.
Something like this:

        wandb_run = wandb.init(config=opt, resume="allow",
                               project='YOLOv5' if opt.project == 'runs/train' else Path(opt.project).stem,
                               name=save_dir.stem,
                               id=ckpt.get('wandb_id') if 'ckpt' in locals() and opt.resume else None)

I have checked and this works for all cases and there's no need for an extra wandb_ID variable.

@glenn-jocher what do you think about this solution?
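
The decision inside that `id=` argument can be isolated and checked without wandb. The following is only a sketch: `resolve_wandb_id` is a hypothetical helper name, and the dictionary stands in for a loaded checkpoint. Returning `None` leaves it to `wandb.init` to generate a fresh run ID.

```python
from typing import Optional


def resolve_wandb_id(ckpt: Optional[dict], resume: bool) -> Optional[str]:
    """Hypothetical helper: reuse the checkpoint's W&B ID only when the user asked to resume."""
    if ckpt is not None and resume:
        return ckpt.get('wandb_id')
    return None  # new run: let W&B pick a fresh ID


ckpt = {'wandb_id': '3hdht16b'}  # placeholder checkpoint carrying an ID from earlier training

resumed_id = resolve_wandb_id(ckpt, resume=True)    # resuming: keep logging to the old run
transfer_id = resolve_wandb_id(ckpt, resume=False)  # transfer learning: start a new run
scratch_id = resolve_wandb_id(None, resume=False)   # from scratch: nothing to inherit

print(resumed_id, transfer_id, scratch_id)
```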

glenn-jocher · 2021-01-07T00:00:45Z

> @glenn-jocher I found the cause for this error. This happens because the yolov5s.pt model that gets downloaded before training has wandb_id set to 3hdht16b, and the code automatically sets the wandb id if found in the checkpoint:
> id=ckpt.get('wandb_id') if 'ckpt' in locals() else None
> So every time transfer learning is done on yolov5s.pt model, it'll detect the same id. This problem doesn't occur when training from scratch.
> One solution that I can think of is to remove wandb_id from the models uploaded on torch hub. Let me know if you have any other workarounds.

Oh! This is probably due to the recent v4.0 update, which includes new models that may be the first official models logged in W&B. I wonder if this is also occurring in ultralytics/yolov3 then. The proper fix would be to strip the W&B ID after training fully completes. I can add this here, where the optimizers are similarly stripped from the fully trained checkpoints.

yolov5/train.py, lines 397 to 398 in 69be8e7:

    if f.exists():
        strip_optimizer(f)  # strip optimizers

glenn-jocher · 2021-01-07T00:23:32Z

@TommyZihao @AyushExel ok I think this is all set. The problem was that the new models in the v4.0 release yesterday contained wandb_ids from their training. I've updated this PR to leave train.py alone, and instead to strip wandb_ids from fully trained models (a fully trained model is not meant to be --resumed, but it can be used as a pretrained model to transfer learn or train another model, in which case a new W&B ID should be generated).
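
As a rough illustration of the stripping step, the sketch below clears the three keys named in the PR summary ('optimizer', 'training_results', 'wandb_id') from a checkpoint dictionary. It is an assumption-laden stand-in, not the repository's strip_optimizer(): the real function works on a saved .pt file, whereas `strip_training_state` is a hypothetical helper that operates on a plain dict so it runs without torch.

```python
def strip_training_state(ckpt: dict) -> dict:
    """Hypothetical helper: return a copy of the checkpoint with training-only entries cleared."""
    stripped = dict(ckpt)  # shallow copy so the caller's dict is left untouched
    for key in ('optimizer', 'training_results', 'wandb_id'):
        stripped[key] = None
    return stripped


# Placeholder checkpoint; a real one would hold a model object and optimizer state
trained = {
    'model': 'weights (placeholder)',
    'optimizer': {'lr': 0.01},
    'training_results': 'epoch log (placeholder)',
    'wandb_id': '3hdht16b',
}

final = strip_training_state(trained)
print(final['wandb_id'], final['model'])
```

With `wandb_id` cleared, a later run that starts from this file has no ID to inherit, so it is logged as a new W&B run.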

Not included in this PR: I will need to manually strip the W&B IDs from the 4 pretrained models hosted at https://github.com/ultralytics/yolov5/releases/tag/v4.0. I will do this shortly, and then the problem will be solved for all future users.

@TommyZihao to fix your specific model you simply need to set its wandb_id to None, or you can delete your local models and let a fixed model autodownload.

glenn-jocher · 2021-01-07T00:25:06Z

@TommyZihao there's actually a built-in function to reset problematic official models. If you git pull after this PR is merged, the following command will do this for all four models.

python detect.py --update

glenn-jocher · 2021-01-07T00:35:20Z

Models have been updated now, so autodownloaded models should now show wandb_id=None.

@TommyZihao @AyushExel thank you for spotting this issue and for your contributions! Let us know if you spot any other issues.

* Update train.py
  Fix the bug of always the same W&B ID and continue overwrite with the old logging. BUG report ultralytics#1851
* Fix the bug of duplicate W&B ID
  fix the bug of ultralytics#1851 If we had trained on yolov5s.pt, the program will generate a new unique W&B ID. If we hadn't, the program will keep the old code, we can still use --resume aug.
* Update general.py
* revert train.py changes

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

Update train.py

726206a

Fix the bug of always the same W&B ID and continue overwrite with the old logging. BUG report ultralytics#1851

github-actions bot reviewed Jan 6, 2021

TommyZihao mentioned this pull request Jan 6, 2021

W&B id is always the same and continue with the old logging. #1851

Closed

Fix the bug of duplicate W&B ID

f6d6119

fix the bug of ultralytics#1851 If we had trained on yolov5s.pt, the program will generate a new unique W&B ID. If we hadn't, the program will keep the old code, we can still use --resume aug.

Update general.py

51c31c2

glenn-jocher changed the title ~~Update train.py~~ W&B ID reset on training completion Jan 7, 2021

revert train.py changes

8c2a172

glenn-jocher linked an issue Jan 7, 2021 that may be closed by this pull request

W&B id is always the same and continue with the old logging. #1851

Closed

glenn-jocher merged commit 135ec5c into ultralytics:master Jan 7, 2021

This was referenced Jan 7, 2021

Broken pipe #1859

Closed

wandb: ERROR Error while calling W&B API: Error 1062: Duplicate entry '189160-gbp6y2en' for key 'PRIMARY' (<Response [409]>) ultralytics/yolov3#1650

Closed

glenn-jocher mentioned this pull request Jan 14, 2021

runs not logging separately in wandb.ai #1937

Closed

glenn-jocher mentioned this pull request Apr 11, 2021

YOLOv5 v5.0 Release #2762

Merged

glenn-jocher mentioned this pull request Apr 12, 2021

YOLOv5 v5.0 release compatibility update for YOLOv3 ultralytics/yolov3#1737

Merged
