
Loading network weights with wrong state_dict fails silently #1508

Closed
matt3o opened this issue Jul 28, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@matt3o
Contributor

matt3o commented Jul 28, 2023

Describe the bug
I spent a few hours wondering why my network wouldn't work and which of my transforms were broken, only to find out that the loading of the network did not work. MONAILabel uses the "model" key for the state_dict by default, whereas my code uses the "net" key.
When running MONAILabel, however, the network is just initialized randomly and loading the weights fails completely silently.

The easiest solution would be to change the default value of load_strict to True in BasicInferTask.
This would make a lot of sense for beginners. MONAILabel hides away the specifics of loading the network weights; the least I would expect is an error message telling me something may have gone wrong. In this case, however, it is certain that something went wrong, so it would be even better for the program to fail, imo.
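To illustrate the failure mode, here is a minimal PyTorch sketch (not MONAILabel's actual loading code; the checkpoint layout and fallback logic are only assumptions): when the loader looks up the wrong key and loads with strict=False, every key is mismatched, nothing is copied, and no exception is raised.

```python
import torch
import torch.nn as nn

# Stand-in network; the real model and checkpoint layout are assumptions.
net = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Training code saved the weights under the "net" key.
torch.save({"net": net.state_dict()}, "model.pt")

# A loader expecting the "model" key falls back to the raw checkpoint dict.
checkpoint = torch.load("model.pt")
state_dict = checkpoint.get("model", checkpoint)  # wrong key -> whole dict

fresh = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# strict=False: every parameter is "missing", "net" is "unexpected",
# nothing is loaded, and no error is raised.
result = fresh.load_state_dict(state_dict, strict=False)
print(result.missing_keys)     # ['0.weight', '0.bias', '1.weight', '1.bias']
print(result.unexpected_keys)  # ['net']

# strict=True would raise a RuntimeError here and surface the mismatch.
```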

@matt3o matt3o changed the title Loading network weights with wrong state_dict silently fails Loading network weights with wrong state_dict fails silently Jul 28, 2023
@tangy5
Collaborator

tangy5 commented Jul 28, 2023

Thanks for the suggestions.
This could be an enhancement for the future. The key for loading model weights is not always "state_dict", so this can be optimized.
The reason load_strict does not default to True is that, most of the time, users may not have all the weights matched; False enables loading part of the pretrained weights. For this feature, we may need an argument that can adjust the configuration.

Thanks again for the suggestions. If you get time, you could raise a PR for your suggestions; I think they would be useful to all users. Otherwise we will put these items on the TODO list and enhance them in the normal dev plan. Thank you.
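As a rough sketch of such a configurable argument (a hypothetical helper, not MONAILabel's actual API; the "model" key convention is an assumption), strict=False can still report which weights were and were not matched instead of staying silent:

```python
import torch
import torch.nn as nn

def load_weights(model: nn.Module, path: str, load_strict: bool = False):
    """Hypothetical helper: a load_strict flag plus a warning on partial loads."""
    checkpoint = torch.load(path, map_location="cpu")
    state_dict = checkpoint.get("model", checkpoint)  # assumed key convention

    # strict=True raises on any mismatch; strict=False loads whatever matches
    # and reports the rest via the returned NamedTuple.
    result = model.load_state_dict(state_dict, strict=load_strict)
    if result.missing_keys or result.unexpected_keys:
        print(
            f"Warning: weights only partially loaded "
            f"(missing={result.missing_keys}, unexpected={result.unexpected_keys})"
        )
    return model
```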

@matt3o
Contributor Author

matt3o commented Jul 29, 2023

Thanks for the quick response @tangy5. Hm yeah, I get that for many models you want that value to be False. Why not set it to True in BasicInferTask and then to False in every sample app/model where it is actually needed?
For me, as a newbie trying to add a new model, this was a major pain.

I can add a PR; however, I'd need to know which of the two options you prefer:

  1. After loading the checkpoint, go through the keys and make sure the desired one is in the checkpoint's state_dict. If not, print an error message but continue the program. This would only modify BasicInferTask._get_network(). No change in the previous behaviour apart from an error message stating that the weights might not have been loaded correctly, or only partially (see the sketch after this list).
  2. Set the default value of load_strict to True in BasicInferTask and change the value to False in all available sample apps where it is not set so far.
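A rough sketch of what option 1 could look like (illustrative names only; the real BasicInferTask._get_network() and its parameters may differ):

```python
import logging

import torch

logger = logging.getLogger(__name__)

def get_network_weights(network, model_path, model_state_dict="model", load_strict=False):
    """Illustrative sketch of option 1, not the actual MONAILabel code."""
    checkpoint = torch.load(model_path, map_location="cpu")

    if model_state_dict not in checkpoint:
        # Option 1: warn loudly but keep the previous behaviour.
        logger.error(
            f"Key '{model_state_dict}' not found in checkpoint "
            f"(available keys: {list(checkpoint.keys())}); "
            "weights may not have been loaded correctly or only partially."
        )
        state_dict = checkpoint
    else:
        state_dict = checkpoint[model_state_dict]

    network.load_state_dict(state_dict, strict=load_strict)
    return network
```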

@tangy5
Collaborator

tangy5 commented Jul 30, 2023

I think the two items you proposed make sense; feel free to raise a PR. You could add "-s" to your commit command, and our team can edit and test your branch; we have CI/CD integration, unit tests, and env builds. I will also try to help if anything is missing or needs discussion.

Thanks again.

@tangy5 tangy5 added the enhancement New feature or request label Aug 8, 2023
@matt3o
Contributor Author

matt3o commented Aug 29, 2023

I'll close this issue now since the code for it was already merged.

However @tangy5, it would be great if you could merge the second part as well: #1533
The first PR #1521 was merged before it was fully finished, since I was still waiting for feedback. As it is, the dev branch will probably break other models unnecessarily, which is not what I wanted to achieve with the PR.

@matt3o matt3o closed this as completed Aug 29, 2023