Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom model training fails, need to downgrade torch (and setuptools) #15

Open
glemoine62 opened this issue Nov 2, 2022 · 0 comments
Open

Comments

@glemoine62
Copy link

glemoine62 commented Nov 2, 2022

Hi,

I am using the deepquestai/deepstack:gpu-2022.01.1 container to do custom training. It comes with torch for cuda 11.3 but train.py fails after initiation (see error below). This is resolved when I downgrade to torch for cuda 11.0 (pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html as per the collab notebook).

docker run --gpus all -it --rm -v /home/eouser/deepstack:/deepstack/code -w /deepstack/code/deepstack-trainer deepquestai/deepstack_updated:gpu python3 train.py --dataset-path /deepstack/code/data
Traceback (most recent call last):
File "train.py", line 530, in
train(hyp, opt, device, tb_writer, wandb)
File "train.py", line 90, in train
model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device) # create
File "/deepstack/code/deepstack-trainer/models/yolo.py", line 96, in init
self._initialize_biases() # only run once
File "/deepstack/code/deepstack-trainer/models/yolo.py", line 151, in _initialize_biases
b[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I first need to downgrade setuptools inside the container, btw, because otherwise it throws:

Traceback (most recent call last):
File "train.py", line 21, in
from torch.utils.tensorboard import SummaryWriter
File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/init.py", line 4, in
LooseVersion = distutils.version.LooseVersion
AttributeError: module 'setuptools._distutils' has no attribute 'version'

(resolved with: pip install setuptools==59.5.0)

I am now happily training with the revised setup, so nothing too urgent, but maybe worth checking out.

Thx for this wonderful framework!

Guido

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant