This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

TypeError: optimizers must be either a single optimizer or a list of optimizers. #18

Closed
guerriep opened this issue Aug 24, 2020 · 20 comments

Comments

@guerriep

guerriep commented Aug 24, 2020

Hello,

I'm trying to run main_swav.py with the following command:

python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --images_path=<path to data directory> --train_annotations_path <path to data file> --epochs 400 --base_lr 0.6 --final_lr 0.0006 --warmup_epochs 0 --batch_size 32 --size_crops 224 96 --nmb_crops 2 6 --min_scale_crops 0.14 0.05 --max_scale_crops 1. 0.14 --use_fp16 true --freeze_prototypes_niters 5005 --queue_length 3840 --epoch_queue_starts 15

Some of those parameters have been added to accommodate our data. The only changes I have made to the code are minor changes to the dataset and additional/changed arguments. When I run this command I get the following error:

Traceback (most recent call last):
  File "main_swav.py", line 380, in <module>
    main()
  File "main_swav.py", line 189, in main
    model, optimizer = apex.amp.initialize(model, optimizer, opt_level="O1")
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 158, in _initialize
    raise TypeError("optimizers must be either a single optimizer or a list of optimizers.")
TypeError: optimizers must be either a single optimizer or a list of optimizers.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'main_swav.py', '--local_rank=0', '--images_path=/data/computer_vision_projects/rare_planes/classification_data/images/', '--train_annotations_path', '/data/computer_vision_projects/rare_planes/classification_data/annotations/instances_train_role_mislabel_category_id_033_chipped.json', '--epochs', '400', '--base_lr', '0.6', '--final_lr', '0.0006', '--warmup_epochs', '0', '--batch_size', '32', '--size_crops', '224', '96', '--nmb_crops', '2', '6', '--min_scale_crops', '0.14', '0.05', '--max_scale_crops', '1.', '0.14', '--use_fp16', 'true', '--freeze_prototypes_niters', '5005', '--queue_length', '3840', '--epoch_queue_starts', '15']' returned non-zero exit status 1.
make: *** [Makefile:69: train-rare-planes] Error 1

Immediately before the line that throws the error I placed a couple of print statements:

print("type(OPTIMIZER)", type(optimizer))
print("OPTIMIZER", optimizer)

The output from those is:
type(OPTIMIZER) <class 'apex.parallel.LARC.LARC'>
OPTIMIZER SGD (
Parameter Group 0
    dampening: 0
    lr: 0.6
    momentum: 0.9
    nesterov: False
    weight_decay: 1e-06
)

Here are some version numbers I'm using:
  • Python 3.6.9 :: Anaconda, Inc.
  • PyTorch == 1.5.0a0+8f84ded
  • torchvision == 0.6.0a0
  • CUDA == 10.2
  • apex == 0.1

Any ideas why I would be seeing this error? Thanks in advance!

@mathildecaron31
Contributor

mathildecaron31 commented Sep 7, 2020

Hi @guerriep,
It seems that the amp initialize method does not recognize your optimizer, which is strange since it has the right type (LARC). Can you add print statements in your apex library around this line https://github.com/NVIDIA/apex/blob/4ef930c1c884fdca5f472ab2ce7cb9b505d26c1a/apex/amp/_initialize.py#L149 in order to understand why ('LARC' in globals() and isinstance(optimizers, LARC)) does not return True?

@mathildecaron31
Contributor

No activity, so I am closing the issue. Feel free to re-open if you need further assistance.

@John-P

John-P commented Oct 12, 2020

I have encountered the same issue. I added prints in apex:

type(optimizers) <class 'apex.parallel.LARC.LARC'>
else hit

That line is also slightly different for me (lines 148-160):

    print("type(optimizers)", type(optimizers))
    if isinstance(optimizers, torch.optim.Optimizer) or ('LARC' in sys.modules and isinstance(optimizers, LARC)):
        print("isinstance LARC True")
        optimizers = [optimizers]
    elif optimizers is None:
        optimizers = []
    elif isinstance(optimizers, list):
        optimizers_was_list = True
        check_optimizers(optimizers)
    else:
        print("else hit")
        check_optimizers([optimizers])
        raise TypeError("optimizers must be either a single optimizer or a list of optimizers.")

I have also checked: isinstance(optimizers, LARC) does indeed return True, but 'LARC' in sys.modules is False.
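A minimal sketch of why that membership test can fail (assuming it is a plain key lookup, as in the excerpt above): Python registers modules in sys.modules under their full dotted path, so importing apex.parallel.LARC never creates a bare 'LARC' key.

```python
import sys
import types

# Register a stand-in module the way Python itself does on import:
# sys.modules is keyed by the full dotted path, never the bare leaf name.
sys.modules["apex.parallel.LARC"] = types.ModuleType("apex.parallel.LARC")

print("LARC" in sys.modules)                 # False: bare name is absent
print("apex.parallel.LARC" in sys.modules)   # True: full path is the key
```

So the guard only passes when something has put a module literally named LARC into sys.modules, which a normal `from apex.parallel.LARC import LARC` does not do.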

Version numbers:

  • python 3.8.3
  • apex 0.1
  • pytorch 1.6.0
  • CUDA 11

@kdexd

kdexd commented Oct 18, 2020

+1 facing the same issue, following this thread.

@kdexd

kdexd commented Oct 20, 2020

NVIDIA/apex#978 is probably related.

@GuoleiSun
Copy link

+1, facing the same issue. Any ideas how to solve it?

@John-P

John-P commented Dec 16, 2020

@mathildecaron31 Is it possible to re-open this issue as it appears to be affecting a number of people and is unresolved? Would you also be able to share version numbers for libraries in order to re-create your environment?

@mathildecaron31
Contributor

I tested this code with:

  • python 3.6.6
  • apex commit: 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
  • torch 1.4.0
  • cuda 10.1

@mathildecaron31
Contributor

Here is how I installed apex:

git clone "https://github.com/NVIDIA/apex"
cd apex
git checkout 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
python setup.py install --cuda_ext

python -c 'import apex; from apex.parallel import LARC' # should run and return nothing
python -c 'import apex; from apex.parallel import SyncBatchNorm; print(SyncBatchNorm.__module__)' # should run and return apex.parallel.optimized_sync_batchnorm

Hope that helps

@John-P

John-P commented Dec 23, 2020

I have been able to get it to run with these specific versions now. I am still a bit curious as to why it does not work with newer versions of apex.

For others trying to replicate, these are my steps using anaconda and pip:

conda create --name=swav python=3.6.6
# CUDA 10.1 with torchvision
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
# NVCC for CUDA 10.1
conda install -c conda-forge cudatoolkit-dev=10.1.243 pandas opencv
# Pip should return a path with env name in it
which pip
# Apex commit 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0 with cuda extensions enabled
pip install git+git://github.com/NVIDIA/apex.git@4a1aa97e31ca87514e17c3cd3bbc03f4204579d0 --install-option="--cuda_ext"

@kaushal-py

Here is how I installed apex:

git clone "https://github.com/NVIDIA/apex"
cd apex
git checkout 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
python setup.py install --cuda_ext

python -c 'import apex; from apex.parallel import LARC' # should run and return nothing
python -c 'import apex; from apex.parallel import SyncBatchNorm; print(SyncBatchNorm.__module__)' # should run and return apex.parallel.optimized_sync_batchnorm

Hope that helps

Thanks a lot! This worked for me. The specific apex version seems to be an important dependency for the code to run. It would be helpful if this could be added to the README.

@mathildecaron31
Contributor

027a54a

@DreamMemory001

[quotes mathildecaron31's apex install instructions and kaushal-py's reply above]

That has not fixed my bug. I am getting: AttributeError: module 'torch.distributed' has no attribute 'deprecated'. I have no other ideas — has anyone checked this? Please help me, thank you.

@ayl

ayl commented Mar 14, 2021

To build on @John-P's work:

For building apex, make sure you have gcc > 5 and < 8. For example, the NVIDIA Docker container nvidia/cuda:10.1-base (Ubuntu 18.04) has gcc 7.5, and I was able to build apex successfully there.

conda create --name=swav python=3.6.6
conda activate swav
# CUDA 10.1 with torchvision
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
# NVCC for CUDA 10.1
conda install -c conda-forge cudatoolkit-dev=10.1.243  pandas opencv numpy scipy
# Pip should return a path with env name in it
which pip
# Apex commit 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0 with cuda extensions enabled
pip install git+git://github.com/NVIDIA/apex.git@4a1aa97e31ca87514e17c3cd3bbc03f4204579d0 --install-option="--cuda_ext"
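To make the gcc constraint above checkable, here is a small sketch; the 6-7 range comes from the comment above, and the helper name is made up:

```shell
# Hypothetical helper: warn when the gcc major version is outside
# the > 5 and < 8 range reported to work for building apex.
check_gcc_ok() {
  major="$1"
  if [ "$major" -gt 5 ] && [ "$major" -lt 8 ]; then
    echo "gcc $major: OK for apex build"
  else
    echo "gcc $major: outside the 6-7 range, build may fail"
  fi
}

check_gcc_ok 7
# On a real machine: check_gcc_ok "$(gcc -dumpversion | cut -d. -f1)"
```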

@escorciav

The comment above did it for me, but the last line should be:
pip install --upgrade-strategy only-if-needed git+https://github.com/NVIDIA/apex.git@4a1aa97e31ca87514e17c3cd3bbc03f4204579d0 --install-option="--cuda_ext"

Compiled on a cluster without sudo, only Ubuntu 18.04 and the NVIDIA drivers 😉

@ClaudiaShu

[quotes mathildecaron31's apex install instructions above]

Hi, can I compile apex with CUDA 11.1?
I got this error when compiling:

torch.__version__  =  1.8.1+cu111
setup.py:46: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
  warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
from /usr/bin

Traceback (most recent call last):
  File "setup.py", line 106, in <module>
    check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
  File "setup.py", line 76, in check_cuda_torch_binary_vs_bare_metal
    raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.1.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

I'm working on an RTX 3090, and it only supports CUDA 11 or above.
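One thing worth trying before commenting out the check: the log above shows nvcc being picked up from /usr/bin at release 10.1, while the PyTorch wheel was built with CUDA 11.1. Pointing the build at a matching toolkit may resolve the mismatch. A sketch (the install path is an assumption; adjust to your system):

```shell
# Assumed toolkit location; adjust to wherever CUDA 11.1 is installed.
export CUDA_HOME=/usr/local/cuda-11.1
export PATH="$CUDA_HOME/bin:$PATH"

# Confirm which nvcc the apex build would now see.
if [ -x "$CUDA_HOME/bin/nvcc" ]; then
  "$CUDA_HOME/bin/nvcc" --version
else
  echo "nvcc not found at $CUDA_HOME/bin/nvcc"
fi
```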

@zcalhoun

I found that Mathilde's suggestion required a slight change to make the dependencies play nicely together: use pip to install the checked-out version of apex.

git clone "https://github.com/NVIDIA/apex"
cd apex
git checkout 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
pip install -v --disable-pip-version-check --no-cache-dir ./

python -c 'import apex; from apex.parallel import LARC' # should run and return nothing
python -c 'import apex; from apex.parallel import SyncBatchNorm; print(SyncBatchNorm.__module__)' # should run and return apex.parallel.optimized_sync_batchnorm

@yousuf907

If anyone is getting an error for the "from torch._six import container_abcs" line (line 14 in apex's "_amp_state.py"), you can replace that line with "import collections.abc as container_abcs" and it should work.
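The replacement can also be written as a guarded import, so the same file works on both old and new PyTorch. A sketch:

```python
# Fallback shim: torch._six was removed in newer PyTorch releases,
# but it only re-exported the stdlib collections.abc module anyway.
try:
    from torch._six import container_abcs  # older PyTorch
except ImportError:
    import collections.abc as container_abcs  # newer PyTorch, or no torch

# Downstream code keeps using container_abcs unchanged, e.g.:
print(issubclass(dict, container_abcs.Mapping))  # True
```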

@GSusan

GSusan commented Jun 3, 2023 via email

@Taoww21480

If you encounter a "from torch._six import string_classes" error (line 2 in apex's "_initialize.py"), comment out that line and replace it with a simple assignment:

replace "from torch._six import string_classes" with "string_classes = str".
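Same pattern as the container_abcs fix above: a guarded import keeps the file compatible with both old and new PyTorch. A sketch:

```python
# torch._six.string_classes was just `str` on Python 3; newer PyTorch
# removed the module, so fall back to str directly.
try:
    from torch._six import string_classes  # older PyTorch
except ImportError:
    string_classes = str

print(isinstance("abc", string_classes))  # True
```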
