
Attempt fix on dinov2-giant-nodes #268

Merged
merged 1 commit into from
Sep 5, 2024

Conversation

@satyaog (Member) commented Sep 5, 2024

No description provided.

@Delaunay Delaunay changed the base branch from master to staging September 5, 2024 15:28
@Delaunay Delaunay merged commit 57b5cef into mila-iqia:staging Sep 5, 2024
1 of 3 checks passed
@Delaunay (Collaborator) commented Sep 9, 2024

This can't be fixed with a PR only; the code changes I made are also needed because of SLURM.

dinov2-giant-nodes.cn-d003.nolog [stderr] Traceback (most recent call last):
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/tmp/workspace/cuda/results/venv/torch/bin/voir", line 8, in <module>
dinov2-giant-nodes.cn-d003.nolog [stderr]     sys.exit(main())
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/voir/cli.py", line 128, in main
dinov2-giant-nodes.cn-d003.nolog [stderr]     ov(sys.argv[1:] if argv is None else argv)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/voir/phase.py", line 331, in __call__
dinov2-giant-nodes.cn-d003.nolog [stderr]     self._run(*args, **kwargs)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/voir/overseer.py", line 242, in _run
dinov2-giant-nodes.cn-d003.nolog [stderr]     set_value(func())
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/voir/scriptutils.py", line 37, in <lambda>
dinov2-giant-nodes.cn-d003.nolog [stderr]     return lambda: exec(mainsection, glb, glb)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/main.py", line 12, in <module>
dinov2-giant-nodes.cn-d003.nolog [stderr]     main(args)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/train/train.py", line 298, in main
dinov2-giant-nodes.cn-d003.nolog [stderr]     cfg = setup(args)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/utils/config.py", line 69, in setup
dinov2-giant-nodes.cn-d003.nolog [stderr]     default_setup(args)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/utils/config.py", line 50, in default_setup
dinov2-giant-nodes.cn-d003.nolog [stderr]     distributed.enable(overwrite=True)
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/distributed/__init__.py", line 251, in enable
dinov2-giant-nodes.cn-d003.nolog [stderr]     torch_env = _TorchDistributedEnvironment()
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/distributed/__init__.py", line 161, in __init__
dinov2-giant-nodes.cn-d003.nolog [stderr]     return self._set_from_slurm_env()
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/milabench/benchmarks/dinov2/src/dinov2/distributed/__init__.py", line 184, in _set_from_slurm_env
dinov2-giant-nodes.cn-d003.nolog [stderr]     node_count = int(os.environ["SLURM_JOB_NUM_NODES"])
dinov2-giant-nodes.cn-d003.nolog [stderr]   File "/home/mila/d/delaunap/conda/envs/py310/lib/python3.10/os.py", line 680, in __getitem__
dinov2-giant-nodes.cn-d003.nolog [stderr]     raise KeyError(key) from None
dinov2-giant-nodes.cn-d003.nolog [stderr] KeyError: 'SLURM_JOB_NUM_NODES'
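The crash comes from `_set_from_slurm_env` reading `os.environ["SLURM_JOB_NUM_NODES"]` unconditionally, which raises `KeyError` when the process is not launched under SLURM. The actual code change mentioned above is not shown in this PR; the sketch below is a hypothetical defensive pattern (function name and default are illustrative, not from the dinov2 source) showing how the lookup could tolerate a missing variable:

```python
import os

def get_slurm_node_count(default: int = 1) -> int:
    """Return SLURM_JOB_NUM_NODES as an int, falling back to `default`
    when the variable is absent (e.g. outside a SLURM allocation)."""
    value = os.environ.get("SLURM_JOB_NUM_NODES")
    if value is None:
        return default
    return int(value)

# Outside SLURM the variable is unset and the fallback applies;
# inside SLURM the exported value is used.
os.environ.pop("SLURM_JOB_NUM_NODES", None)
print(get_slurm_node_count())        # falls back to 1
os.environ["SLURM_JOB_NUM_NODES"] = "4"
print(get_slurm_node_count())        # reads 4 from the environment
```

Whether falling back silently or failing with a clearer error message is appropriate depends on how the benchmark is expected to be launched.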
