
Validations for 2.2 Release. Cherry Pick Validation and Manual #4855

Closed · 11 tasks done
atalman opened this issue on Jan 4, 2024 · 3 comments


Comments

atalman converted this from a draft issue on Jan 4, 2024
atalman (Contributor, Author) commented on Jan 18, 2024

Manual Validations

atalman changed the title from "Cherry Pick Validation" to "Validations for 2.2 Release. Cherry Pick Validation and Manual" on Jan 18, 2024
huydhn (Contributor) commented on Jan 18, 2024

For pytorch/pytorch#115193, the issue with launching the distributed DeviceMesh API, I followed https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md and ran the DTensor example on a devgpu, and it works fine:

$ torchrun --standalone --nnodes=1 --nproc-per-node=4 dtensor_example.py

[2024-01-18 14:08:13,419] torch.distributed.run: [WARNING]
[2024-01-18 14:08:13,419] torch.distributed.run: [WARNING] *****************************************
[2024-01-18 14:08:13,419] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-01-18 14:08:13,419] torch.distributed.run: [WARNING] *****************************************
NCCL version 2.19.3+cuda12.3
...
DTensor(local_tensor=tensor([[-0.9938,  1.6568, -0.0712,  ..., -0.7047,  0.1956,  0.7011],
        [ 0.0633, -0.0818,  0.0865,  ...,  0.6208, -1.3616,  0.4402],
        [ 0.7410,  0.3713, -1.0218,  ..., -0.6000, -0.3061,  0.0240],
        ...,
        [-0.2041, -0.4914, -1.4949,  ..., -0.6163, -0.6493,  0.5180],
        [ 2.5286, -0.3243,  0.5991,  ...,  0.7855,  0.3508, -0.1411],
        [ 1.6220,  1.5745,  0.4140,  ...,  0.6092, -0.7156,  1.0645]],
       device='cuda:0'), device_mesh=DeviceMesh([0, 1, 2, 3]), placements=(Shard(dim=0),))

For the purposes of the 2.2.0 release, I think that is good enough.
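
The dtensor_example.py script itself isn't attached to the issue; a minimal sketch along the lines of the README example referenced above (the file name, tensor shape, and exact imports here are assumptions, not the script actually used) would look like:

# dtensor_example.py -- minimal sketch, assuming the example from the DTensor README;
# the exact script used for the validation is not shown in the issue.
import os

import torch
from torch.distributed._tensor import Shard, distribute_tensor, init_device_mesh

# torchrun sets WORLD_SIZE; --nproc-per-node=4 launches one rank per GPU.
world_size = int(os.environ["WORLD_SIZE"])

# Build a 1-D device mesh over all local GPUs (this also initializes the
# default process group if it is not already initialized).
mesh = init_device_mesh("cuda", (world_size,))

# Create a large tensor and shard it row-wise (dim 0) across the mesh.
big_tensor = torch.randn(100_000, 88)
my_dtensor = distribute_tensor(big_tensor, mesh, [Shard(dim=0)])

# Each rank prints its DTensor; rank 0's repr matches the output above.
print(my_dtensor)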

atalman (Contributor, Author) commented on Jan 31, 2024

Post-release Poetry test:

curl -s https://pypi.org/pypi/torch/2.2.0/json | jq '.info.requires_dist'
[
  "filelock",
  "typing-extensions >=4.8.0",
  "sympy",
  "networkx",
  "jinja2",
  "fsspec",
  "nvidia-cuda-nvrtc-cu12 ==12.1.105 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-runtime-cu12 ==12.1.105 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-cupti-cu12 ==12.1.105 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cudnn-cu12 ==8.9.2.26 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cublas-cu12 ==12.1.3.1 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cufft-cu12 ==11.0.2.54 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-curand-cu12 ==10.3.2.106 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusolver-cu12 ==11.4.5.107 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusparse-cu12 ==12.1.0.106 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nccl-cu12 ==2.19.3 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nvtx-cu12 ==12.1.105 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "triton ==2.2.0 ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "opt-einsum >=3.3 ; extra == 'opt-einsum'",
  "optree >=0.9.1 ; extra == 'optree'"
]
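
For reference, the same metadata check can be scripted; below is a minimal sketch (not part of the original validation) that fetches the torch 2.2.0 metadata from PyPI and spot-checks a few of the pinned Linux x86_64 dependencies listed above. The choice of which pins to assert on is an assumption.

# Minimal sketch (not from the issue): fetch the torch 2.2.0 metadata from PyPI
# and spot-check a few of the pinned dependencies shown in the jq output above.
import json
import urllib.request

URL = "https://pypi.org/pypi/torch/2.2.0/json"

with urllib.request.urlopen(URL) as resp:
    requires_dist = json.load(resp)["info"]["requires_dist"]

# Pins that matter most for the CUDA 12.1 wheels (assumed spot-check set).
expected_prefixes = [
    "nvidia-nccl-cu12 ==2.19.3",
    "nvidia-cudnn-cu12 ==8.9.2.26",
    "triton ==2.2.0",
]
for prefix in expected_prefixes:
    assert any(req.startswith(prefix) for req in requires_dist), prefix

print("\n".join(requires_dist))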

atalman closed this as completed on Jan 31, 2024
atalman moved this from Validation Required to Done in Release Milestone Review on Jan 31, 2024