JAX 0.3.5 stalls on TPU Pods #10218

Closed
wilson1yan opened this issue Apr 10, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@wilson1yan

wilson1yan commented Apr 10, 2022

This tutorial does not work with the latest JAX release, 0.3.5. Specifically, the code hangs whenever jax.device_count() or jax.local_device_count() is called.

The code prints the following and then stalls.

>>> import jax
>>> jax.local_device_count()
E0410 21:04:57.403257842   30715 f758.cc:310]                no server name supplied in dns URI
E0410 21:04:57.403295131   30715 f872.cc:77]                 channel stack builder failed: {"created":"@1649624697.403284903","description":"the target uri is not valid: dns:","file":"f814.cc","file_line":1090}

This issue does not occur if I install 0.3.4, or if I run the same code (with 0.3.5) on a non-pod instance such as a v2-8.
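For anyone trying to confirm the same stall without losing an interactive session, one option is to run the call in a child process with a timeout. The following is only a sketch: the helper name `probe` and the 60-second default are assumptions, not anything provided by JAX.

```python
import subprocess
import sys


def probe(code, timeout_s=60.0):
    """Run `code` in a fresh Python process and report how it ended.

    `probe` is a hypothetical helper for this thread, not a JAX API.
    Returns a dict with a status of "ok", "error", or "stalled".
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time, which is how the hang shows up.
        return {"status": "stalled", "stdout": "", "returncode": None}
    status = "ok" if result.returncode == 0 else "error"
    return {
        "status": status,
        "stdout": result.stdout.strip(),
        "returncode": result.returncode,
    }
```

Calling `probe("import jax; print(jax.local_device_count())")` on an affected VM would be expected to report `stalled` rather than block forever. On a pod, a timeout may also just mean the other hosts have not started the same program yet, so the probe is probably most useful when launched on every VM at roughly the same time.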

wilson1yan added the bug label Apr 10, 2022
@hawkinsp
Collaborator

Can you please verify you have the same libtpu_nightly and jaxlib versions installed on all VMs in the TPU pod?
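One way to collect that information is a small script run on each VM that prints the installed versions on a single line, so the outputs can be compared side by side. This is a sketch only: the distribution names in `PACKAGES` are assumed to match what pip reports, and `installed_versions` and `format_report` are hypothetical helper names.

```python
from importlib import metadata

# Assumed pip distribution names; adjust if your environment differs.
PACKAGES = ("jax", "jaxlib", "libtpu-nightly")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the versions as one line that is easy to diff across VMs."""
    parts = []
    for name, version in versions.items():
        parts.append(f"{name}=={version}" if version else f"{name}==<not installed>")
    return " ".join(parts)
```

Printing `format_report(installed_versions())` on every host and comparing the lines should make any mismatch obvious. How the script gets executed on all VMs (for example through your usual SSH tooling) is left out here on purpose.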

@wilson1yan
Author

Yes, all VMs have jaxlib==0.3.5 and libtpu-nightly==0.1.dev20220407

@hawkinsp
Collaborator

No updates yet, but we can reproduce the problem and are looking into it.

@yashk2810
Collaborator

We just released jax 0.3.6 with a new libtpu to fix this issue. Please upgrade to jax 0.3.6!
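To check whether an environment already has a release that includes the fix, a version string can be compared against 0.3.6. The sketch below assumes plain `major.minor.patch` version strings and ignores any suffix such as `.dev...`; `release_tuple` and `has_fix` are hypothetical names, and the check says nothing about which libtpu build is installed.

```python
import re

# First jax release reported in this thread to contain the fix.
FIXED_IN = (0, 3, 6)


def release_tuple(version):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """Return True if `version` is 0.3.6 or newer."""
    return release_tuple(version) >= FIXED_IN
```

With jax installed, `has_fix(jax.__version__)` gives a quick yes or no. Comparing integer tuples rather than raw strings matters here, since a string comparison would order "0.3.10" before "0.3.6".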
