Support for defining a global coordinator pod in the JobSet spec #617

Closed
Tracked by #523
danielvegamyhre opened this issue Jul 15, 2024 · 1 comment

@danielvegamyhre
Contributor

What would you like to be added:
Support for defining a global coordinator pod in the JobSet spec.

Why is this needed:
We need to be able to build automation on top of JobSet that knows the stable network endpoint of the pod assigned as the global coordinator for distributed ML training / HPC workloads.
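As a rough illustration of what such automation might compute, here is a minimal sketch. It assumes the coordinator pod's stable endpoint follows the pod hostname pattern `<jobset-name>-<replicated-job>-<job-index>-<pod-index>.<subdomain>`, with the subdomain defaulting to the JobSet name. The function name `coordinator_endpoint` and its parameters are hypothetical and not part of JobSet.

```python
from typing import Optional


def coordinator_endpoint(
    jobset_name: str,
    replicated_job: str,
    job_index: int = 0,
    pod_index: int = 0,
    subdomain: Optional[str] = None,
) -> str:
    """Build the assumed stable DNS name of the coordinator pod.

    Hypothetical helper: the naming pattern below is an assumption for
    illustration, not something stated in this issue.
    """
    # Assumption: the headless service subdomain defaults to the JobSet name.
    effective_subdomain = subdomain or jobset_name
    return f"{jobset_name}-{replicated_job}-{job_index}-{pod_index}.{effective_subdomain}"


if __name__ == "__main__":
    # Example: pod 0 of job 0 in a replicated job named "driver".
    print(coordinator_endpoint("train", "driver"))
```

With a helper like this, tooling built on top of JobSet could pass the endpoint to every worker (for example as an environment variable) instead of hard-coding it per workload.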

@danielvegamyhre
Contributor Author

Adding coordinator field and controller changes: #618

Adding validation: #627

Adding runnable example: #628
