Issues: intelligent-machine-learning/dlrover
Issues list
The unit test case test_orphan_workers takes too long to execute.
dev issue
for development
report
#1381
opened Dec 8, 2024 by
workingloong
Support HTTP for master-worker communication.
enhancement
New feature or request
todo
issue or pr with 'todo' will ignore expiration
Will flash checkpoint support fully parallel save in Megatron Core 0.7+?
question
Further information is requested
#1363
opened Nov 28, 2024 by
leondada
DLRover should adapt to new accelerator types and implement a script similar to Nvidia_gpu.py
question
Further information is requested
#1338
opened Nov 15, 2024 by
lulu-0126
client.connect(path) error when saving checkpoint
investigating
report
#1337
opened Nov 15, 2024 by
atomrun39
AttributeError: module 'collections' has no attribute 'Sequence'
investigating
#1332
opened Nov 12, 2024 by
linzhidao1010
Can DLRover be applied to diffusion transformer training, and combined with DeepSpeed?
question
Further information is requested
#1314
opened Oct 29, 2024 by
TomSuen
Add balance loss in atorch moe example
Hacktoberfest
todo
issue or pr with 'todo' will ignore expiration
#1300
opened Oct 18, 2024 by
skydoorkai
How does DLRover make sure all the nodes in one job are on one switch?
question
Further information is requested
#1298
opened Oct 17, 2024 by
gangxie112
add xpu monitor for dlrover
Hacktoberfest
todo
issue or pr with 'todo' will ignore expiration
#1290
opened Oct 12, 2024 by
majieyue
Can you create a dlrover arm64 image for Ascend NPU?
question
Further information is requested
#1248
opened Aug 22, 2024 by
xmarker
Question: How does DLRover integrate with Llama Factory?
question
Further information is requested
#1244
opened Aug 21, 2024 by
hetingyou
xpu timer python package
todo
issue or pr with 'todo' will ignore expiration
#1159
opened Jun 17, 2024 by
zxyyzx
Megatron-LM flash-ckpt cannot save checkpoints to disk when using pipeline parallelism
help wanted
Extra attention is needed
investigating
The job stops restarting workers and exits if the traceback is a code bug.
enhancement
New feature or request
question
Further information is requested
todo
issue or pr with 'todo' will ignore expiration