Add a sanity checking release test for Alpa and ray nightly. #32995
Signed-off-by: Jun Gong <jungong@anyscale.com>
Great start, with many more to come!
config = AutoConfig.from_pretrained(model_args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    use_fast=False,
We can save some time with use_fast=True, right?
Doesn't hurt.
The majority of the time is spent running the model; the model is just so slow.
Changed.
f"Step... {cur_step} | "
f"Loss: {train_metric['loss'].mean():.4f}, "
f"Throughput: {throughput_tokens:.2f} token/s, "
f"{throughput_tflops:.2f} TFLOP/s"
Do we continuously track these metrics in a dashboard? Generally speaking, for perf tracking we should throw away the first 1~2 steps and average the subsequent ones.
Not blocking; this can go in the next PR.
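The warm-up suggestion above could look roughly like the sketch below. It is not code from this PR: the helper name steady_state_mean, the default of two warm-up steps, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def steady_state_mean(samples: Sequence[float], warmup_steps: int = 2) -> float:
    """Average per-step samples after dropping the first `warmup_steps` entries.

    The first steps usually include compilation and cache warm-up, so they
    are excluded from the number used for perf tracking.
    """
    if warmup_steps < 0:
        raise ValueError("warmup_steps must be non-negative")
    steady = list(samples[warmup_steps:])
    if not steady:
        raise ValueError("need more samples than warm-up steps")
    return mean(steady)


# Hypothetical per-step token throughput (tokens/s); values are made up.
tokens_per_s = [120.0, 950.0, 1000.0, 1010.0, 990.0]
avg_tokens_per_s = steady_state_mean(tokens_per_s)
```

The same helper can be applied to the per-step TFLOP/s values, so both tracked numbers exclude the slow initial steps.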
Good idea.
I made the change to write a JSON dict of token and TFLOP throughput.
Will run another test before merge.
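For readers who have not seen the change, a minimal sketch of writing such a JSON dict follows. The key names, the helper write_throughput_result, the fallback file name, and the use of a TEST_OUTPUT_JSON environment variable to pick the output path are assumptions here, not details confirmed by this thread.

```python
import json
import os
import tempfile


def write_throughput_result(tokens_per_s: float, tflops: float, path: str) -> dict:
    """Write the two throughput numbers to `path` as a JSON dict and return it."""
    # Key names are hypothetical; the PR may use different ones.
    result = {"tokens_per_second": tokens_per_s, "tflops": tflops}
    with open(path, "w") as f:
        json.dump(result, f)
    return result


# Assumption: the test harness points TEST_OUTPUT_JSON at the result file.
# Fall back to a temp-dir path so the sketch also runs locally.
default_path = os.path.join(tempfile.gettempdir(), "alpa_release_test_out.json")
out_path = os.environ.get("TEST_OUTPUT_JSON", default_path)

# Made-up numbers, for illustration only.
written = write_throughput_result(1000.0, 12.5, out_path)
```

Writing a flat dict keeps the output easy for a dashboard or a later regression check to parse.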
thanks
LGTM, just quickly check the Python version.
working_dir: alpa_tests

frequency: nightly
Suggested change:
- working_dir: alpa_tests
- frequency: nightly
+ working_dir: alpa_tests
+ python: "3.9"
+ frequency: nightly
Good to know. I'll keep it up my sleeve for now :)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Jun Gong <gongjunoliver@hotmail.com>
@jiaodong I am gonna merge this now. we get metrics like, and we can always optimize things more later.
Why are these changes needed?
First nightly integration test for Alpa and Ray
Related issue number
Checks
- Signed off every commit (git commit -s) in this PR.
- Ran scripts/format.sh to lint the changes in this PR.