Add a sanity checking release test for Alpa and ray nightly. #32995

gjoliver · 2023-03-03T07:40:37Z

Why are these changes needed?

First nightly integration test for Alpa and Ray

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- [*] Release tests
- This PR is not tested :(

Signed-off-by: Jun Gong <jungong@anyscale.com>

jiaodong

Great start, with many more to come!

jiaodong · 2023-03-13T01:48:22Z

release/alpa_tests/train_opt_2_7b_minimum.py

+    config = AutoConfig.from_pretrained(model_args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.model_name_or_path,
+        use_fast=False,


We can save some time with use_fast=True, right?


jiaodong · 2023-03-13T01:50:51Z

release/alpa_tests/train_opt_2_7b_minimum.py

+                    f"Step... {cur_step} | "
+                    f"Loss: {train_metric['loss'].mean():.4f}, "
+                    f"Throughput: {throughput_tokens:.2f} token/s, "
+                    f"{throughput_tflops:.2f} TFLOP/s"


Do we continuously track these metrics in a dashboard? Generally speaking, we should throw away the first 1-2 steps and average the subsequent ones for perf tracking.

Not blocking; this can go in the next PR.
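
As a rough illustration of the suggestion above, here is a minimal sketch of discarding warm-up steps before averaging. The function name `steady_state_throughput`, the `warmup_steps` parameter, and the sample numbers are all hypothetical and are not taken from the test script in this PR.

```python
from statistics import mean
from typing import Sequence


def steady_state_throughput(per_step: Sequence[float], warmup_steps: int = 2) -> float:
    """Average per-step throughput, ignoring the first `warmup_steps` entries.

    Hypothetical helper: the first steps usually include one-time costs
    (compilation, cache warm-up), so they are excluded from the average.
    """
    if warmup_steps < 0:
        raise ValueError("warmup_steps must be non-negative")
    steady = per_step[warmup_steps:]
    if not steady:
        raise ValueError("need more measurements than warmup_steps")
    return mean(steady)


# Made-up numbers purely for illustration; the first two steps are slow.
tokens_per_second = [1200.0, 30000.0, 70000.0, 72000.0, 74000.0]
print(steady_state_throughput(tokens_per_second))
```

Averaging only the later steps should keep one-time startup cost from dragging down the number that gets tracked over time.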



gjoliver

thanks

gjoliver · 2023-03-13T17:56:10Z

release/alpa_tests/train_opt_2_7b_minimum.py

+    config = AutoConfig.from_pretrained(model_args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.model_name_or_path,
+        use_fast=False,


Doesn't hurt.
The majority of the time is spent running the model; the model is just so slow.
Changed.

gjoliver · 2023-03-13T18:13:10Z

release/alpa_tests/train_opt_2_7b_minimum.py

+                    f"Step... {cur_step} | "
+                    f"Loss: {train_metric['loss'].mean():.4f}, "
+                    f"Throughput: {throughput_tokens:.2f} token/s, "
+                    f"{throughput_tflops:.2f} TFLOP/s"


Good idea.
I made the change to write a JSON dict of token and TFLOPs throughput.
I'll run another test before merging.
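
For readers unfamiliar with this kind of metrics reporting, here is a minimal sketch of writing such a JSON dict. The helper name `write_throughput_metrics`, the output file name, and the sample values are assumptions for illustration only; the actual script in this PR may choose the output location and keys differently.

```python
import json
import os
import tempfile


def write_throughput_metrics(path, throughput_tokens, throughput_tflops):
    """Write the two throughput numbers to `path` as a JSON dict.

    Hypothetical helper: the key names mirror the variables in the
    training log line, but the real test may use different ones.
    """
    results = {
        "throughput_tokens": float(throughput_tokens),
        "throughput_tflops": float(throughput_tflops),
    }
    with open(path, "w") as f:
        json.dump(results, f)
    return results


# Illustrative values and an assumed output location in the temp directory.
output_path = os.path.join(tempfile.gettempdir(), "alpa_release_test_out.json")
write_throughput_metrics(output_path, 70000.0, 150.0)
```

Keeping the result as a flat dict of numbers should make it easy for a release-test harness to pick the values up and plot them over time.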


krfricke

LGTM, just quickly check the Python version.

release/alpa_tests/app_config.yaml

krfricke · 2023-03-14T20:06:02Z

release/release_tests.yaml

+  working_dir: alpa_tests
+
+  frequency: nightly


Suggested change

-  working_dir: alpa_tests
-
-  frequency: nightly
+  working_dir: alpa_tests
+
+  python: "3.9"
+
+  frequency: nightly

Good to know. I'll keep it up my sleeve for now :)

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Signed-off-by: Jun Gong <gongjunoliver@hotmail.com>

gjoliver · 2023-03-14T20:15:19Z

@jiaodong I'm going to merge this now. We get metrics like the following, and we can always optimize things more later.

[INFO 2023-03-14 11:47:14,557] log.py: 41  Observed the following results:
throughput_tokens = 72268.23956628374
throughput_tflops = 152.47067618098896


Jun Gong added 5 commits March 8, 2023 06:58

Add a sanity checking release test for Alpa and ray nightly.

986241b

Signed-off-by: Jun Gong <jungong@anyscale.com>

switch command line

7d47cf1

Signed-off-by: Jun Gong <jungong@anyscale.com>

use numpy 1.21

9135ff0

Signed-off-by: Jun Gong <jungong@anyscale.com>

update entry commandline

3f165ec

Signed-off-by: Jun Gong <jungong@anyscale.com>

bring alpa test up to date

bac8481

Signed-off-by: Jun Gong <jungong@anyscale.com>

gjoliver force-pushed the alpa-test branch from 5195260 to bac8481 on March 8, 2023 14:59

Jun Gong added 6 commits March 9, 2023 07:26

revert back to use standard jaxlib

90650ce

Signed-off-by: Jun Gong <jungong@anyscale.com>

test

c9f0921

Signed-off-by: Jun Gong <jungong@anyscale.com>

fix ray wheel installation

874286a

Signed-off-by: Jun Gong <jungong@anyscale.com>

switch nccl_mode

fb76c5b

Signed-off-by: Jun Gong <jungong@anyscale.com>

print nccl mode

f4f52fa

Signed-off-by: Jun Gong <jungong@anyscale.com>

lint

450056f

Signed-off-by: Jun Gong <jungong@anyscale.com>

gjoliver assigned krfricke and jiaodong Mar 12, 2023

jiaodong reviewed Mar 13, 2023

better metrics reporting

dbdaf14

Signed-off-by: Jun Gong <jungong@anyscale.com>

gjoliver commented Mar 13, 2023

import json

ac43d76

Signed-off-by: Jun Gong <jungong@anyscale.com>

krfricke approved these changes Mar 14, 2023

Update release/alpa_tests/app_config.yaml

a2403a5

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Signed-off-by: Jun Gong <gongjunoliver@hotmail.com>

gjoliver merged commit 004cd2b into ray-project:master Mar 14, 2023

edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

[AIR] Add a sanity checking release test for Alpa and ray nightly. (r…

9ccd36f

…ay-project#32995) Signed-off-by: Jun Gong <jungong@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023

[AIR] Add a sanity checking release test for Alpa and ray nightly. (r…

45e6496

…ay-project#32995) Signed-off-by: Jun Gong <jungong@anyscale.com>
