
Add multi log viewer for all serve logs to Serve overview page and Serve Deployment page #76

Closed
wants to merge 163 commits

Conversation

alanwguo

Why are these changes needed?

Part 1 of 2 changes:

Part 1: ray-project#42079

Part 2:

  • Add a multi log viewer to serve deployments list page and serve deployments detail page.
  • Make the log viewer remember which log the user is viewing via the query params, so the selection is preserved when going back through browser history or refreshing the page.

Screenshot 2023-12-21 at 10 16 02 PM
Screenshot 2023-12-21 at 10 16 06 PM
Screenshot 2023-12-21 at 10 16 10 PM
Screenshot 2023-12-21 at 10 16 15 PM
Screenshot 2023-12-21 at 10 16 21 PM

Related issue number

ray-project#42055

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

can-anyscale and others added 17 commits December 20, 2023 17:13
…_rlm_tf2 to flaky (ray-project#42013)

Signed-off-by: can <can@anyscale.com>
…ct#42044)

This PR fixes the Parquet metadata prefetching task when reading a large number of Parquet files on S3 (>50k). Before this PR, the Parquet metadata prefetch task ran on the head node (w/ `DEFAULT` scheduling strategy) and did not retry on transient S3 exceptions, so it could fail very quickly: it launched too many requests from the same node and was throttled by S3.

This PR does 3 things:
* Fix the scheduling strategy to use `SPREAD`, same as the read task, to spread metadata prefetch tasks across the cluster. This avoids hitting S3 with too many requests from the same node.
* Auto-retry on `OSError`, which S3 raises for transient errors such as `Access Denied` and `Read Timeout`.
* Extract the `num_cpus` default value into a variable, so we can tune it to control the concurrency of the metadata prefetch task for a particular workload. Sometimes `num_cpus=0.5` does not work well.

Signed-off-by: Cheng Su <scnju13@gmail.com>
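The auto-retry behavior described above can be sketched in plain Python. This is an illustrative helper, not Ray Data's actual internals; the function name, retry count, and delay are assumptions:

```python
import time

def retry_on_oserror(fn, max_retries=8, base_delay_s=0.0):
    """Retry fn when it raises OSError (e.g. transient S3 'Access Denied'
    or 'Read Timeout'), with exponentially spaced attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            time.sleep(base_delay_s * (2 ** attempt))

# Simulate a fetch that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Read Timeout")
    return "parquet-metadata"

print(retry_on_oserror(flaky_fetch))  # succeeds on the third attempt
```

In the actual change, the equivalent effect comes from Ray task retry options rather than a wrapper like this.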
This adds fault tolerance and a teardown method for compiled DAGs.
will only trigger for rllib contrib

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Fix another typo in windows flaky test jobs >.<

Signed-off-by: can <can@anyscale.com>
---------

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
…a but non-zero lookback. The solution now counts only data after the lookback, which is fine, as the length is used for comparison only and the lookback can differ between different data anyway, e.g. if a user needs 10 prior observations but not the corresponding infos. (ray-project#42009)
)

Improve databricks UC datasource error message

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ject#42039)

Continuation of: ray-project#42017.

UserCallableWrapper no longer depends on deployment_config at all.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#42041)

We currently have a flat deadline of 0.1s (by default). Under heavy load or high network latency, this deadline might be consistently missed, causing requests to pile up because they can't be scheduled.

ray-project#42001 made this deadline configurable, but setting it high by default defeats its purpose (to reduce tail latency when a single replica is overloaded/blocked/unresponsive).

This change backs off the deadline exponentially so the initial deadline can still be low while avoiding "halting" under degraded conditions.

The max is set to 1s by default but can be configured using `RAY_SERVE_MAX_QUEUE_LENGTH_RESPONSE_DEADLINE_S`.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
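As a rough sketch of the backoff described above: only the 0.1s initial deadline and 1.0s cap come from the description; the doubling factor and function name here are assumptions:

```python
def next_deadline(current_s, max_s=1.0, factor=2.0):
    """Exponentially back off the queue-length response deadline, capped at max_s."""
    return min(current_s * factor, max_s)

deadline = 0.1  # initial deadline stays low to protect tail latency
deadlines = []
for _ in range(5):
    deadlines.append(deadline)
    deadline = next_deadline(deadline)

print(deadlines)  # [0.1, 0.2, 0.4, 0.8, 1.0]
```

The deadline starts low so a single overloaded replica is detected quickly, but grows under sustained misses instead of halting scheduling.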
…t#41873)

* Updates the execution summary string to be "N tasks executed, N blocks produced". The "total" block count was removed since it was incorrect.
* Fixes tasks per node. Previously we assumed each task produces one block, so we were outputting blocks per node.
* Adds rows per task to the output.
* Renames StageStatsSummary to OperatorStatsSummary.

Related issue number

Closes ray-project#41280

---------

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
…port` (ray-project#42060)

As the document shows, the default value for dashboard-agent-listen-port is a random one.
This may not be correct: as the Ray start script shows, it is set to a default value.

Signed-off-by: surenyufuz <surenyufuz@163.com>
`cluster_resources()` is called in each scheduling iteration. I found this became the bottleneck for training ingestion workloads. Cache the result to reduce the overhead.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
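A minimal sketch of the caching idea (an illustrative class, not Ray's actual scheduler code): the expensive call runs at most once per scheduling iteration, and the cache is invalidated at each iteration boundary.

```python
class SchedulingLoop:
    """Sketch: cache an expensive cluster-state call per scheduling iteration."""

    def __init__(self, fetch_resources):
        self._fetch = fetch_resources  # e.g. ray.cluster_resources
        self._cached = None

    def start_iteration(self):
        # Invalidate at the iteration boundary so the next call refetches.
        self._cached = None

    def cluster_resources(self):
        if self._cached is None:
            self._cached = self._fetch()  # fetched at most once per iteration
        return self._cached

def fetch():
    print("fetching cluster resources")  # would be an expensive call
    return {"CPU": 8}

loop = SchedulingLoop(fetch)
loop.start_iteration()
loop.cluster_resources()
loop.cluster_resources()  # served from the cache; fetch runs only once
```

The trade-off is that resource changes mid-iteration are not observed, which is acceptable since scheduling decisions within one iteration use a consistent snapshot anyway.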
…rve Deployment page

Signed-off-by: Alan Guo <aguo@anyscale.com>
Jocn2020 and others added 11 commits December 22, 2023 10:01
…ject#42081)

Skip memory profiling osx test since memray is not installed in osx image.

Signed-off-by: Jocn2020 <jonathannitisastro@gmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ay-project#42032)

On Databricks runtime, when a user starts a Ray-on-Spark cluster in a notebook and the notebook is detached, the Ray head node is killed, but we observe that in some cases the Ray worker nodes are still running, which causes the background Spark job to hang and not release resources.

This PR addresses that issue: when a Databricks notebook is detached, we ensure all Spark jobs created in that notebook REPL are killed.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ject#42030)

With the updates to the NPU foundational software CANN, frameworks like PyTorch (NPU) and MindSpore can now detect the ASCEND_RT_VISIBLE_DEVICES environment variable and configure devices accordingly. This is the usage recommended by the official documentation.

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
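For illustration, the variable holds a comma-separated list of device IDs (like CUDA_VISIBLE_DEVICES for GPUs). The parsing helper below is hypothetical, not code from this PR:

```python
import os

def visible_npu_ids(env=None):
    """Sketch: parse ASCEND_RT_VISIBLE_DEVICES as a comma-separated ID list."""
    env = os.environ if env is None else env
    raw = env.get("ASCEND_RT_VISIBLE_DEVICES", "")
    return [int(i) for i in raw.split(",") if i.strip()]

print(visible_npu_ids({"ASCEND_RT_VISIBLE_DEVICES": "0,2,3"}))  # [0, 2, 3]
```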
Release tests for the LLM fine-tuning template have been failing.
https://buildkite.com/ray-project/release/builds/4480#018c7bd5-5e8a-4085-9dc1-1b7361dc6c87/6-415

The tests were therefore disabled before: ray-project#42038

That is because of torchvision and torchaudio version requirements that are missing from the template Docker image but present in the cluster env we use for testing, which makes no sense.

These changes fix that.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ct#42099)

This PR allows for more configurability in `init_torch_dist_process_group` function by enabling the passing of user defined kwargs to the `torch.distributed.init_process_group` function. Crucially, this allows for the timeout argument to be specified by the user.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
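The kwargs pass-through can be sketched without torch installed. Names here are illustrative: `init_fn` stands in for `torch.distributed.init_process_group`, which does accept a `timeout` keyword (a `datetime.timedelta`):

```python
from datetime import timedelta

def init_torch_dist_process_group_sketch(init_fn, backend="nccl", **kwargs):
    """Sketch of the kwargs pass-through: extra keyword args, e.g. timeout,
    are forwarded verbatim to the underlying init_process_group call."""
    init_fn(backend=backend, **kwargs)

# Capture what would be forwarded, without needing torch installed.
captured = {}
init_torch_dist_process_group_sketch(
    lambda **kw: captured.update(kw),
    timeout=timedelta(minutes=30),
)
```

Forwarding `**kwargs` verbatim means new `init_process_group` options work without further wrapper changes.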
Update line numbers

Signed-off-by: Kleber Noel <42589399+klebster2@users.noreply.github.com>
…ay-project#42043)

Pure refactor, moves all queue-related metrics management out of the UserCallableWrapper and into a standalone class. This is a general cleanup/improvement but also a step towards removing the usage of the hacky Ray core actor call stats dictionary and managing queue metrics directly in the ReplicaActor.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove code for the old stages optimizer path (`ExecutionPlan._optimize()`), as well as related methods and tests. This should all be dead code, now that the streaming executor is enabled by default.

Signed-off-by: Scott Lee <sjl@anyscale.com>
peytondmurray and others added 29 commits January 11, 2024 16:27
Signed-off-by: pdmurray <peynmurray@gmail.com>
For Arrow nightly Data CI tests, the PyArrow version is of the form 15.0.0.dev404, which the previous datasets.Version utility we were using to gate the test cannot handle. Use the generic Version class instead, which can handle these version types.

Signed-off-by: Scott Lee <sjl@anyscale.com>
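For illustration, the generic `Version` class from the `packaging` library (assumed installed; the snippet is a sketch, not the test's actual gating code) parses such dev-release strings and orders them per PEP 440:

```python
from packaging.version import Version

v = Version("15.0.0.dev404")   # nightly-style version string
print(v.is_devrelease)         # True
print(v < Version("15.0.0"))   # True: dev releases sort before the final release
```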
Persist CI test results on every test run:

Signed-off-by: can <can@anyscale.com>
A script to run CI state machine. Later on I'll create a buildkite job to run this nightly.

Signed-off-by: can <can@anyscale.com>
- enabled debugpy as the ray debugger for breakpoint and post_mortem debugging
- added flag RAY_DEBUG=1 to enable debugpy. If RAY_DEBUG is not set and RAY_PDB is set, then rpdb will be used.
- used state api to save worker debugging port.
…worker (ray-project#42332)

RuntimeEnvContext.exec_worker used `" ".join(cmds)` to construct the worker process command but did no shell escaping. This caused the worker process to fail to start if there was any special character (e.g. `?`) in the command. Instead, we should use `shlex.join`.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
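The difference is easy to demonstrate with a small standalone example (not the actual `exec_worker` code; the URL is made up):

```python
import shlex

cmds = ["python", "worker.py", "--url", "http://host/path?a=1&b=2"]

unsafe = " ".join(cmds)  # '?' and '&' would be interpreted by the shell
safe = shlex.join(cmds)  # quotes each argument as needed

print(safe)  # python worker.py --url 'http://host/path?a=1&b=2'
```

`shlex.join` (Python 3.8+) applies `shlex.quote` to each argument, so the command round-trips through the shell unchanged.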
…project#42079)

Part 1 of 2 changes:

Part 1:

* Updates the layout of the Serve Ray dashboard to be deployments-first.
* Creates a deployments detail page.
* Updates the recent Serve card in the Overview page to point to deployments instead of applications.
* [Optimization]: Re-uses the same SWR cache key for the getServeApplications call so we don't have to refetch the API as often and the data is kept in sync between pages.

Part 2 (Future PR):

* Add a multi log viewer to the serve deployments list page and serve deployments detail page.

---------

Signed-off-by: Alan Guo <aguo@anyscale.com>
…ct#42358)

Improve the documentation section summarizing the different shuffle options, so users have one place to understand which shuffle options are available.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Removes `_stages_before/after_snapshot` from `ExecutionPlan`.

This should be merged after ray-project#41747 and ray-project#41544
---------

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Fixes a typo I just happened to notice while reading. Feel free to open as your own PR in case of any CLA or attribution concerns.

Signed-off-by: Mat Schaffer <115565899+matschaffer-roblox@users.noreply.github.com>
Signed-off-by: Yuchao Zhang <418121364@qq.com>
…#42360)

---------

Signed-off-by: rickyyx <rickyx@anyscale.com>
ray-project#42285)

This test is flaky. Its purpose is to check that the producer stops producing blocks after block 2 is generated and before block 0 is taken by the consumer. However, it's hard to collect timestamps for some events that happen inside Data and Core internals, e.g. the time when an object is taken at the streaming-generator level. We used the consumer task timestamp as an approximation, but the test is still flaky on slow machines, where the task can take a long time to start after being scheduled.

Remove this flaky test, as we already have another e2e backpressure test, `test_large_e2e_backpressure`, which checks the amount of spilled data.

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Minor documentation fixes:

* fix syntax in example code
* fix link in docs

---------

Signed-off-by: arunppsg <arunppsg@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
complete state machine bot script

Signed-off-by: can <can@anyscale.com>
Persist only the final test result on retries

Signed-off-by: can <can@anyscale.com>
…42371)

Add a multipy version of the corebuild images. I copied the existing wanda file into another multipy version; the old wanda file will be deprecated once the system is completely migrated to multipy.

Signed-off-by: can <can@anyscale.com>
…-project#42169)

* Reenable release tests

* Remove requirements from testing byod

* Update to flash attention 2

* move requirement to requirements txt

* Kourosh's and Lonnie's comments

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* move flash-attn to Dockerfile

* fix --no-build-isolation

* up ray version

* Update deepspeed version to 0.10.3

* Change deepspeed in ci to 0.10.3

* change deepspeed to 0.10.2

* downgrade pydantic so that we can use deepspeed 0.10.2

* Update release test deps

* Upgrade transformers to support FAv2

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Modifies a few global docs styles and makes `index.html` use the same `pygments`-based code highlighting as elsewhere in the docs.

- Instead of inserting the raw html for `index.html` using sphinx, I made the index page use its own template. This allows us to pass in a `pygments` highlighting function to the HTML context, which produces the same code highlighting used elsewhere in the docs.
- Renamed `splash.css` -> `index.css`, `splash.js` -> `index.js`

Signed-off-by: pdmurray <peynmurray@gmail.com>
…backend integration of multi-turn conversation (ray-project#42244)

* add anchoring functionality in chat pop up in preparation for backend integration of multi-turn conversation

Signed-off-by: Chris Zhang <chris@anyscale.com>

* improve UX with pressing enter and update copy

* remove commented out code

---------

Signed-off-by: Chris Zhang <chris@anyscale.com>
Move to flaky two serve tests that are failing on postmerge and cannot be blamed on a single PR

Signed-off-by: can <can@anyscale.com>
Fix a typo in workspace_template_serving_stable_diffusion test definition

Signed-off-by: can <can@anyscale.com>
A temporary mitigation to the issue before we find a more complete solution.

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo closed this Jan 16, 2024