
Add multi log viewer for all serve logs to Serve overview page and Serve Deployment page #76

Closed
wants to merge 163 commits

Conversation

alanwguo

Why are these changes needed?

Part 1 of 2 changes:

Part 1: ray-project#42079

Part 2:

  • Add a multi log viewer to serve deployments list page and serve deployments detail page.
  • Make the log viewer remember which log the user is viewing via the query params, so the selection is preserved when going back through browser history or refreshing the page.

Screenshot 2023-12-21 at 10 16 02 PM
Screenshot 2023-12-21 at 10 16 06 PM
Screenshot 2023-12-21 at 10 16 10 PM
Screenshot 2023-12-21 at 10 16 15 PM
Screenshot 2023-12-21 at 10 16 21 PM

Related issue number

ray-project#42055

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

can-anyscale and others added 17 commits December 20, 2023 17:13
…_rlm_tf2 to flaky (ray-project#42013)

Signed-off-by: can <can@anyscale.com>
…ct#42044)

This PR fixes the Parquet metadata prefetching task when reading a large number of Parquet files on S3 (>50k). Before this PR, the Parquet metadata prefetch task ran on the head node (w/ `DEFAULT` scheduling strategy) and did not retry on transient S3 exceptions, so it could fail very quickly: it launched too many requests from the same node and was throttled by S3.

This PR does 3 things:
* Fix the scheduling strategy to use `SPREAD`, same as the read task, to spread metadata prefetch tasks across the cluster. This avoids hitting S3 with too many requests from the same node.
* Auto-retry on `OSError`, which S3 raises for transient errors such as `Access Denied` and `Read Timeout`.
* Extract the `num_cpus` default value into a variable, so we can tune it to control the concurrency of the metadata prefetch task for a particular workload. Sometimes `num_cpus=0.5` does not work well.

Signed-off-by: Cheng Su <scnju13@gmail.com>
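The auto-retry behavior described above can be sketched in plain Python. This is an illustrative helper, not Ray Data's actual internals; the function name, retry count, and delay are assumptions:

```python
import time

def retry_on_oserror(fn, max_retries=8, base_delay_s=0.0):
    """Retry fn when it raises OSError (e.g. transient S3 'Access Denied'
    or 'Read Timeout'), with exponentially spaced attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            time.sleep(base_delay_s * (2 ** attempt))

# Simulate a fetch that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Read Timeout")
    return "parquet-metadata"

print(retry_on_oserror(flaky_fetch))  # succeeds on the third attempt
```

In the actual change, the equivalent effect comes from Ray task retry options rather than a wrapper like this.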
This adds fault tolerance and a teardown method for compiled DAGs.
will only trigger for rllib contrib

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Fix another typo in windows flaky test jobs >.<

Signed-off-by: can <can@anyscale.com>
---------

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
…a but non-zero lookback. The solution now counts only data after the lookback, which is fine, as the length is used for comparison only and the lookback can differ between different data anyway, e.g. if a user needs 10 prior observations but not the corresponding infos. (ray-project#42009)
)

Improve databricks UC datasource error message

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ject#42039)

Continuation of: ray-project#42017.

UserCallableWrapper no longer depends on deployment_config at all.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#42041)

We currently have a flat deadline of 0.1s (by default). Under heavy load or high network latency, this deadline might be consistently missed, causing requests to pile up because they can't be scheduled.

ray-project#42001 made this deadline configurable, but setting it high by default defeats its purpose (to reduce tail latency when a single replica is overloaded/blocked/unresponsive).

This change backs off the deadline exponentially so the initial deadline can still be low while avoiding "halting" under degraded conditions.

The max is set to 1s by default but can be configured using `RAY_SERVE_MAX_QUEUE_LENGTH_RESPONSE_DEADLINE_S`.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
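As a rough sketch of the backoff described above: only the 0.1s initial deadline and 1.0s cap come from the description; the doubling factor and function name here are assumptions:

```python
def next_deadline(current_s, max_s=1.0, factor=2.0):
    """Exponentially back off the queue-length response deadline, capped at max_s."""
    return min(current_s * factor, max_s)

deadline = 0.1  # initial deadline stays low to protect tail latency
deadlines = []
for _ in range(5):
    deadlines.append(deadline)
    deadline = next_deadline(deadline)

print(deadlines)  # [0.1, 0.2, 0.4, 0.8, 1.0]
```

The deadline starts low so a single overloaded replica is detected quickly, but grows under sustained misses instead of halting scheduling.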
…t#41873)

* Updates the execution summary string to be "N tasks executed, N blocks produced". The "total" block count was removed since it was incorrect.
* Fixes tasks per node. Previously we assumed each task produces one block, so we were outputting blocks per node.
* Adds rows per task to the output.
* Renames StageStatsSummary to OperatorStatsSummary.

Related issue number

Closes ray-project#41280

---------

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
…port` (ray-project#42060)

As the document shows, the default value for dashboard-agent-listen-port is a random one.
This may not be correct: as the Ray start script shows, it is set to a default value.

Signed-off-by: surenyufuz <surenyufuz@163.com>
`cluster_resources()` is called in each scheduling iteration. I found this became the bottleneck for training ingestion workloads. Cache the result to reduce the overhead.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
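A minimal sketch of the caching idea (an illustrative class, not Ray's actual scheduler code): the expensive call runs at most once per scheduling iteration, and the cache is invalidated at each iteration boundary.

```python
class SchedulingLoop:
    """Sketch: cache an expensive cluster-state call per scheduling iteration."""

    def __init__(self, fetch_resources):
        self._fetch = fetch_resources  # e.g. ray.cluster_resources
        self._cached = None

    def start_iteration(self):
        # Invalidate at the iteration boundary so the next call refetches.
        self._cached = None

    def cluster_resources(self):
        if self._cached is None:
            self._cached = self._fetch()  # fetched at most once per iteration
        return self._cached

def fetch():
    print("fetching cluster resources")  # would be an expensive call
    return {"CPU": 8}

loop = SchedulingLoop(fetch)
loop.start_iteration()
loop.cluster_resources()
loop.cluster_resources()  # served from the cache; fetch runs only once
```

The trade-off is that resource changes mid-iteration are not observed, which is acceptable since scheduling decisions within one iteration use a consistent snapshot anyway.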
…rve Deployment page

Signed-off-by: Alan Guo <aguo@anyscale.com>
Jocn2020 and others added 11 commits December 22, 2023 10:01
…ject#42081)

Skip memory profiling osx test since memray is not installed in osx image.

Signed-off-by: Jocn2020 <jonathannitisastro@gmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ay-project#42032)

On Databricks runtime, when a user starts a Ray-on-Spark cluster in a notebook and the notebook is detached, the Ray head node is killed, but we observe that in some cases the Ray worker nodes are still running, which causes the background Spark job to hang and not release resources.

This PR addresses that issue: when a Databricks notebook is detached, we ensure all Spark jobs created in that notebook REPL are killed.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ject#42030)

With the updates to the NPU foundational software CANN, frameworks like PyTorch (NPU) and MindSpore can now detect the ASCEND_RT_VISIBLE_DEVICES environment variable and configure devices accordingly. This is the usage recommended by the official documentation.

Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
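For illustration, the variable holds a comma-separated list of device IDs (like CUDA_VISIBLE_DEVICES for GPUs). The parsing helper below is hypothetical, not code from this PR:

```python
import os

def visible_npu_ids(env=None):
    """Sketch: parse ASCEND_RT_VISIBLE_DEVICES as a comma-separated ID list."""
    env = os.environ if env is None else env
    raw = env.get("ASCEND_RT_VISIBLE_DEVICES", "")
    return [int(i) for i in raw.split(",") if i.strip()]

print(visible_npu_ids({"ASCEND_RT_VISIBLE_DEVICES": "0,2,3"}))  # [0, 2, 3]
```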
Release tests for the LLM fine-tuning template have been failing.
https://buildkite.com/ray-project/release/builds/4480#018c7bd5-5e8a-4085-9dc1-1b7361dc6c87/6-415

The tests were therefore disabled before: ray-project#42038

That is because of torchvision and torchaudio version requirements that are missing from the template Docker image but present in the cluster env we use for testing, which makes no sense.

These changes fix that.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ct#42099)

This PR allows for more configurability in `init_torch_dist_process_group` function by enabling the passing of user defined kwargs to the `torch.distributed.init_process_group` function. Crucially, this allows for the timeout argument to be specified by the user.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
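The kwargs pass-through can be sketched without torch installed. Names here are illustrative: `init_fn` stands in for `torch.distributed.init_process_group`, which does accept a `timeout` keyword (a `datetime.timedelta`):

```python
from datetime import timedelta

def init_torch_dist_process_group_sketch(init_fn, backend="nccl", **kwargs):
    """Sketch of the kwargs pass-through: extra keyword args, e.g. timeout,
    are forwarded verbatim to the underlying init_process_group call."""
    init_fn(backend=backend, **kwargs)

# Capture what would be forwarded, without needing torch installed.
captured = {}
init_torch_dist_process_group_sketch(
    lambda **kw: captured.update(kw),
    timeout=timedelta(minutes=30),
)
```

Forwarding `**kwargs` verbatim means new `init_process_group` options work without further wrapper changes.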
Update line numbers

Signed-off-by: Kleber Noel <42589399+klebster2@users.noreply.github.com>
…ay-project#42043)

Pure refactor, moves all queue-related metrics management out of the UserCallableWrapper and into a standalone class. This is a general cleanup/improvement but also a step towards removing the usage of the hacky Ray core actor call stats dictionary and managing queue metrics directly in the ReplicaActor.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove code for the old stages optimizer path (`ExecutionPlan._optimize()`), as well as related methods and tests. This should all be dead code, now that the streaming executor is enabled by default.

Signed-off-by: Scott Lee <sjl@anyscale.com>
peytondmurray and others added 29 commits January 11, 2024 16:27
Signed-off-by: pdmurray <peynmurray@gmail.com>
For Arrow nightly Data CI tests, the PyArrow version is of the form 15.0.0.dev404, which the previous datasets.Version utility we were using to gate the test cannot handle. Use the generic Version class instead, which can handle these version types.

Signed-off-by: Scott Lee <sjl@anyscale.com>
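For illustration, the generic `Version` class from the `packaging` library (assumed installed; the snippet is a sketch, not the test's actual gating code) parses such dev-release strings and orders them per PEP 440:

```python
from packaging.version import Version

v = Version("15.0.0.dev404")   # nightly-style version string
print(v.is_devrelease)         # True
print(v < Version("15.0.0"))   # True: dev releases sort before the final release
```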
Persist CI test results on every test run:

Signed-off-by: can <can@anyscale.com>
A script to run CI state machine. Later on I'll create a buildkite job to run this nightly.

Signed-off-by: can <can@anyscale.com>
- enabled debugpy as the ray debugger for breakpoint and post_mortem debugging
- added flag RAY_DEBUG=1 to enable debugpy. If RAY_DEBUG is not set and RAY_PDB is set, then rpdb will be used.
- used state api to save worker debugging port.
…worker (ray-project#42332)

RuntimeEnvContext.exec_worker used `" ".join(cmds)` to construct the worker process command but did no shell escaping. This caused the worker process to fail to start if there was any special character (e.g. `?`) in the command. Instead, we should use `shlex.join`.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
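The difference is easy to demonstrate with a small standalone example (not the actual `exec_worker` code; the URL is made up):

```python
import shlex

cmds = ["python", "worker.py", "--url", "http://host/path?a=1&b=2"]

unsafe = " ".join(cmds)  # '?' and '&' would be interpreted by the shell
safe = shlex.join(cmds)  # quotes each argument as needed

print(safe)  # python worker.py --url 'http://host/path?a=1&b=2'
```

`shlex.join` (Python 3.8+) applies `shlex.quote` to each argument, so the command round-trips through the shell unchanged.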
…project#42079)

Part 1 of 2 changes:

Part 1:

* Updates the layout of the Serve Ray dashboard to be deployments-first.
* Creates a deployments detail page.
* Updates the recent Serve card in the Overview page to point to deployments instead of applications.
* [Optimization]: Re-uses the same SWR cache key for the getServeApplications call so we don't have to refetch the API as often and the data is kept in sync between pages.

Part 2 (Future PR):

* Add a multi log viewer to the serve deployments list page and serve deployments detail page.

---------

Signed-off-by: Alan Guo <aguo@anyscale.com>
…ct#42358)

Improve the documentation section summarizing the different shuffle options, so users have one place to understand which shuffle options are available.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Removes `_stages_before/after_snapshot` from `ExecutionPlan`.

This should be merged after ray-project#41747 and ray-project#41544
---------

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Fixes a typo I just happened to notice while reading. Feel free to open as your own PR in case of any CLA or attribution concerns.

Signed-off-by: Mat Schaffer <115565899+matschaffer-roblox@users.noreply.github.com>
Signed-off-by: Yuchao Zhang <418121364@qq.com>
…#42360)

---------

Signed-off-by: rickyyx <rickyx@anyscale.com>
ray-project#42285)

This test is flaky. Its purpose is to check that the producer stops producing blocks after block 2 is generated and before block 0 is taken by the consumer. However, it's hard to collect timestamps for some events that happen inside Data and Core internals, e.g. the time when an object is taken at the streaming-generator level. We used the consumer task timestamp as an approximation, but the test is still flaky on slow machines, where the task can take a long time to start after being scheduled.

Remove this flaky test, as we already have another e2e backpressure test, `test_large_e2e_backpressure`, which checks the amount of spilled data.

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Minor documentation fixes:

* fix syntax in example code
* fix link in docs

---------

Signed-off-by: arunppsg <arunppsg@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
complete state machine bot script

Signed-off-by: can <can@anyscale.com>
Persist only the final test result on retries

Signed-off-by: can <can@anyscale.com>
…42371)

Add a multipy version of the corebuild images. I copied the existing wanda file into another multipy version; the old wanda file will be deprecated once the system is completely migrated to multipy.

Signed-off-by: can <can@anyscale.com>
…-project#42169)

* Reenable release tests

* Remove requirements from testing byod

* Update to flash attention 2

* move requirement to requirements txt

* Kourosh's and Lonnie's comments

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* move flash-attn to Dockerfile

* fix --no-build-isolation

* up ray version

* Update deepspeed version to 0.10.3

* Change deepspeed in ci to 0.10.3

* change deepspeed to 0.10.2

* downgrade pydantic so that we can use deepspeed 0.10.2

* Update release test deps

* Upgrade transformers to support FAv2

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Modifies a few global docs styles and makes `index.html` use the same `pygments`-based code highlighting as elsewhere in the docs.

- Instead of inserting the raw html for `index.html` using sphinx, I made the index page use its own template. This allows us to pass in a `pygments` highlighting function to the HTML context, which produces the same code highlighting used elsewhere in the docs.
- Renamed `splash.css` -> `index.css`, `splash.js` -> `index.js`

Signed-off-by: pdmurray <peynmurray@gmail.com>
…backend integration of multi-turn conversation (ray-project#42244)

* add anchoring functionality in chat pop up in preparation for backend integration of multi-turn conversation

Signed-off-by: Chris Zhang <chris@anyscale.com>

* improve UX with pressing enter and update copy

* remove commented out code

---------

Signed-off-by: Chris Zhang <chris@anyscale.com>
Move to flaky two serve tests that are failing on postmerge and cannot be blamed on a single PR

Signed-off-by: can <can@anyscale.com>
Fix a typo in workspace_template_serving_stable_diffusion test definition

Signed-off-by: can <can@anyscale.com>
A temporary mitigation to the issue before we find a more complete solution.

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo closed this Jan 16, 2024