forked from ray-project/ray
Add multi log viewer for all serve logs to Serve overview page and Serve Deployment page #76
Closed
…_rlm_tf2 to flaky (ray-project#42013) Signed-off-by: can <can@anyscale.com>
…ct#42044) This PR fixes the Parquet metadata prefetching task when reading a large number of Parquet files on S3 (>50k). Before this PR, the Parquet metadata prefetch task ran on the head node (with the `DEFAULT` scheduling strategy) and did not retry on transient S3 exceptions. It could therefore fail very quickly, because it launched too many requests from the same node and was throttled by S3. This PR does 3 things:
* Fix the scheduling strategy to use `SPREAD`, the same as the read task, to spread metadata prefetch tasks across the cluster. This avoids hitting S3 with too many requests from the same node.
* Auto-retry on `OSError`, which is raised when S3 throws a transient error such as `Access Denied` or `Read Timeout`.
* Extract the `num_cpus` default value into a variable, so the value can be tuned to control the concurrency of metadata prefetch tasks for a particular workload. Sometimes `num_cpus=0.5` does not work well.
Signed-off-by: Cheng Su <scnju13@gmail.com>
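The auto-retry on `OSError` described in this commit can be illustrated with a small standalone sketch. This is not the code from the PR: the helper name `call_with_retry`, its parameters, and the exponential delay between attempts are all assumptions made for illustration.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay_s=0.0, retry_on=(OSError,)):
    """Call fn(), retrying when it raises one of the `retry_on` exceptions.

    Hypothetical helper: the name, defaults, and backoff policy are
    illustrative assumptions, not Ray's actual implementation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait longer after each failed attempt (base, 2x base, 4x base, ...).
            time.sleep(base_delay_s * 2 ** (attempt - 1))


def make_flaky_fetch(failures_before_success):
    """Build a fake metadata fetch that fails with OSError a few times first."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise OSError("Read Timeout")
        return {"num_rows": 100}

    return fetch, state
```

A transient failure is retried and the call eventually succeeds, while an exception type outside `retry_on` propagates immediately. Separately, Ray tasks accept `scheduling_strategy="SPREAD"` through `.options(...)`, which is the kind of setting the first bullet refers to.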
This adds fault tolerance and a teardown method for compiled DAGs.
will only trigger for rllib contrib Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Fix another typo in windows flaky test jobs >.< Signed-off-by: can <can@anyscale.com>
--------- Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com> Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
…a but non-zero lookback. The solution now counts only data after the lookback, which is fine because the length is used for comparison only and the lookback can differ between data anyway, e.g. if a user needs 10 prior observations but not the corresponding infos. (ray-project#42009)
…ct#42057) Signed-off-by: Frank Luan <lsf@berkeley.edu>
…ject#42039) Continuation of: ray-project#42017. UserCallableWrapper no longer depends on deployment_config at all. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#42041) We currently have a flat deadline of 0.1s (by default). Under heavy load or high network latency conditions, this deadline might be consistently missed and cause requests to pile up because they're unable to be scheduled. ray-project#42001 made this deadline configurable, but setting it high by default defeats its purpose (to reduce tail latency when a single replica is overloaded/blocked/unresponsive). This change backs off the deadline exponentially so the initial deadline can still be low while avoiding "halting" under degraded conditions. The max is set to 1s by default but can be configured using `RAY_SERVE_MAX_QUEUE_LENGTH_RESPONSE_DEADLINE_S`. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
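The exponential backoff described in this commit can be sketched as a generator of successive deadlines. The initial deadline (0.1s) and the default maximum (1s) come from the commit message; the doubling multiplier and the function name `backoff_deadlines` are assumptions for illustration and may not match the Serve implementation.

```python
import itertools


def backoff_deadlines(initial_s=0.1, max_s=1.0, multiplier=2.0):
    """Yield response deadlines that grow exponentially up to `max_s`.

    Hypothetical sketch: the multiplier of 2.0 is an assumption.
    """
    deadline = initial_s
    while True:
        yield deadline
        # Grow the deadline for the next attempt, but never past the cap.
        deadline = min(deadline * multiplier, max_s)


# The first few deadlines a caller would use on repeated misses.
first_six = list(itertools.islice(backoff_deadlines(), 6))
```

The first attempt keeps the low deadline, which preserves the tail-latency benefit, and only repeated misses push the deadline toward the cap.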
…t#41873)
- Updates the execution summary string to be N tasks executed, N blocks produced. The "total" blocks count was removed since it was incorrect.
- Fixes tasks per node. Previously we assumed that each task produces one block, so we were outputting blocks per node.
- Adds rows per task to the output.
- Renames StageStatsSummary to OperatorStatsSummary.
Related issue number: Closes ray-project#41280
--------- Signed-off-by: Andrew Xue <andewzxue@gmail.com>
…port` (ray-project#42060) The documentation states that the default value for dashboard-agent-listen-port is random. This may be incorrect: the Ray start script shows that it is set to a default value. Signed-off-by: surenyufuz <surenyufuz@163.com>
`cluster_resources()` is called in each scheduling iteration. I found this became the bottleneck for training ingestion workloads. Cache the result to reduce the overhead. --------- Signed-off-by: Hao Chen <chenh1024@gmail.com>
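One common way to cache the result of an expensive call is to reuse it for a short time window. The sketch below assumes a time-based cache. The class name `CachedValue`, the `ttl_s` parameter, and the injectable clock are hypothetical, and the commit message does not say which caching policy the PR actually uses.

```python
import time


class CachedValue:
    """Cache the result of `compute()` for `ttl_s` seconds.

    Hypothetical sketch of a time-based cache, not the code from the PR.
    """

    def __init__(self, compute, ttl_s, clock=time.monotonic):
        self._compute = compute
        self._ttl_s = ttl_s
        self._clock = clock  # Injectable so the cache can be tested without sleeping.
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: recompute and restart the window.
            self._value = self._compute()
            self._expires_at = now + self._ttl_s
        return self._value
```

In a scheduling loop, the expensive call would be wrapped once and `get()` used in every iteration, so the underlying call runs at most once per window.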
…42061) Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…rve Deployment page Signed-off-by: Alan Guo <aguo@anyscale.com>
alanwguo force-pushed the multi-log-viewer branch from 5a655f6 to 33b9f5c (December 22, 2023 06:55)
…ject#42081) Skip memory profiling osx test since memray is not installed in osx image. Signed-off-by: Jocn2020 <jonathannitisastro@gmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ay-project#42032) On the Databricks runtime, when a user starts a Ray on Spark cluster in a notebook and the notebook is detached, the Ray head node is killed, but in some cases the Ray worker nodes keep running. This leaves the background Spark job hanging and unable to release resources. This PR addresses the issue: when a Databricks notebook is detached, we enforce that all Spark jobs created in this notebook REPL are killed. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…ject#42030) With the updates to the NPU foundational software CANN, frameworks like PyTorch(NPU) and MindSpore can now detect the ASCEND_RT_VISIBLE_DEVICES environment variable and intelligently configure devices. This is the recommended usage by the official documentation. Signed-off-by: Xiaoshuang Liu <liuxiaoshuang4@huawei.com>
Release tests for the LLM fine-tuning template have been failing. https://buildkite.com/ray-project/release/builds/4480#018c7bd5-5e8a-4085-9dc1-1b7361dc6c87/6-415 The tests were therefore disabled earlier: ray-project#42038 The failure is caused by torchvision and torchaudio version requirements that are not in the template Docker image but are in the cluster env we use for testing, which makes no sense. These changes fix that.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ct#42099) This PR allows for more configurability in `init_torch_dist_process_group` function by enabling the passing of user defined kwargs to the `torch.distributed.init_process_group` function. Crucially, this allows for the timeout argument to be specified by the user. --------- Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Update line numbers Signed-off-by: Kleber Noel <42589399+klebster2@users.noreply.github.com>
…ay-project#42122) Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#42043) Pure refactor, moves all queue-related metrics management out of the UserCallableWrapper and into a standalone class. This is a general cleanup/improvement but also a step towards removing the usage of the hacky Ray core actor call stats dictionary and managing queue metrics directly in the ReplicaActor. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove code for the old stages optimizer path (`ExecutionPlan._optimize()`), as well as related methods and tests. This should all be dead code, now that the streaming executor is enabled by default. Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: pdmurray <peynmurray@gmail.com>
For Arrow nightly Data CI tests, the Pyarrow version has the form 15.0.0.dev404, which the previous datasets.Version utility we used to gate the test cannot handle. Use the generic Version class instead, which can handle these version types. Signed-off-by: Scott Lee <sjl@anyscale.com>
Persist CI test results on every test run: Signed-off-by: can <can@anyscale.com>
A script to run CI state machine. Later on I'll create a buildkite job to run this nightly. Signed-off-by: can <can@anyscale.com>
- enabled debugpy as the ray debugger for breakpoint and post_mortem debugging
- added flag RAY_DEBUG=1 to enable debugpy. If RAY_DEBUG is not set and RAY_PDB is set, then rpdb will be used.
- used state api to save worker debugging port.
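The flag precedence described above can be written as a small selection function. This is a sketch under assumptions: the function name `choose_debugger` is hypothetical, "set" is taken to mean `RAY_DEBUG` equals `"1"` and `RAY_PDB` is any non-empty value, and returning `None` when neither flag is set is a guess at behavior the commit message does not specify.

```python
import os


def choose_debugger(env=None):
    """Pick a debugger name from environment flags.

    Hypothetical sketch of the precedence in the commit message:
    RAY_DEBUG=1 selects debugpy; otherwise RAY_PDB selects rpdb.
    """
    if env is None:
        env = os.environ
    if env.get("RAY_DEBUG") == "1":
        return "debugpy"
    if env.get("RAY_PDB"):
        return "rpdb"
    # Assumption: no debugger is enabled when neither flag is set.
    return None
```

Passing a plain dict instead of `os.environ` makes the precedence easy to check, for example `choose_debugger({"RAY_DEBUG": "1", "RAY_PDB": "1"})`.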
…worker (ray-project#42332) RuntimeEnvContext.exec_worker used " ".join(cmds) to construct the worker process command, but it did not do any shell escaping. This causes the worker process to fail to start if the command contains any special character (e.g. ?). Instead, we should use shlex.join. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
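The difference between the two joins can be seen with the standard library alone. The example command below is made up for illustration; only `shlex.join` and `shlex.split` (Python 3.8+) are real APIs here.

```python
import shlex

# A made-up command whose last argument contains a space and a "?".
cmds = ["echo", "is it ready?"]

# Plain joining loses the argument boundaries and leaves "?" unescaped.
naive = " ".join(cmds)

# shlex.join quotes each argument so a POSIX shell sees the original list.
escaped = shlex.join(cmds)

# Splitting the escaped string recovers the original arguments.
round_trip = shlex.split(escaped)
```

Here `shlex.split(naive)` yields four tokens instead of two, while `round_trip` equals `cmds`, which is the property a command string passed to a shell needs.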
…project#42079) Part 1 of 2 changes:
Part 1:
- Updates the layout of the Serve Ray dashboard to be deployments first.
- Creates a deployments detail page.
- Updates the recent Serve card in the Overview page to point to deployments instead of applications.
- [Optimization]: Re-uses the same SWR cache key for the getServeApplications call so we don't have to refetch the API as often and so the data is kept in sync between pages.
Part 2 (Future PR): Add a multi log viewer to the Serve deployments list page and the Serve deployments detail page.
--------- Signed-off-by: Alan Guo <aguo@anyscale.com>
…ct#42358) Improve the section in documentation to summarize different shuffle options, so users have a place to understand what shuffle options we have. Signed-off-by: Cheng Su <scnju13@gmail.com>
Removes `_stages_before/after_snapshot` from `ExecutionPlan`. This should be merged after ray-project#41747 and ray-project#41544 --------- Signed-off-by: Andrew Xue <andewzxue@gmail.com> Signed-off-by: Scott Lee <sjl@anyscale.com> Co-authored-by: Scott Lee <sjl@anyscale.com>
Fixes a typo I just happened to notice while reading. Feel free to open as your own PR in case of any CLA or attribution concerns. Signed-off-by: Mat Schaffer <115565899+matschaffer-roblox@users.noreply.github.com>
Signed-off-by: Yuchao Zhang <418121364@qq.com>
…#42360) --------- Signed-off-by: rickyyx <rickyx@anyscale.com>
ray-project#42285) This test is flaky. Its purpose is to check that the producer stops producing blocks after block 2 is generated and before block 0 is taken by the consumer. However, it is hard to collect the timestamps for some events that happen in Data and Core internals, e.g. the time when an object is taken at the streaming generator level. We used the consumer task timestamp as an approximation, but the test is still flaky on some slow machines where a task can take a long time to start after being scheduled. Remove this flaky test, as we already have another e2e backpressure test, `test_large_e2e_backpressure`, which checks the amount of spilled data. Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Minor documentation fixes: fix syntax in example code; fix link in docs. --------- Signed-off-by: arunppsg <arunppsg@gmail.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
complete state machine bot script Signed-off-by: can <can@anyscale.com>
Persist only the final test result on retries Signed-off-by: can <can@anyscale.com>
…42371) Add multipy version for corebuild images. I copy the existing wanda file into another multipy version. Will deprecate the old wanda file when the system is completely migrated into multipy. Signed-off-by: can <can@anyscale.com>
…-project#42169)
* Reenable release tests
* Remove requirements from testing byod
* Update to flash attention 2
* move requirement to requirements txt
* Kourosh's and Lonnie's comments Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
* move flash-attn to Dockerfile
* fix --no-build-isolation
* up ray version
* Update deepspeed version to 0.10.3
* Change deepspeed in ci to 0.10.3
* change deepspeed to 0.10.2
* downgrade pydantic so that we can use deepspeed 0.10.2
* Update release test deps
* Upgrade transformers to support FAv2
--------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Modifies a few global docs styles and makes `index.html` use the same `pygments`-based code highlighting as elsewhere in the docs. - Instead of inserting the raw html for `index.html` using sphinx, I made the index page use its own template. This allows us to pass in a `pygments` highlighting function to the HTML context, which produces the same code highlighting used elsewhere in the docs. - Renamed `splash.css` -> `index.css`, `splash.js` -> `index.js` Signed-off-by: pdmurray <peynmurray@gmail.com>
…backend integration of multi-turn conversation (ray-project#42244) * add anchoring functionality in chat pop up in preparation for backend integration of multi-turn conversation Signed-off-by: Chris Zhang <chris@anyscale.com> * improve UX with pressing enter and update copy * remove commented out code --------- Signed-off-by: Chris Zhang <chris@anyscale.com>
It's replaced by https://github.com/ray-project/enhancements/blob/main/reps/2023-10-13-accelerator-support.md Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Move 2 serve tests to flaky; they are failing on postmerge and cannot be blamed on a PR. Signed-off-by: can <can@anyscale.com>
Fix a typo in workspace_template_serving_stable_diffusion test definition Signed-off-by: can <can@anyscale.com>
A temporary mitigation to the issue before we find a more complete solution. --------- Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Why are these changes needed?
Part 1 of 2 changes:
Part 1: ray-project#42079
Part 2:
Related issue number
ray-project#42055
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.