rebase wasm-on-ray branch to latest master #35727
Conversation
Expected results: create a Recent Serve Applications card. We have a Serve page with a table containing the application name, import path, and status. We will create a card based on this information, following the styles of the Recent Jobs card. Different icons to show: https://files.slack.com/files-pri/TKC2KFWG3-F055XEKED98/screen_shot_2023-05-02_at_12.01.36_pm.png
In some cases, it's necessary to mock the useSWR function in Jest test cases. However, if we don't clear the SWR cache between different test cases, we may encounter an error where a test case unintentionally reuses data from a previous test case instead of creating new mock data. By implementing this fix, we can ensure that our test cases are isolated and independent, and that they accurately reflect the behavior of our code under different scenarios.
…oject#35155) Reverts ray-project#34642. This seems to be breaking all pipeline builds, as it is failing to build the base container.
…ct#35128) This PR ensures that the full trial status table is printed at the end of a Ray Tune run with the new output engine. Additionally, trial status data was previously always cut off; now we enforce that when `force=True`, all trial data is reported. It also fixes a bug in showing the `more_info` field (how many more trials with a specific status are available). Signed-off-by: Kai Fricke <kai@anyscale.com>
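As a rough illustration of the truncation rule described above (a simplified sketch, not the actual Tune output-engine code):

```python
# Simplified sketch: when force=True, print the full trial table; otherwise
# cut it off and report how many more trials there are via more_info.
def format_trial_table(rows, max_rows=20, force=False):
    if force or len(rows) <= max_rows:
        shown, more_info = rows, ""
    else:
        shown = rows[:max_rows]
        more_info = f"... and {len(rows) - max_rows} more trials"
    table = "\n".join(str(r) for r in shown)
    return table + ("\n" + more_info if more_info else "")
```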
…oject#32407) There are two issues. When an actor exits via sys.exit, exit_actor, or max_calls=1, we didn't cancel queued tasks, which means all queued tasks were still executed even though the exit APIs were called. This is unexpected and unintuitive behavior. The segfault happened when we call disconnect() in the exit_actor API while there are still queued tasks: the actor won't exit until the queued tasks are all executed, but since we called disconnect(), it breaks the worker with a segfault (disconnect is not expected to be called while actor tasks are executing). This happened even with a normal actor (not an async actor) if there were queued tasks when exit_actor was called. This PR fixes the issues by doing two things. First, if the C++ Exit API is called, we guarantee the queued tasks won't be executed; I fixed this by returning from ExecuteTask immediately. Alternatively, we could manually clean actor_scheduling_queue, but this would require much more complicated code to produce a good error message; I am open to this approach as well. Second, remove the disconnect call from the exit_actor API. It was written before 2020, the comment there seems irrelevant, and all tests pass, so it should be okay. I assume it was a hack and the issue from the comment was fixed at some point. This PR also adds two guarantees to the exit_actor APIs: once exit_actor or exit is called on an actor, no additional tasks run on that actor, and any queued or incoming requests fail with a clear error message. When the actor is terminated via exit_actor or exit, the atexit handler is guaranteed to be called (I will add tests).
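A minimal sketch of the guaranteed behavior described above (the actor and method names are hypothetical, and the exact error type may vary):

```python
import time
import ray

ray.init()

@ray.remote
class Worker:
    def do_work(self, i):
        time.sleep(1)
        return i

    def shutdown(self):
        # Once this runs, no further tasks from this actor should execute,
        # and the actor's atexit handlers are expected to be called.
        ray.actor.exit_actor()

w = Worker.remote()
in_flight = w.do_work.remote(0)                      # submitted before the exit
w.shutdown.remote()                                  # request the actor to exit
queued = [w.do_work.remote(i) for i in range(1, 4)]  # queued behind the exit

print(ray.get(in_flight))                            # completes normally
try:
    ray.get(queued)                                  # should not execute
except ray.exceptions.RayActorError as err:
    print("queued tasks failed with a clear error:", err)
```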
Add absolute file path to log files for each replica. ray-project#33503 (comment) Example: ``` replicas: - replica_id: foo_DAGDriver#jsrUNs state: RUNNING pid: 68276 actor_name: SERVE_REPLICA::foo_DAGDriver#jsrUNs actor_id: 7c1c702270bb634a7cf4c24f01000000 node_id: 568bf20e0658e89361a997fe57b896b15fcb97268f3b039e1513c6a5 node_ip: 192.168.1.14 start_time_s: 1679598497.387779 log_file_path_id: /serve/deployment_foo_DAGDriver_foo_DAGDriver#jsrUNs.log ```
…pyOpenSSL and cryptography (ray-project#33273) runtime_env working_dir S3 urls require a recent version of boto3 to read environment variables for authentication for downloading from private buckets. We currently include an outdated boto3 version in the Ray Docker images. This PR bumps the version in the Ray Docker images to make the S3 working_dir download feature work out of the box. The reason this is important is that users might try to use S3 URLs for runtime_env with the Ray Docker image, but it's hard to debug the failure that occurs with the outdated boto3 version (see linked issue). This is worse than not having boto3 installed, since in that case the error message is clear ("You must pip install boto3 to fetch URIs"). Related issue number Closes ray-project#33256 Closes ray-project#34752
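For reference, this is the kind of usage the fix unblocks (a hedged example; the bucket and archive names are hypothetical):

```python
# With a recent boto3 in the Docker image, the S3 download can authenticate
# via environment variables (e.g. AWS credentials) for private buckets.
import os
import ray

ray.init(runtime_env={"working_dir": "s3://my-private-bucket/my_project.zip"})

@ray.remote
def where_am_i():
    # Runs inside the working_dir downloaded from S3.
    return os.getcwd()

print(ray.get(where_am_i.remote()))
```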
) Why are these changes needed? With telemetry tracking since Ray 2.3, we have not seen significant recent usage of the timeout=0 behaviour (screenshot of the telemetry query omitted; the raw query is behind the firewall). So we will update this behaviour as documented in ray-project#28465. cc vitrioil for the original PR: https://github.com/ray-project/ray/pull/30210/files Signed-off-by: Ricky Xu <xuchen727@hotmail.com> --------- Signed-off-by: Ricky Xu <xuchen727@hotmail.com> Co-authored-by: vitrioil <opm249@gmail.com> Co-authored-by: Prem <41074533+vitrioil@users.noreply.github.com>
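A hedged illustration of the new semantics referenced above (assuming, per ray-project#28465, that timeout=0 now means "do not block" rather than "wait forever"):

```python
import time
import ray
from ray.exceptions import GetTimeoutError

ray.init()

@ray.remote
def slow():
    time.sleep(10)
    return 1

ref = slow.remote()
try:
    # With the updated behaviour, timeout=0 returns immediately and raises
    # if the object is not yet available, instead of blocking indefinitely.
    ray.get(ref, timeout=0)
except GetTimeoutError:
    print("object not ready yet")
```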
…ect#34727) The niceness of the job supervisor should be set to 0. Signed-off-by: vitsai <victoria@anyscale.com>
…project#35176) Dependencies were previously resolved in a Python 3.7 environment; they are now resolved in a Python 3.9 environment. Also upgraded dependencies. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…e examples (ray-project#35151) Signed-off-by: Balaji Veeramani <balaji@anyscale.com> BatchPredictor isn't a recommended API for batch inference anymore. "Scalable Batch Inference with Ray" uses BatchPredictor, so we're removing it until it gets updated with the recommended APIs.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Previously, when a connection was broken, it would try to reconnect immediately. Usually when network issues happen, it takes a while to recover. This PR adds a 2s delay before initiating a reconnect to make the workload more reasonable.
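A simplified sketch of the idea (not the actual GCS client code; function names are illustrative):

```python
import time

RECONNECT_DELAY_S = 2.0  # fixed delay before each reconnect attempt

def reconnect_with_delay(connect, max_attempts=5):
    """Wait a short, fixed delay before reconnecting so transient network
    failures have time to recover instead of hammering the endpoint."""
    last_error = None
    for _ in range(max_attempts):
        time.sleep(RECONNECT_DELAY_S)
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("failed to reconnect") from last_error
```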
This package is available in the ubuntu:focal base images but not in the CUDA base images, and it may be required by downstream dependencies in our Docker ML images. Signed-off-by: Kai Fricke <kai@anyscale.com>
…ct#35150) Anyscale recently stopped supporting i3.8xlarge instance types. As a result, the pipelined_training_50_gb.aws release test -- which uses i3.8xlarge -- has been failing. This PR updates the instance type to m6i.16xlarge (a supported instance type). --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…35145) Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
…ject#35178) As a followup to ray-project#34847, allow fusing `MapOperator` -> `Repartition` operators for the shuffle repartition case (we do not support fusing for split repartition, which only uses `ShuffleTaskSpec.reduce` and thus cannot call the upstream map function passed to `ShuffleTaskSpec.map`). Signed-off-by: Scott Lee <sjl@anyscale.com>
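For context, this is the user-level pattern the fusion applies to (a hedged example; the exact plan produced may differ by version):

```python
import ray

ds = ray.data.range(1000)
ds = ds.map(lambda row: row)          # MapOperator
ds = ds.repartition(8, shuffle=True)  # shuffle Repartition: eligible for fusion
print(ds.count())
```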
…5119) This PR adds the object owner and copy metrics to the `GetNodeStats` RPC endpoint. Inlined small objects are not counted as a copy because they are not stored in the object store; when used, they are copied inline, so there is no need to count them. However, they are still counted once for ownership, for correctness, because they are actually owned by a worker. Local copies are retrieved directly from the local object manager, while owner counts require the caller to aggregate the metrics from each core worker.
After getting further feedback about confusion from some types of users, we've decided to not proceed with the Dataset -> Datastream rename for 2.5. Instead, we will retain the data structure name and just refer to it as "streaming datasets" in the copy and emphasize its streaming nature in other ways. --------- Signed-off-by: Eric Liang <ekhliang@gmail.com>
…#34990) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Ricky Xu <xuchen727@hotmail.com>
…e to port conflicts (ray-project#35127) The test_placement_group_3 test case occasionally fails. I have seen that the reason for the failure is that Redis failed to start ("Warning: Could not create server TCP listening socket ::*:49152: bind: Address already in use"). For the test case with external Redis, when starting Redis we now check whether the Redis process started successfully, and retry if it fails to start.
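A hedged sketch of that retry logic (not the actual test fixture; helper names are illustrative):

```python
import socket
import subprocess
import time

def start_redis_with_retry(max_attempts=5):
    """Start redis-server on a free port; retry on a new port if it fails,
    e.g. because the port was grabbed by another process in the meantime."""
    for _ in range(max_attempts):
        with socket.socket() as s:
            s.bind(("", 0))                 # let the OS pick a free port
            port = s.getsockname()[1]
        proc = subprocess.Popen(["redis-server", "--port", str(port)])
        time.sleep(0.5)                     # give redis a moment to bind
        if proc.poll() is None:             # still running -> bind succeeded
            return proc, port
    raise RuntimeError("could not start redis after several attempts")
```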
…m state api (ray-project#35109) Re-Revert of ray-project#34433
…ect#35122) gcs_utils.PlacementGroupTableData already contains node_id information for each bundle, but the ray.util.placement_group_table() interface in Python does not expose the node_id for each bundle. To provide this while preserving compatibility with the existing return format, a new "bundles_to_node_id" field is added to the result returned by ray.util.placement_group_table().
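A hedged example of reading the new field (the exact key layout follows the description above and may differ slightly in the released API):

```python
import ray
from ray.util import placement_group, placement_group_table

ray.init()
pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

info = placement_group_table(pg)
print(info["bundles"])                  # resource shape of each bundle
print(info.get("bundles_to_node_id"))   # bundle index -> node id (new field)
```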
And build dependencies with 3.7. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…project#34948) This PR may have caused flaky failures in the test case 'test_placement_group_3', so it was rolled back. This is a resubmitted PR. If it is confirmed that the issue was caused by this PR, I will make the necessary modifications to address the problem.
Also makes the job detail page work when accessed via the submission ID in the path. This will enable future work to link to submission-only jobs. Also fixes a bug where the Grafana dashboard dropdowns for Deployments and Replicas don't work until after the first request has been received for that replica or deployment.
…n before connection to Unity editor). (ray-project#35167)
Direct users to the new batch inference guide. Found a broken reference while doing so. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
@iycheng This is just a rebase onto the Ray master branch.
Hi Wilson, I'm interested in the WebAssembly integration with Ray. I saw you actively developing it last year, but the development seems to have stopped afterwards. I'm wondering: was it due to a priority shift, or did you find the approach didn't work out? We're also evaluating whether the WebAssembly path is feasible, and your experience would really help us. :) Thanks!
It is just a priority shift. I am not working on this repo, but I am working on something similar. We can discuss offline if you are interested in more details @zzb54321
Why are these changes needed?
Rebase to master branch
Related issue number
N/A
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.