rebase wasm-on-ray branch to latest master #35727
Conversation
Expected results: create a Recent Serve Applications card. We have a Serve page with a table containing the application name, import path, and status. We will create a card based on this information, following the styles of the Recent Jobs card. Different icons to show: https://files.slack.com/files-pri/TKC2KFWG3-F055XEKED98/screen_shot_2023-05-02_at_12.01.36_pm.png
In some cases, it's necessary to mock the useSWR function in Jest test cases. However, if we don't clear the SWR cache between different test cases, we may encounter an error where a test case unintentionally reuses data from a previous test case instead of creating new mock data. By implementing this fix, we can ensure that our test cases are isolated and independent, and that they accurately reflect the behavior of our code under different scenarios.
…oject#35155) Reverts ray-project#34642. This seems to be breaking all pipeline builds, as it is failing to build the base container.
…ct#35128) This PR ensures that the full trial status table is printed at the end of a Ray Tune run with the new output engine. Additionally, trial status data was previously always cut off; now we enforce that when `force=True`, all trial data is reported. It also fixes a bug in showing the `more_info` field (how many more trials with a specific status are available). Signed-off-by: Kai Fricke <kai@anyscale.com>
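As a rough illustration of the truncation rule described above (a simplified sketch, not the actual Tune output-engine code):

```python
# Simplified sketch: when force=True, print the full trial table; otherwise
# cut it off and report how many more trials there are via more_info.
def format_trial_table(rows, max_rows=20, force=False):
    if force or len(rows) <= max_rows:
        shown, more_info = rows, ""
    else:
        shown = rows[:max_rows]
        more_info = f"... and {len(rows) - max_rows} more trials"
    table = "\n".join(str(r) for r in shown)
    return table + ("\n" + more_info if more_info else "")
```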
…oject#32407) There are two issues. When an actor exits via sys.exit, exit_actor, or max_calls=1, we didn't cancel queued tasks, which means all queued tasks were still executed even though the exit APIs were called. This is unexpected and unintuitive behavior. The segfault happened when we call disconnect() in the exit_actor API while there are still queued tasks: the actor won't exit until the queued tasks are all executed, but since we called disconnect(), it breaks the worker with a segfault (disconnect is not expected to be called while actor tasks are executing). This happened even with a normal actor (not an async actor) if there were queued tasks when exit_actor was called. This PR fixes the issues by doing two things. First, if the C++ Exit API is called, we guarantee the queued tasks won't be executed; I fixed this by returning from ExecuteTask immediately. Alternatively, we could manually clean actor_scheduling_queue, but this would require much more complicated code to produce a good error message; I am open to this approach as well. Second, remove the disconnect call from the exit_actor API. It was written before 2020, the comment there seems irrelevant, and all tests pass, so it should be okay. I assume it was a hack and the issue from the comment was fixed at some point. This PR also adds two guarantees to the exit_actor APIs: once exit_actor or exit is called on an actor, no additional tasks run on that actor, and any queued or incoming requests fail with a clear error message. When the actor is terminated via exit_actor or exit, the atexit handler is guaranteed to be called (I will add tests).
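A minimal sketch of the guaranteed behavior described above (the actor and method names are hypothetical, and the exact error type may vary):

```python
import time
import ray

ray.init()

@ray.remote
class Worker:
    def do_work(self, i):
        time.sleep(1)
        return i

    def shutdown(self):
        # Once this runs, no further tasks from this actor should execute,
        # and the actor's atexit handlers are expected to be called.
        ray.actor.exit_actor()

w = Worker.remote()
in_flight = w.do_work.remote(0)                      # submitted before the exit
w.shutdown.remote()                                  # request the actor to exit
queued = [w.do_work.remote(i) for i in range(1, 4)]  # queued behind the exit

print(ray.get(in_flight))                            # completes normally
try:
    ray.get(queued)                                  # should not execute
except ray.exceptions.RayActorError as err:
    print("queued tasks failed with a clear error:", err)
```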
Add absolute file path to log files for each replica. ray-project#33503 (comment) Example: ``` replicas: - replica_id: foo_DAGDriver#jsrUNs state: RUNNING pid: 68276 actor_name: SERVE_REPLICA::foo_DAGDriver#jsrUNs actor_id: 7c1c702270bb634a7cf4c24f01000000 node_id: 568bf20e0658e89361a997fe57b896b15fcb97268f3b039e1513c6a5 node_ip: 192.168.1.14 start_time_s: 1679598497.387779 log_file_path_id: /serve/deployment_foo_DAGDriver_foo_DAGDriver#jsrUNs.log ```
…pyOpenSSL and cryptography (ray-project#33273) runtime_env working_dir S3 urls require a recent version of boto3 to read environment variables for authentication for downloading from private buckets. We currently include an outdated boto3 version in the Ray Docker images. This PR bumps the version in the Ray Docker images to make the S3 working_dir download feature work out of the box. The reason this is important is that users might try to use S3 URLs for runtime_env with the Ray Docker image, but it's hard to debug the failure that occurs with the outdated boto3 version (see linked issue). This is worse than not having boto3 installed, since in that case the error message is clear ("You must pip install boto3 to fetch URIs"). Related issue number Closes ray-project#33256 Closes ray-project#34752
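For reference, this is the kind of usage the fix unblocks (a hedged example; the bucket and archive names are hypothetical):

```python
# With a recent boto3 in the Docker image, the S3 download can authenticate
# via environment variables (e.g. AWS credentials) for private buckets.
import os
import ray

ray.init(runtime_env={"working_dir": "s3://my-private-bucket/my_project.zip"})

@ray.remote
def where_am_i():
    # Runs inside the working_dir downloaded from S3.
    return os.getcwd()

print(ray.get(where_am_i.remote()))
```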
) Why are these changes needed? With telemetry tracking since Ray 2.3, we have not seen significant recent usage of the timeout=0 behaviour (screenshot of the telemetry query omitted; the raw query is behind the firewall). So we will update this behaviour as documented in ray-project#28465. cc vitrioil for the original PR: https://github.com/ray-project/ray/pull/30210/files Signed-off-by: Ricky Xu <xuchen727@hotmail.com> --------- Signed-off-by: Ricky Xu <xuchen727@hotmail.com> Co-authored-by: vitrioil <opm249@gmail.com> Co-authored-by: Prem <41074533+vitrioil@users.noreply.github.com>
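A hedged illustration of the new semantics referenced above (assuming, per ray-project#28465, that timeout=0 now means "do not block" rather than "wait forever"):

```python
import time
import ray
from ray.exceptions import GetTimeoutError

ray.init()

@ray.remote
def slow():
    time.sleep(10)
    return 1

ref = slow.remote()
try:
    # With the updated behaviour, timeout=0 returns immediately and raises
    # if the object is not yet available, instead of blocking indefinitely.
    ray.get(ref, timeout=0)
except GetTimeoutError:
    print("object not ready yet")
```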
…ect#34727) The niceness of the job supervisor should be set to 0. Signed-off-by: vitsai <victoria@anyscale.com>
…project#35176) Dependencies were previously resolved in a Python 3.7 environment; they are now resolved in a Python 3.9 environment. Also upgraded dependencies. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…e examples (ray-project#35151) Signed-off-by: Balaji Veeramani <balaji@anyscale.com> BatchPredictor isn't a recommended API for batch inference anymore. "Scalable Batch Inference with Ray" uses BatchPredictor, so we're removing it until it gets updated with the recommended APIs.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Previously, when a connection was broken, it would try to reconnect immediately. Usually when network issues happen, it takes a while to recover. This PR adds a 2s delay before initiating a reconnect to make the workload more reasonable.
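A simplified sketch of the idea (not the actual GCS client code; function names are illustrative):

```python
import time

RECONNECT_DELAY_S = 2.0  # fixed delay before each reconnect attempt

def reconnect_with_delay(connect, max_attempts=5):
    """Wait a short, fixed delay before reconnecting so transient network
    failures have time to recover instead of hammering the endpoint."""
    last_error = None
    for _ in range(max_attempts):
        time.sleep(RECONNECT_DELAY_S)
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("failed to reconnect") from last_error
```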
This package is available in the ubuntu:focal base images but not in the CUDA base images, and it may be required by downstream dependencies in our Docker ML images. Signed-off-by: Kai Fricke <kai@anyscale.com>
…ct#35150) Anyscale recently stopped supporting i3.8xlarge instance types. As a result, the pipelined_training_50_gb.aws release test -- which uses i3.8xlarge -- has been failing. This PR updates the instance type to m6i.16xlarge (a supported instance type). --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…35145) Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
…ject#35178) As a followup to ray-project#34847, allow fusing `MapOperator` -> `Repartition` operators for the shuffle repartition case (we do not support fusing for split repartition, which only uses `ShuffleTaskSpec.reduce` and thus cannot call the upstream map function passed to `ShuffleTaskSpec.map`). Signed-off-by: Scott Lee <sjl@anyscale.com>
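For context, this is the user-level pattern the fusion applies to (a hedged example; the exact plan produced may differ by version):

```python
import ray

ds = ray.data.range(1000)
ds = ds.map(lambda row: row)          # MapOperator
ds = ds.repartition(8, shuffle=True)  # shuffle Repartition: eligible for fusion
print(ds.count())
```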
…5119) This PR adds the object owner and copy metrics to the `GetNodeStats` RPC endpoint. Inlined small objects are not counted as a copy because they are not stored in the object store; when used, they are copied inline, so there is no need to count them. However, they are still counted once for ownership, for correctness, because they are actually owned by a worker. Local copies are retrieved directly from the local object manager, while owner counts require the caller to aggregate the metrics from each core worker.
After getting further feedback about confusion from some types of users, we've decided to not proceed with the Dataset -> Datastream rename for 2.5. Instead, we will retain the data structure name and just refer to it as "streaming datasets" in the copy and emphasize its streaming nature in other ways. --------- Signed-off-by: Eric Liang <ekhliang@gmail.com>
…#34990) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Ricky Xu <xuchen727@hotmail.com>
…e to port conflicts (ray-project#35127) The test_placement_group_3 test case occasionally fails. I have seen that the reason for the failure is that Redis failed to start ("Warning: Could not create server TCP listening socket ::*:49152: bind: Address already in use"). For the test case with external Redis, when starting Redis we now check whether the Redis process started successfully, and retry if it fails to start.
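A hedged sketch of that retry logic (not the actual test fixture; helper names are illustrative):

```python
import socket
import subprocess
import time

def start_redis_with_retry(max_attempts=5):
    """Start redis-server on a free port; retry on a new port if it fails,
    e.g. because the port was grabbed by another process in the meantime."""
    for _ in range(max_attempts):
        with socket.socket() as s:
            s.bind(("", 0))                 # let the OS pick a free port
            port = s.getsockname()[1]
        proc = subprocess.Popen(["redis-server", "--port", str(port)])
        time.sleep(0.5)                     # give redis a moment to bind
        if proc.poll() is None:             # still running -> bind succeeded
            return proc, port
    raise RuntimeError("could not start redis after several attempts")
```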
…m state api (ray-project#35109) Re-Revert of ray-project#34433
…ect#35122) gcs_utils.PlacementGroupTableData already contains node_id information for each bundle, but the ray.util.placement_group_table() interface in Python does not expose the node_id for each bundle. To provide this while preserving compatibility with the existing return format, a new "bundles_to_node_id" field is added to the result returned by ray.util.placement_group_table().
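A hedged example of reading the new field (the exact key layout follows the description above and may differ slightly in the released API):

```python
import ray
from ray.util import placement_group, placement_group_table

ray.init()
pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

info = placement_group_table(pg)
print(info["bundles"])                  # resource shape of each bundle
print(info.get("bundles_to_node_id"))   # bundle index -> node id (new field)
```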
And build dependencies with 3.7. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…project#34948) This PR may have caused flaky failures in the test case 'test_placement_group_3', so it was rolled back. This is a resubmitted PR. If it is confirmed that the issue was caused by this PR, I will make the necessary modifications to address the problem.
Also makes the job detail page work when accessed via the submission ID in the path. This will enable future work to link to submission-only jobs. Also fixes a bug where the Grafana dashboard dropdowns for Deployments and Replicas don't work until after the first request has been received for that replica or deployment.
…n before connection to Unity editor). (ray-project#35167)
Direct users to the new batch inference guide. Found a broken reference while doing so. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
@iycheng This is just a rebase onto the Ray master branch.
Hi Wilson, I'm interested in the WebAssembly integration with Ray. I saw you actively developing it last year, but the development seems to have stopped afterwards. I'm wondering: was it due to a priority shift, or did you find the approach didn't work out? We're also evaluating whether the WebAssembly path is feasible, and your experience would really help us. :) Thanks!
It is just a priority shift. I am not working on this repo, but I am working on something similar. We can discuss offline if you are interested in more details @zzb54321
Why are these changes needed?
Rebase to master branch
Related issue number
N/A
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.