
Add service deployment instructions to stable diffusion template #37645

Closed
wants to merge 57 commits

Conversation

akshay-anyscale
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

bveeramani and others added 30 commits June 30, 2023 17:01
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: can <can@anyscale.com>
…Request` object (#37040) (#37057)

- Update test to use new API.
- Clean up test output: disable access log, print results using `print` instead of logger (which wasn't logged to stdout).
- Don't pass the raw starlette request object (it isn't serializable).
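A minimal sketch of that last point (handler and function names are hypothetical): extract a serializable payload inside the HTTP handler and pass only that along, rather than the `Request` itself.

```python
import ray
from ray import serve
from starlette.requests import Request


@ray.remote
def run_inference(prompt: str) -> str:
    # Plain strings/dicts are serializable; a starlette Request is not.
    return f"result for: {prompt}"


@serve.deployment
class Ingress:
    async def __call__(self, request: Request) -> str:
        payload = await request.json()  # pull out serializable fields here
        return await run_inference.remote(payload["prompt"])


app = Ingress.bind()
```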
…37064) (#37066)

Fixes the user experience so that readers don't face a paywall.
Code snippets for Batch Inference and Hyperparameter Tuning needed minor fixes (typos, indentation).
…37097) (#37101)

This appears to be the same issue as #36990

Pinning the version in the install in `test_backwards_compatibility.sh`

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
…) (#37110)

This PR adds metrics for the object size distribution to help users understand how objects are used in the script.
…n test larger (#37127) (#37154)

Signed-off-by: Avnish <avnishnarayan@gmail.com>
…37167)

Serve has recently added streaming and WebSocket support. This change adds end-to-end examples to guide users through these features.

Link to documentation: https://anyscale-ray--36961.com.readthedocs.build/en/36961/serve/tutorials/streaming.html

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
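For context, a minimal streaming sketch along the lines of the linked tutorial (the tutorial is the authoritative version; names here are illustrative):

```python
import time
from typing import Generator

from fastapi import FastAPI
from starlette.responses import StreamingResponse

from ray import serve

fastapi_app = FastAPI()


@serve.deployment
@serve.ingress(fastapi_app)
class Streamer:
    @fastapi_app.get("/stream")
    def stream(self) -> StreamingResponse:
        def chunks() -> Generator[str, None, None]:
            # Yield pieces of the response as they become available
            # instead of buffering the whole thing.
            for i in range(5):
                yield f"token {i}\n"
                time.sleep(0.1)

        return StreamingResponse(chunks(), media_type="text/plain")


app = Streamer.bind()
# serve.run(app)  # then: curl http://localhost:8000/stream
```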
Signed-off-by: sven1977 <svenmika1977@gmail.com>
This fixes an error caused by the default batch format of Ray Data changing to numpy. We need to manually specify pandas.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
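A small sketch of the fix (the dataset contents are made up): request the pandas batch format explicitly when the UDF expects a DataFrame.

```python
import pandas as pd
import ray

ds = ray.data.from_items([{"value": i} for i in range(8)])


def add_one(batch: pd.DataFrame) -> pd.DataFrame:
    # Without batch_format="pandas", batches now arrive as dicts of
    # numpy arrays by default.
    batch["value"] += 1
    return batch


ds = ds.map_batches(add_one, batch_format="pandas")
print(ds.take(2))
```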
Cache the computed schema to avoid re-executing.

Closes #37077.
…syncio (#37062) (#37200)

The last ref returned by a streaming generator is a sentinel ObjectRef that contains the end-of-stream error. This PR suppresses the asyncio error that the exception is never retrieved (which is expected).
Related issue number

Closes #36956.

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Currently we rely on the client to wait for all resources to be torn down before shutting off the controller. If the client interrupts the process, this can cause an incomplete shutdown. This PR moves the shutdown logic into the event loop, triggered by a `_shutting_down` flag on the controller, so even if the client interrupts the process, the controller will continue to shut down all resources and then kill itself.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Gene Der Su <e870252314@gmail.com>
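A rough sketch of the pattern described above (not the actual Serve controller code): the client only sets a flag, and the controller's own event loop drives the teardown.

```python
import asyncio


class Controller:
    def __init__(self):
        self._shutting_down = False

    def graceful_shutdown(self):
        # Called by the client; returns immediately. Even if the client
        # then interrupts its own process, teardown still completes.
        self._shutting_down = True

    async def run_control_loop(self):
        while True:
            if self._shutting_down:
                await self._shutdown_all_resources()
                break
            await asyncio.sleep(0.1)

    async def _shutdown_all_resources(self):
        ...  # tear down replicas, proxies, etc., then exit the process
```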
…#37219)

`FunctionTrainable.restore_from_object` creates a temporary checkpoint directory.

This directory is kept around because we don't control how the user interacts with the checkpoint - they might load it several times, or not at all.

Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary object around anymore. 

In this PR, we add functionality to remove these temporary directories. Additionally, we adjust the number of checkpoints to keep in `pytorch_pbt_failure` to 10 to reduce disk pressure in the release test. It looks like this disk pressure led to recent failures of the test. By reducing the total number of checkpoints kept and fixing the issue with temporary directories, we should see much less disk usage.

Signed-off-by: Kai Fricke <kai@anyscale.com>
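Roughly, the cleanup described above looks like this (illustrative helper names, not the actual Tune internals):

```python
import shutil
import tempfile


def restore_from_object(checkpoint_bytes: bytes) -> str:
    # Unpack the checkpoint into a temporary directory for the user to load.
    tmp_dir = tempfile.mkdtemp(prefix="checkpoint_tmp_")
    ...  # write checkpoint contents into tmp_dir
    return tmp_dir


def on_new_checkpoint_tracked(previous_tmp_dir: str) -> None:
    # Once a newer checkpoint is tracked, the old temporary directory can
    # be deleted to relieve disk pressure.
    shutil.rmtree(previous_tmp_dir, ignore_errors=True)
```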
…y for cluster state reporting #37132 (#37176)

Why are these changes needed?
The labels are declared as strings, and placement groups will generate (anti-)affinity labels. The current implementation generates `_PG_<binary_pg_id>` as the label key. However, binary characters are not encodable in a string.

This PR changes the PG-generated dynamic labels to `_PG_<hex_pg_id>`, which is also more readable.
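An illustration of the encoding change (the PG ID here is a stand-in):

```python
import uuid

pg_id = uuid.uuid4().bytes  # stand-in for a placement group ID

# Raw binary bytes can't be stored in a string label key, while the
# hex-encoded form is both valid and human-readable.
hex_label_key = "_PG_" + pg_id.hex()  # e.g. "_PG_3f9c2a..."
print(hex_label_key)
```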
…pdated periodically (#37121) (#37175)

Why are these changes needed?
It was assumed that resource updates are broadcast periodically (which isn't the case), so the idle time wasn't updated while a node stayed idle.

This PR makes the raylet send its last idle time (if idle) to the GCS, and allows the GCS to calculate the duration.
---------

Signed-off-by: rickyyx <rickyx@anyscale.com>
📖 Doctest (CPU) fails 25% of runs due to a few flaky tests. This PR deflakes those tests.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…7284)

Signed-off-by: pdmurray <peynmurray@gmail.com>
Co-authored-by: Peyton Murray <peynmurray@gmail.com>
The following examples already use updated APIs:
* Stable Diffusion Batch Prediction with Ray AIR
* GPT-J-6B Batch Prediction with Ray AIR (LLM)

The following examples have been updated to use updated APIs:
* Training a model with distributed XGBoost
* Training a model with distributed LightGBM

I've removed batch prediction sections from the other examples, and, where appropriate, linked to the batch inference user guide.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
#37301)

* [Core] Fix the race condition where grpc requests are handled while core worker not yet initialized (#37117)

Why are these changes needed?
There is a race condition where the gRPC server starts handling requests before the core worker is initialized. This PR fixes it by waiting for initialization before handling any gRPC request.

* update
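The actual fix is in the C++ core worker, but the pattern is roughly this (Python sketch with made-up names): requests block until initialization has completed.

```python
import threading


class CoreWorkerServicer:
    def __init__(self):
        self._initialized = threading.Event()

    def mark_initialized(self):
        # Called once core worker setup has finished.
        self._initialized.set()

    def HandleSomeRequest(self, request, context):
        # Wait here so no request ever observes a partially
        # constructed core worker.
        self._initialized.wait()
        return self._do_handle(request)

    def _do_handle(self, request):
        ...
```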
#37173 changed a test in a previous iteration that is now failing after additional changes. This PR reverts the changes to the test to fix broken master.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Retrieve the token from the GCS server in the GCS client while connecting, to attach to metadata in requests.

Previous PR (GCS server): #36535
Next PR (auth): #36073
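In gRPC terms, attaching the token looks roughly like this (illustrative sketch, not the actual GCS client code):

```python
def call_with_token(stub_method, request, token: str):
    # The client fetches the token once while connecting to the GCS, then
    # sends it as request metadata on every subsequent call.
    return stub_method(request, metadata=(("authorization", token),))
```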
…ockMetadata` (#37119) (#37263)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
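A sketch of the calculation (assuming the internal `BlockMetadata.exec_stats.wall_time_s` field):

```python
def stage_execution_time_s(block_metadata) -> float:
    # Use the longest-running block's wall time as the stage's execution
    # time, rather than the whole Dataset's total execution time.
    return max(
        meta.exec_stats.wall_time_s
        for meta in block_metadata
        if meta.exec_stats is not None
    )
```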
…ons in `DataIterator.iter_batches()` (#36842) (#37260)

Currently, the `prefetch_batches` arg of `Dataset.iter_batches` configures the number of preloaded batches on both the CPU and GPU; therefore, in the typical case where there is much more CPU than GPU capacity, this constrains the number of batches that can be prefetched on the CPU.

This PR adds a separate parameter, `_finalize_fn`, a user-defined function executed in a separate threadpool, which allows these steps to be parallelized. For example, this can be useful for host-to-device transfers as the last step of getting a batch; this is the default `_finalize_fn` used when `_collate_fn` is not specified. Note that when `_collate_fn` is provided by the user, they should also handle the host-to-device transfer themselves outside of `_collate_fn` in order to maximize performance.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Co-authored-by: amogkam <amogkamsetty@yahoo.com>
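A hypothetical usage sketch of the arguments described above (`_collate_fn` and `_finalize_fn` are private and subject to change; the CUDA transfer assumes a GPU is available):

```python
import ray
import torch

ds = ray.data.from_items([{"x": float(i)} for i in range(32)])

for batch in ds.iterator().iter_batches(
    batch_size=8,
    prefetch_batches=4,  # CPU-side prefetching, no longer bounded by the GPU
    _collate_fn=lambda batch: torch.as_tensor(batch["x"]),
    _finalize_fn=lambda t: t.to("cuda"),  # device transfer runs in its own threadpool
):
    ...
```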