Improved Error Messaging for Insufficient Resources and Parallel Deployment Implementation #148

movchan74 · 2024-08-02T12:27:42Z

Summary:
This update enhances deployment by introducing more robust error handling, especially for insufficient resource errors, and enabling parallel deployments. It aims to provide clearer feedback during deployment failures and reduce overall deployment time.

Key Changes:

New Exception Classes: Added DeploymentException, InsufficientResources, and FailedDeployment classes to better categorize deployment errors.
Enhanced Deployment Status Checking: Implemented a wait_for_deployment method to actively monitor deployment status, detect failures across all deployments, and detect insufficient resource errors.
Improved Resource Demand Handling: Added logic to check for resource demands and provide specific error messages for insufficient resources.
Refined Status Reporting: Updated the print_app_status method (formerly show_status) to display deployment status information in a more readable format.
Parallel Deployment Implementation: Modified the deploy method to use serve.api._run with _blocking=False, allowing all deployments to start simultaneously. The wait_for_deployment method then monitors the status of all deployments.
Deployment Process Refactoring: Restructured the deployment process to handle exceptions more gracefully and provide more informative error messages.
Tests: Added a test to ensure deployment exceptions are raised correctly.
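
Sketch of the monitoring loop (illustrative only, not the code from this PR): the class names DeploymentException, InsufficientResources, and FailedDeployment come from the list above, but the inheritance between them is an assumption. The get_statuses and get_unmet_demands callables, the timeout handling, and the status strings (modeled on the "Status: UNHEALTHY" output in this description) are hypothetical stand-ins for whatever the real wait_for_deployment queries.

```python
import time
from typing import Callable, Dict, List


class DeploymentException(Exception):
    """Base class for deployment errors (hierarchy is an assumption)."""


class InsufficientResources(DeploymentException):
    """The cluster cannot satisfy a pending resource request."""


class FailedDeployment(DeploymentException):
    """At least one deployment reported a failed status."""


def wait_for_deployment(
    get_statuses: Callable[[], Dict[str, str]],    # hypothetical: name -> status
    get_unmet_demands: Callable[[], List[dict]],   # hypothetical: pending demands
    poll_interval: float = 0.1,
    timeout: float = 5.0,
) -> None:
    """Poll until every deployment is HEALTHY, or raise on the first failure."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Resource problems are checked first so they get a specific error.
        demands = get_unmet_demands()
        if demands:
            raise InsufficientResources(f"Unmet resource demands: {demands}")

        # All deployments were started without blocking, so check them together.
        statuses = get_statuses()
        failed = [name for name, status in statuses.items() if status == "UNHEALTHY"]
        if failed:
            raise FailedDeployment(f"Deployments failed: {', '.join(failed)}")

        if statuses and all(status == "HEALTHY" for status in statuses.values()):
            return
        time.sleep(poll_interval)
    raise DeploymentException("Timed out waiting for deployments")
```

Because every deployment is started before the loop begins, a single failure surfaces as soon as it is reported instead of after the preceding deployments have finished starting.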

Resolves: #124 and #125

Error Output Examples:

Insufficient resources error:

Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 0.25}. Might be due to insufficient or misconfigured GPU resources.
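
For illustration, a message of this shape could be assembled from the unmet resource demand as follows. This is a guess at the structure, not the PR's code: the function name insufficient_resources_message is hypothetical, and the rule of adding the GPU hint only when the demand includes a GPU is an assumption.

```python
from typing import Dict


def insufficient_resources_message(demand: Dict[str, float]) -> str:
    """Build an error message for a resource demand no node type can satisfy."""
    message = f"Error: No available node types can fulfill resource request {demand}."
    # Assumption: the GPU hint is appended only when a GPU was requested.
    if demand.get("GPU", 0) > 0:
        message += " Might be due to insufficient or misconfigured GPU resources."
    return message


print(insufficient_resources_message({"CPU": 1.0, "GPU": 0.25}))
```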

Deployment error:

============================================================
WhisperDeployment (asr_deployment)
============================================================
Status: UNHEALTHY
Message: The deployment failed to start 3 times in a row. This may be due to a problem with its constructor or initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:asr_deployment:WhisperDeployment.initialize_and_get_metadata() (pid=967293, ip=172.17.0.2, actor_id=aab745596c3b8f430fd0fc8e01000000, repr=<ray.serve._private.replica.ServeReplica:asr_deployment:WhisperDeployment object at 0x7f55009977f0>)
  File "/usr/local/python/3.10.14/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/python/3.10.14/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 631, in initialize_and_get_metadata
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 615, in initialize_and_get_metadata
    await self._user_callable_wrapper.call_reconfigure(
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 956, in call_reconfigure
    await self._call_func_or_gen(
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 869, in _call_func_or_gen
    result = await result
  File "/workspaces/aana_sdk/aana/deployments/base_deployment.py", line 198, in reconfigure
    await self.apply_config(config)
  File "/workspaces/aana_sdk/aana/deployments/whisper_deployment.py", line 146, in apply_config
    raise RuntimeError("Big failure")
RuntimeError: Big failure
============================================================
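
The layout of the block above (a rule, the deployment and app names, a rule, the status and message, a closing rule) can be reproduced with a small formatter. This is a sketch, not the actual print_app_status implementation: the function name format_deployment_status, its parameters, and the rule width of 60 are assumptions.

```python
def format_deployment_status(
    app_name: str,
    deployment_name: str,
    status: str,
    message: str = "",
    width: int = 60,  # assumed rule width
) -> str:
    """Render one deployment's status as a banner-style text block."""
    rule = "=" * width
    lines = [rule, f"{deployment_name} ({app_name})", rule, f"Status: {status}"]
    # The message (often a multi-line traceback) is shown only when present.
    if message:
        lines.append(f"Message: {message}")
    lines.append(rule)
    return "\n".join(lines)


print(
    format_deployment_status(
        "asr_deployment", "WhisperDeployment", "UNHEALTHY", "RuntimeError: Big failure"
    )
)
```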

evanderiel

Looks good, one question

aana/cli.py

HRashidi

Do you know about cases where the user uses the autoscaler for the app?
Can we add a flag to this procedure to bypass the interruption, in case the user turns off or restarts a server, or in other scenarios?

movchan74 · 2024-08-06T14:16:37Z

Do you know about cases where the user uses the autoscaler for the app? Can we add a flag to this procedure to bypass the interruption, in case the user turns off or restarts a server, or in other scenarios?

If you deploy on a cluster, you should use Serve config files, not aana deploy, so this case is not relevant here.

Better error messaging for insufficient resources, parallel deployment.

30e1ae3

This was linked to issues Aug 2, 2024

[ENHANCEMENT] Enable Parallel Deployments #124

Closed

[ENHANCEMENT] Improve Error Messaging for Insufficient Resources in Aana Deploy #125

Closed

movchan74 requested review from HRashidi and evanderiel August 2, 2024 12:34

movchan74 marked this pull request as ready for review August 2, 2024 12:34

evanderiel approved these changes Aug 5, 2024

Aleksandr Movchan added 2 commits August 5, 2024 10:45

Raise exceptions in aana.deploy. Added tests for deploy errors.

8eb1f57

Merge branch 'main' into aana_deploy_improvements

3dc13b1

movchan74 requested a review from evanderiel August 5, 2024 11:53

evanderiel approved these changes Aug 6, 2024

HRashidi approved these changes Aug 6, 2024

movchan74 merged commit 27b22dd into main Aug 7, 2024
6 checks passed

movchan74 deleted the aana_deploy_improvements branch August 7, 2024 07:34

movchan74 mentioned this pull request Aug 14, 2024

Status Endpoint #164

Merged
