Improved Error Messaging for Insufficient Resources and Parallel Deployment Implementation #148

Merged
merged 3 commits into from
Aug 7, 2024

@movchan74 (Contributor) commented Aug 2, 2024

Summary:
This update enhances deployment by introducing more robust error handling, especially for insufficient resource errors, and enabling parallel deployments. It aims to provide clearer feedback during deployment failures and reduce overall deployment time.

Key Changes:

  1. New Exception Classes: Added DeploymentException, InsufficientResources, and FailedDeployment classes to better categorize deployment errors.

  2. Enhanced Deployment Status Checking: Implemented a wait_for_deployment method to actively monitor deployment status, detect failures across all deployments, and detect insufficient resource errors.

  3. Improved Resource Demand Handling: Added logic to check for resource demands and provide specific error messages for insufficient resources.

  4. Refined Status Reporting: Updated the print_app_status method (formerly show_status) to display deployment status information in a more readable format.

  5. Parallel Deployment Implementation: Modified the deploy method to use serve.api._run with _blocking=False, allowing all deployments to start simultaneously. The wait_for_deployment method then monitors the status of all deployments.

  6. Deployment Process Refactoring: Restructured the deployment process to handle exceptions more gracefully and provide more informative error messages.

  7. Tests: Added a test to ensure deployment exceptions are raised correctly.
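
To illustrate items 1, 2, and 5, here is a minimal sketch of how the exception hierarchy and a polling wait loop could fit together. It is not the code from this PR: only the class names and the name wait_for_deployment come from the description above. The signature, the two callables (get_statuses, get_unmet_demands), the "HEALTHY" and "UPDATING" status strings, and the polling limits are assumptions made so the example runs without a Ray cluster.

```python
import time
from typing import Callable, Dict, List


class DeploymentException(Exception):
    """Base class for deployment errors (name from the PR, body assumed)."""


class InsufficientResources(DeploymentException):
    """Raised when no node can satisfy a pending resource request."""


class FailedDeployment(DeploymentException):
    """Raised when at least one deployment reports a failed status."""


def wait_for_deployment(
    get_statuses: Callable[[], Dict[str, str]],      # hypothetical status source
    get_unmet_demands: Callable[[], List[dict]],     # hypothetical demand source
    poll_interval: float = 0.0,
    max_polls: int = 100,
) -> None:
    """Poll until every deployment is healthy, or raise on the first problem."""
    for _ in range(max_polls):
        statuses = get_statuses()

        # Check all deployments at once, since they were started in parallel.
        failed = sorted(name for name, s in statuses.items() if s == "UNHEALTHY")
        if failed:
            raise FailedDeployment(f"Failed deployments: {', '.join(failed)}")

        # A pending demand that nothing can fulfill means waiting is pointless.
        unmet = get_unmet_demands()
        if unmet:
            raise InsufficientResources(
                f"No available node types can fulfill resource request {unmet[0]}"
            )

        if statuses and all(s == "HEALTHY" for s in statuses.values()):
            return
        time.sleep(poll_interval)

    raise DeploymentException("Timed out waiting for deployments")


# Usage: two deployments that become healthy on the second poll.
polls = iter([
    {"asr_deployment": "UPDATING", "vad_deployment": "UPDATING"},
    {"asr_deployment": "HEALTHY", "vad_deployment": "HEALTHY"},
])
wait_for_deployment(lambda: next(polls), lambda: [])
```

Because both specific errors derive from DeploymentException, a caller can catch the base class to handle any deployment failure, or catch InsufficientResources separately to print a shorter message.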

Resolves: #124 and #125

Error Output Examples:

Insufficient resources error:

Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 0.25}. Might be due to insufficient or misconfigured GPU resources.
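
As a rough sketch, a message like the one above could be built from a resource demand as follows. The function name insufficient_resources_message is hypothetical, and adding the GPU hint only when the demand includes a GPU is an assumption, not something this PR states.

```python
def insufficient_resources_message(demand: dict) -> str:
    """Format an unmet resource demand as a user-facing error line."""
    message = f"Error: No available node types can fulfill resource request {demand}."
    # Assumption: the GPU hint is only useful when a GPU was actually requested.
    if demand.get("GPU", 0) > 0:
        message += " Might be due to insufficient or misconfigured GPU resources."
    return message


print(insufficient_resources_message({"CPU": 1.0, "GPU": 0.25}))
```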

Deployment error:

============================================================
WhisperDeployment (asr_deployment)
============================================================
Status: UNHEALTHY
Message: The deployment failed to start 3 times in a row. This may be due to a problem with its constructor or initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:asr_deployment:WhisperDeployment.initialize_and_get_metadata() (pid=967293, ip=172.17.0.2, actor_id=aab745596c3b8f430fd0fc8e01000000, repr=<ray.serve._private.replica.ServeReplica:asr_deployment:WhisperDeployment object at 0x7f55009977f0>)
  File "/usr/local/python/3.10.14/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/python/3.10.14/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 631, in initialize_and_get_metadata
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 615, in initialize_and_get_metadata
    await self._user_callable_wrapper.call_reconfigure(
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 956, in call_reconfigure
    await self._call_func_or_gen(
  File "/root/.cache/pypoetry/virtualenvs/aana-vIr3-B0u-py3.10/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 869, in _call_func_or_gen
    result = await result
  File "/workspaces/aana_sdk/aana/deployments/base_deployment.py", line 198, in reconfigure
    await self.apply_config(config)
  File "/workspaces/aana_sdk/aana/deployments/whisper_deployment.py", line 146, in apply_config
    raise RuntimeError("Big failure")
RuntimeError: Big failure
============================================================
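
The banner layout above can be reproduced with a small formatter. This is a sketch only: format_deployment_status is a hypothetical helper rather than the print_app_status method from this PR, the rule width of 60 is a guess, and reading the header as a deployment name followed by an app name in parentheses is an assumption.

```python
def format_deployment_status(
    deployment: str, app: str, status: str, message: str, width: int = 60
) -> str:
    """Render one deployment's status between horizontal rules."""
    rule = "=" * width
    return "\n".join(
        [
            rule,
            f"{deployment} ({app})",
            rule,
            f"Status: {status}",
            f"Message: {message}",
            rule,
        ]
    )


print(
    format_deployment_status(
        "WhisperDeployment", "asr_deployment", "UNHEALTHY", "RuntimeError: Big failure"
    )
)
```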

@evanderiel (Collaborator) left a comment

Looks good, one question

aana/cli.py (resolved)

@HRashidi (Contributor) left a comment

What about cases where the user runs the app with the autoscaler?
Can we add a flag to bypass the interruption in this procedure, e.g. when the user turns off or restarts a server, or in similar scenarios?

@movchan74 (Contributor, Author) commented

> What about cases where the user runs the app with the autoscaler? Can we add a flag to bypass the interruption in this procedure, e.g. when the user turns off or restarts a server, or in similar scenarios?

If you deploy on a cluster, you should use Serve config files, not aana deploy, so this does not apply here.

@movchan74 movchan74 merged commit 27b22dd into main Aug 7, 2024
6 checks passed
@movchan74 movchan74 deleted the aana_deploy_improvements branch August 7, 2024 07:34
@movchan74 movchan74 mentioned this pull request Aug 14, 2024