
TorchServe quickstart chatbot example #3003

Merged
merged 9 commits into master on Mar 16, 2024

Conversation

agunapal
Collaborator

@agunapal agunapal commented Mar 6, 2024

Description

This PR enables a new TorchServe user to quickly launch a chatbot on a Mac M1/M2 with three commands:

# 1: Set HF Token as Env variable
export HUGGINGFACE_TOKEN=<Token> # get this from your HuggingFace account

# 2: Build TorchServe Image for Serving llama2-7b model with 4-bit quantization
./examples/llm/llama2/chat_app/docker/build_image.sh meta-llama/Llama-2-7b-chat-hf

# 3: Launch the streamlit app for server & client
docker run --rm -it --platform linux/amd64 -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 -p 127.0.0.1:8082:8082 -p 127.0.0.1:8084:8084 -p 127.0.0.1:8085:8085 -v <model-store>:/home/model-server/model-store pytorch/torchserve:meta-llama---Llama-2-7b-chat-hf
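
Once the container is running, you can sanity-check that the server came up. A minimal sketch, assuming the default TorchServe inference port 8080 mapped in the docker run above (Python is used here just for illustration; any HTTP client works):

# Minimal sketch (assumption): query the TorchServe inference ping endpoint
# exposed on the mapped port 8080 to confirm the server inside the container is healthy.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/ping", timeout=5) as resp:
    print(resp.read().decode())  # expect a JSON body reporting a healthy status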

Prerequisites:

  1. HuggingFace token
  2. Docker

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal marked this pull request as ready for review March 7, 2024 01:11
@agunapal agunapal requested a review from msaroufim March 7, 2024 01:11
@@ -9,6 +9,33 @@ We are using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) in
You can run this example on your laptop to understand how to use TorchServe


## Quick Start Guide
Member

We can be more ambitious and make this our new getting started

Collaborator Author

My goal is a 3-part solution:

  1. Chatbot quickstart with Streamlit -> because chatbots are popular
  2. TS multi-model app to show TS's full capability -> use this to create a video series
  3. Quick start script for common use cases with curl commands -> this can be the getting started guide

# 2: Build TorchServe Image for Serving llama2-7b model with 4-bit quantization
./examples/llm/llama2/chat_app/docker/build_image.sh meta-llama/Llama-2-7b-chat-hf

# 3: Launch the streamlit app for server & client
Member

I know it's not exactly what you might have in mind, but I was thinking this would open a terminal-based CLI.

Collaborator Author

I was thinking about how to cover various scenarios. So, my goal is a 3-part solution:

  1. Chatbot quickstart with Streamlit -> because chatbots are popular
  2. TS multi-model app to show TS's full capability -> use this to create a video series
  3. Quick start script for common use cases with curl commands -> this can be the getting started guide

RUN pip install -r /home/model-server/chat_bot/requirements.txt && huggingface-cli login --token $HUGGINGFACE_TOKEN
RUN pip uninstall torchtext torchdata torch torchvision torchaudio -y
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu --ignore-installed
RUN pip uninstall torchserve torch-model-archiver torch-workflow-archiver -y
Member

seems like a miss?

Collaborator Author

You are right. This is not needed for this example. Will clean it up.

Collaborator Author

Done

ARG MODEL_NAME
ARG HUGGINGFACE_TOKEN

USER root
Member

do you need root?

Collaborator Author

Yes, we don't have permissions to install things with the default user



def start_server():
    os.system("torchserve --start --ts-config /home/model-server/config.properties")
Member

This would show as success even if the server failed to start. Favor subprocess instead: check the return code, and then query TorchServe directly to see server health rather than using a sleep.
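
For reference, a rough sketch of that suggestion (illustrative only, not the code in this PR): launch TorchServe through subprocess so a failed start surfaces through the return code.

# Illustrative sketch only: surface a failed `torchserve --start` via the return code.
import subprocess

def start_server():
    proc = subprocess.run(
        ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties"],
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"torchserve --start failed with exit code {proc.returncode}")

As discussed below, the CLI can return before the server is actually ready, so a health check is still needed on top of this.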

Collaborator Author

Oh, good point, let me try. This command was returning immediately. I did try with ping, but that was failing as the server was not up yet.

Collaborator Author

I changed the logic, but it still doesn't work as expected. There is a slight delay between when the command returns and when the server starts. I added a check with ping, but it still needs a sleep.
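
One way around the fixed sleep (a sketch under assumptions: the default inference port 8080 and the standard /ping endpoint; the helper name is hypothetical) is to poll the ping endpoint with a bounded retry loop:

# Hypothetical helper: retry the ping endpoint until the server answers,
# instead of sleeping for a fixed interval after `torchserve --start`.
import time
import urllib.request

def wait_for_server(url="http://localhost:8080/ping", retries=30, delay=1.0):
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False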


### What to expect
This launches two Streamlit apps:
1. TorchServe Server app to start/stop TorchServe, load a model, scale workers up/down, and configure dynamic batch_size (currently llama-cpp-python doesn't support batch_size > 1)
Member

It's a bit painful to use llama-cpp here; I was hoping we could instead showcase an example with export or with MPS in eager mode.

Collaborator Author

I tried a few things:

  1. Use HF 7B models with quantization -> only supported for CUDA
  2. Use HF 7B models without quantization on CPU -> extremely slow; no one would use this
  3. Docker with MPS -> seems like this is still not supported; even PyTorch supports only CPU in Docker (MPS-Ready, ARM64 Docker Image pytorch#81224)

So, currently this seems like the best solution. Some people have tried Mistral 7B with llama-cpp-python. It's kind of mind-blowing that most existing solutions only target the GPU rich.
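
For context on why llama-cpp-python works here: it loads 4-bit quantized GGUF weights and runs them on the CPU. A minimal sketch (the model path below is hypothetical, and this is not the handler code from this PR):

# Minimal sketch (hypothetical GGUF path): load 4-bit quantized weights with
# llama-cpp-python and run a single completion on CPU.
from llama_cpp import Llama

llm = Llama(model_path="/home/model-server/model-store/llama-2-7b-chat.Q4_0.gguf")
output = llm("What is TorchServe?", max_tokens=64)
print(output["choices"][0]["text"])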

@agunapal agunapal requested a review from msaroufim March 7, 2024 19:37
@msaroufim msaroufim added this pull request to the merge queue Mar 16, 2024
Merged via the queue into master with commit d60ddb0 Mar 16, 2024
15 checks passed