
[Inference] Use huggingface_hub inference client for TGI adapter #53

Merged
13 commits merged into meta-llama:main on Sep 12, 2024

Conversation

hanouticelina
Contributor

@hanouticelina hanouticelina commented Sep 5, 2024

Why this PR? What does it do?

This PR refines the existing TGI integration #52. Specifically:

  • Updates the TGI client from the legacy version to the current, maintained inference client (documentation here).
  • Adds explicit documentation for using TGI with Hugging Face Inference Endpoints.

Test Plan
I tested the refined TGI integration with both a local endpoint and Hugging Face Inference Endpoints.

1. Set up the distribution
Locally, we first need to start the TGI container:

docker run --rm -it \
    --network host \
    -e HF_TOKEN=$HF_TOKEN \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --dtype bfloat16 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 5555
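
Before building the stack, it's worth sanity-checking that the container is actually serving. A quick, throwaway check (my own sketch, assuming the huggingface_hub InferenceClient API and the port from the command above):

```
from huggingface_hub import InferenceClient

# Point the client directly at the local TGI container started above.
client = InferenceClient(model="http://localhost:5555")

# Any short generation confirms the server is up and the model is loaded.
print(client.text_generation("Hello, ", max_new_tokens=16))
```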

Then we build a llama stack:
llama stack build local-plus-tgi-inference --name 8b-instruct

This command creates a conda environment with the necessary dependencies. Then, you'll be prompted for:

  • TGI endpoint URL: for local endpoints, use http://localhost:5555; for Hugging Face Inference Endpoints, use the provided endpoint URL.
  • Hugging Face Token (if using Hugging Face Inference Endpoints).

2. Run the server
llama stack run local-plus-tgi-inference --name 8b-instruct --port 6001

3. Test the distribution
python -m llama_toolchain.inference.client localhost 6001

@facebook-github-bot facebook-github-bot added the CLA Signed label (This label is managed by the Meta Open Source bot.) Sep 5, 2024
@hanouticelina hanouticelina marked this pull request as ready for review September 5, 2024 20:27
@ashwinb
Contributor

ashwinb commented Sep 5, 2024

I don't think the openai-compat layer works correctly with tool calling. I specifically needed to use generate_stream() with explicit configuration for stop tokens (<|eom_id|>, <|eot_id|>) without which things don't work.

Can you show a test where you point python -m llama_toolchain.agentic_system.client to this server and show that the brave search tool calling example (Who's the president of US) works?

@ashwinb ashwinb self-requested a review September 5, 2024 22:46
@hanouticelina
Contributor Author

I don't think the openai-compat layer works correctly with tool calling. I specifically needed to use generate_stream() with explicit configuration for stop tokens (<|eom_id|>, <|eot_id|>) without which things don't work.

Can you show a test where you point python -m llama_toolchain.agentic_system.client to this server and show that the brave search tool calling example (Who's the president of US) works?

Hi @ashwinb, you were right, we need full control of the tokens sent to the model, which is not easy to get using InferenceClient.chat_completion(). I ended up using huggingface_hub.InferenceClient.text_generation(stream=True) (documentation here), which is maintained on our side and has the same behavior as text_generation's generate_stream().
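
Concretely, the call pattern is roughly the following (a simplified sketch rather than the actual adapter code; the URL and prompt are placeholders, and it assumes InferenceClient.text_generation's stream and stop_sequences parameters):

```
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:5555")

# Stream raw tokens with the Llama 3.1 stop tokens passed explicitly,
# mirroring what generate_stream() did with the legacy client.
for token in client.text_generation(
    "<formatted prompt built by the toolchain>",
    max_new_tokens=512,
    stop_sequences=["<|eom_id|>", "<|eot_id|>"],
    stream=True,
):
    print(token, end="")
```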

I tested the brave search tool calling with the Server:

  1. Set up and build the environment
llama stack build adhoc \
  agentic_system=meta-reference,safety=meta-reference,memory=meta-reference-faiss,inference=remote::tgi \
  --name tgi

(this also works with the local-plus-tgi-inference distribution)
  2. Run the server
llama stack run adhoc --name tgi --port 6001

Note: I enabled only the Brave Search tool in the agentic system client.

  3. Test the server
    python -m llama_toolchain.agentic_system.client localhost 6001

Output:

User> Search web for who was 44th President of USA?
StepType.inference> brave_search.call(query="44th President of the United States")
StepType.tool_execution> Tool:BuiltinTool.brave_search Args:{'query': '44th President of the United States'}
StepType.tool_execution> Tool:BuiltinTool.brave_search Response:{"query": "44th President of the United States", "top_k": [{"title": "President Barack Obama | Barack Obama Presidential Library", "url": "https://www.obamalibrary.gov/obamas/president-barack-obama", "description": "He accepted the Democratic Party\u2019s nomination at Invesco Stadium in Denver, Colorado on August 28, 2008. On November 4, 2008, Obama became the first African-American to be elected President. He resigned his seat in the U.S. Senate on November 16, 2008. <strong>Barack Obama</strong> was inaugurated as the ...", "type": "search_result"}, {"title": "Barack Obama | The White House", "url": "https://www.whitehouse.gov/about-the-white-house/presidents/barack-obama/", "description": "<strong>Barack Obama</strong> served as the 44th President of the United States. His story is the American story \u2014 values from the heartland, a middle-class upbringing in a strong family, hard work and education as the means of getting ahead, and the conviction that a life so blessed should be lived in service ...", "type": "search_result"}, [{"type": "video_result", "url": "https://www.instagram.com/reel/CzMZbJmObn9/", "title": "Fifteen years ago, on Nov. 4, Barack Obama was elected as ...", "description": "Welcome back to Instagram. Sign in to see what your friends, family and interests have been capturing and sharing around the world."}, {"type": "video_result", "url": "https://video.alexanderstreet.com/watch/the-44th-president-barack-obama?context=channel:barack-obama", "title": "The 44th President (Barack Obama) - Alexander Street, a ...", "description": "You need to enable JavaScript to run this app"}, {"type": "video_result", "url": "https://www.youtube.com/watch?v=iyL7_2-em5k", "title": "Barack Obama for Kids | Learn about the life and contributions ...", "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube."}, {"type": "video_result", "url": "https://www.britannica.com/video/172743/overview-Barack-Obama", "title": "President of the United States of America Barack Obama | Britannica", "description": "Barack Obama was elected the 44th president of the United States in 2008, becoming the first African American to hold the office. For most of his presidency Obama faced Republican opposition to almost all his proposals. Nevertheless, he was still able to fulfill some of his biggest campaign ..."}, {"type": "video_result", "url": "https://www.youtube.com/watch?v=rvr2g8-5dcE", "title": "The 44th President: In His Own Words - Toughest Day | Special ...", "description": "President Obama reflects on his toughest day in the Presidency and seeing Secret Service cry for the first time. Watch the premiere of The 44th President: In..."}]]}
StepType.shield_call> No Violation
StepType.inference> The 44th President of the United States was Barack Obama.

Contributor

@ashwinb ashwinb left a comment

This is awesome, thank you so much. Just a few comments.

Review threads on llama_toolchain/inference/adapters/tgi/tgi.py (resolved)
@hanouticelina
Contributor Author

hanouticelina commented Sep 9, 2024

Hello @ashwinb, thanks for your review! I updated the adapter based on your comments. I split the TGI adapter into one for local or remote TGI endpoints and one for Hugging Face Inference Endpoints, due to differences in retrieving the model_id and other information about the model:

  • Local TGI deployments: We can directly use InferenceClient.get_endpoint_info() to get the model_id and max_total_tokens.

  • Hugging Face Inference Endpoints: The inference URL alone doesn't expose the model_id, so users provide their HF endpoint name instead; from that, we retrieve both the inference URL and the model information (model_id and max_total_tokens).

I've tested both adapters for inference and agentic function calling. Please let me know if you have any suggestions for further improvements!
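
For reference, the difference between the two adapters boils down to roughly this (a simplified sketch with placeholder function names, not the adapter code itself; it assumes InferenceClient.get_endpoint_info() and HfApi.get_inference_endpoint()):

```
from huggingface_hub import HfApi, InferenceClient

def local_tgi_info(url: str) -> dict:
    # Local or remote TGI: the endpoint itself reports model_id, max_total_tokens, etc.
    return InferenceClient(model=url).get_endpoint_info()

def hf_inference_endpoint_info(endpoint_name: str, token: str) -> dict:
    # HF Inference Endpoints: resolve the endpoint by name first, then query it.
    endpoint = HfApi(token=token).get_inference_endpoint(endpoint_name)
    return InferenceClient(model=endpoint.url, token=token).get_endpoint_info()
```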

Contributor

@ashwinb ashwinb left a comment

Looks good. Thanks once again.

I do have one important comment: the get_namespace() method pulls the huggingface_hub dependency into the config file. Please move that helper out of the config and make it a free-floating function in the impl file.


from typing import Optional

from huggingface_hub import HfApi
Contributor

We want the config files to be extremely lightweight so one can use them outside of the larger containers. As such, they should only have datatypes and no "behavior" or functions at all. I suggest you move at least the get_namespace() method out into the impl where it is used, so this dependency doesn't live here.
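
e.g. something along these lines (just a sketch of what I mean; the exact helper body is up to you):

```
# In the impl file (tgi.py), not the config module, so the config keeps no
# huggingface_hub dependency.
from huggingface_hub import HfApi

def get_namespace(hf_endpoint_name: str) -> str:
    # Use the namespace from the endpoint name when given, otherwise fall back
    # to the authenticated user's namespace via HfApi.
    if "/" in hf_endpoint_name:
        return hf_endpoint_name.split("/")[0]
    return HfApi().whoami()["name"]
```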

Contributor Author

Fixed ✅

@ashwinb ashwinb merged commit 736092f into meta-llama:main Sep 12, 2024
3 checks passed
yanxi0830 pushed a commit that referenced this pull request Sep 12, 2024
* Use huggingface_hub inference client for TGI inference

* Update the default value for TGI URL

* Use InferenceClient.text_generation for TGI inference

* Fixes post-review and split TGI adapter into local and Inference Endpoints ones

* Update CLI reference and add typing

* Rename TGI Adapter class

* Use HfApi to get the namespace when not provided in the HF endpoint name

* Remove unnecessary method argument

* Improve TGI adapter initialization condition

* Move helper into impl file + fix merging conflicts
ashwinb added a commit that referenced this pull request Sep 25, 2024
Test Plan:

First, start a TGI container with `meta-llama/Llama-Guard-3-8B` model
serving on port 5099. See #53 and its
description for how.

Then run llama-stack with the following run config:

```
image_name: safety
docker_image: null
conda_env: safety
apis_to_serve:
- models
- inference
- shields
- safety
api_providers:
  inference:
    providers:
    - remote::tgi
  safety:
    providers:
    - meta-reference
  telemetry:
    provider_id: meta-reference
    config: {}
routing_table:
  inference:
  - provider_id: remote::tgi
    config:
      url: http://localhost:5099
      api_token: null
      hf_endpoint_name: null
    routing_key: Llama-Guard-3-8B
  safety:
  - provider_id: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-8B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield: null
    routing_key: llama_guard
```

Now simply run `python -m llama_stack.apis.safety.client localhost
<port>` and check that the llama_guard shield calls run correctly. (The
injection_shield calls fail as expected since we have not set up a
router for them.)