[Inference] Use huggingface_hub inference client for TGI adapter #53

hanouticelina · 2024-09-05T17:08:58Z

Why this PR? What does it do?

This PR refines the existing TGI integration #52. Specifically:

Updates the TGI client from the legacy version to the current, maintained inference client (documentation here).
Adds explicit documentation for using TGI with Hugging Face Inference Endpoints.

Test Plan
I tested the refined TGI integration with both local endpoint and Hugging Face Inference Endpoints.

1. Setup the distribution
Locally, we need to first start up the TGI container:

docker run --rm -it \
     --network host \
    -e HF_TOKEN=$HF_TOKEN \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --dtype bfloat16 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
    --port 5555

Then we build a llama stack
llama stack build local-plus-tgi-inference --name 8b-instruct

This command creates a conda environment with the necessary dependencies. Then, you'll be prompted for:

TGI endpoint URL: For local endpoints: http://localhost:5555 and for Hugging Face Inference Endpoints, use the provided endpoint URL.
Hugging Face Token (if using Hugging Face Inference Endpoints).

2. Run the server
llama stack run local-plus-tgi-inference --name 8b-instruct --port 6001

3. Test the distribution
python -m llama_toolchain.inference.client localhost 6001

ashwinb · 2024-09-05T21:15:27Z

I don't think the openai-compat layer works correctly with tool calling. I specifically needed to use generate_stream() with explicit configuration for stop tokens (<|eom_id|>, <|eot_id|>) without which things don't work.

Can you show a test where you point python -m llama_toolchain.agentic_system.client to this server and show that the brave search tool calling example (Who's the president of US) works?

…k into tgi-integration

hanouticelina · 2024-09-06T16:17:30Z

I don't think the openai-compat layer works correctly with tool calling. I specifically needed to use generate_stream() with explicit configuration for stop tokens (<|eom_id|>, <|eot_id|>) without which things don't work.

Can you show a test where you point python -m llama_toolchain.agentic_system.client to this server and show that the brave search tool calling example (Who's the president of US) works?

Hi @ashwinb, you were right, we need full control of the tokens sent to the model which is not easy to get using InferenceClient.chat_completion(). I ended up using huggingface_hub.Inference_Client.text_generation(stream=True) (documentation here), which is maintained on our side and has the same behavior as text_generation.generate_stream().

I tested the brave search tool calling with the Server:

Setup and build the environnement

llama stack build adhoc \
  agentic_system=meta-reference,safety=meta-reference,memory=meta-reference-faiss,inference=remote::tgi \
 --name tgi

(works also using the distribution local-plus-tgi-inference)
2. Run the server
llama stack run adhoc --name tgi --port 6001

Note: I unabled Brave Search tool only in the agentic system client.

Test the server
python -m llama_toolchain.agentic_system.client localhost 6001

Output:

User> Search web for who was 44th President of USA?
StepType.inference> brave_search.call(query="44th President of the United States")
StepType.tool_execution> Tool:BuiltinTool.brave_search Args:{'query': '44th President of the United States'}
StepType.tool_execution> Tool:BuiltinTool.brave_search Response:{"query": "44th President of the United States", "top_k": [{"title": "President Barack Obama | Barack Obama Presidential Library", "url": "https://www.obamalibrary.gov/obamas/president-barack-obama", "description": "He accepted the Democratic Party\u2019s nomination at Invesco Stadium in Denver, Colorado on August 28, 2008. On November 4, 2008, Obama became the first African-American to be elected President. He resigned his seat in the U.S. Senate on November 16, 2008. <strong>Barack Obama</strong> was inaugurated as the ...", "type": "search_result"}, {"title": "Barack Obama | The White House", "url": "https://www.whitehouse.gov/about-the-white-house/presidents/barack-obama/", "description": "<strong>Barack Obama</strong> served as the 44th President of the United States. His story is the American story \u2014 values from the heartland, a middle-class upbringing in a strong family, hard work and education as the means of getting ahead, and the conviction that a life so blessed should be lived in service ...", "type": "search_result"}, [{"type": "video_result", "url": "https://www.instagram.com/reel/CzMZbJmObn9/", "title": "Fifteen years ago, on Nov. 4, Barack Obama was elected as ...", "description": "Welcome back to Instagram. Sign in to see what your friends, family and interests have been capturing and sharing around the world."}, {"type": "video_result", "url": "https://video.alexanderstreet.com/watch/the-44th-president-barack-obama?context=channel:barack-obama", "title": "The 44th President (Barack Obama) - Alexander Street, a ...", "description": "You need to enable JavaScript to run this app"}, {"type": "video_result", "url": "https://www.youtube.com/watch?v=iyL7_2-em5k", "title": "Barack Obama for Kids | Learn about the life and contributions ...", "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube."}, {"type": "video_result", "url": "https://www.britannica.com/video/172743/overview-Barack-Obama", "title": "President of the United States of America Barack Obama | Britannica", "description": "Barack Obama was elected the 44th president of the United States in 2008, becoming the first African American to hold the office. For most of his presidency Obama faced Republican opposition to almost all his proposals. Nevertheless, he was still able to fulfill some of his biggest campaign ..."}, {"type": "video_result", "url": "https://www.youtube.com/watch?v=rvr2g8-5dcE", "title": "The 44th President: In His Own Words - Toughest Day | Special ...", "description": "President Obama reflects on his toughest day in the Presidency and seeing Secret Service cry for the first time. Watch the premiere of The 44th President: In..."}]]}
StepType.shield_call> No Violation
StepType.inference> The 44th President of the United States was Barack Obama.

ashwinb

This is awesome, thank you so much. Just a few comments.

llama_toolchain/inference/adapters/tgi/tgi.py

…oints ones

hanouticelina · 2024-09-09T15:59:20Z

Hello @ashwinb, thanks for your review! I updated the Adapter based on your comments. I split the TGI adapter for Local or Remote TGI endpoints and Hugging Face Inference Endpoints due to differences in retrieving the model_id and other information about the model:

Local TGI deployments: We can directly use InferenceClient.get_endpoint_info() to get the model_id and max_total_tokens.
Hugging Face Inference Endpoints: The inference URL doesn't provide the model_id. Users need to provide their HF endpoint name. With these, we retrieve both the inference URL and model information (model_id and max_total_tokens).

I've tested both adapters for inference and agentic function calling. Please let me know if you have any suggestions for further improvements!

llama_toolchain/inference/adapters/tgi/__init__.py

ashwinb

Looks good. Thanks once again.

I do have one important comment about the get_namespace() method pulling in the huggingface_hub dep into the config file. Please move that helper method outside as a free floating function outside into the impl file.

ashwinb · 2024-09-11T21:44:09Z

llama_toolchain/inference/adapters/tgi/config.py

+
+from typing import Optional
+
+from huggingface_hub import HfApi


we want the config files to be extremely lightweight so one can use them outside of the larger containers. as such they should only have datatypes and no "behavior" or functions at all. I suggest you move at least the get_namespace() method out into the impl where it needs to be used so this dependency is not here.

* Use huggingface_hub inference client for TGI inference * Update the default value for TGI URL * Use InferenceClient.text_generation for TGI inference * Fixes post-review and split TGI adapter into local and Inference Endpoints ones * Update CLI reference and add typing * Rename TGI Adapter class * Use HfApi to get the namespace when not provide in the hf endpoint name * Remove unecessary method argument * Improve TGI adapter initialization condition * Move helper into impl file + fix merging conflicts

Test Plan: First, start a TGI container with `meta-llama/Llama-Guard-3-8B` model serving on port 5099. See #53 and its description for how. Then run llama-stack with the following run config: ``` image_name: safety docker_image: null conda_env: safety apis_to_serve: - models - inference - shields - safety api_providers: inference: providers: - remote::tgi safety: providers: - meta-reference telemetry: provider_id: meta-reference config: {} routing_table: inference: - provider_id: remote::tgi config: url: http://localhost:5099 api_token: null hf_endpoint_name: null routing_key: Llama-Guard-3-8B safety: - provider_id: meta-reference config: llama_guard_shield: model: Llama-Guard-3-8B excluded_categories: [] disable_input_check: false disable_output_check: false prompt_guard_shield: null routing_key: llama_guard ``` Now simply run `python -m llama_stack.apis.safety.client localhost <port>` and check that the llama_guard shield calls run correctly. (The injection_shield calls fail as expected since we have not set up a router for them.)

hanouticelina added 2 commits September 5, 2024 18:29

Use huggingface_hub inference client for TGI inference

e5bcfda

Update the default value for TGI URL

7aa5093

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 5, 2024

hanouticelina marked this pull request as ready for review September 5, 2024 20:27

ashwinb self-requested a review September 5, 2024 22:46

hanouticelina added 3 commits September 6, 2024 15:37

Merge branch 'meta-llama:main' into tgi-integration

3858d94

Use InferenceClient.text_generation for TGI inference

031dbc0

Merge branch 'tgi-integration' of github.com:hanouticelina/llama-stac…

5ab4fd3

…k into tgi-integration

ashwinb requested changes Sep 6, 2024

View reviewed changes

llama_toolchain/inference/adapters/tgi/tgi.py Outdated Show resolved Hide resolved

llama_toolchain/inference/adapters/tgi/tgi.py Show resolved Hide resolved

llama_toolchain/inference/adapters/tgi/tgi.py Outdated Show resolved Hide resolved

hanouticelina added 2 commits September 9, 2024 17:47

Fixes post-review and split TGI adapter into local and Inference Endp…

b96e705

…oints ones

Update CLI reference and add typing

eee6c69

hanouticelina added 3 commits September 9, 2024 18:30

Rename TGI Adapter class

3d660ad

Use HfApi to get the namespace when not provide in the hf endpoint name

fff1b6d

Remove unecessary method argument

2ac8e7b

Wauplin reviewed Sep 10, 2024

View reviewed changes

llama_toolchain/inference/adapters/tgi/__init__.py Outdated Show resolved Hide resolved

Improve TGI adapter initialization condition

0964b0a

ashwinb approved these changes Sep 11, 2024

View reviewed changes

hanouticelina added 2 commits September 12, 2024 15:31

Merge branch 'main' into tgi-integration

04f0b8f

Move helper into impl file + fix merging conflicts

c8808b4

hanouticelina requested review from yanxi0830, hardikjshah, dltn and raghotham as code owners September 12, 2024 14:01

ashwinb merged commit 736092f into meta-llama:main Sep 12, 2024
3 checks passed

Wauplin mentioned this pull request Sep 25, 2024

Make TGI adapter compatible with HF Inference API #97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Inference] Use huggingface_hub inference client for TGI adapter #53

[Inference] Use huggingface_hub inference client for TGI adapter #53

hanouticelina commented Sep 5, 2024 •

edited

Loading

ashwinb commented Sep 5, 2024

hanouticelina commented Sep 6, 2024

ashwinb left a comment

hanouticelina commented Sep 9, 2024 •

edited

Loading

ashwinb left a comment

ashwinb Sep 11, 2024

hanouticelina Sep 12, 2024


		from typing import Optional

		from huggingface_hub import HfApi

[Inference] Use huggingface_hub inference client for TGI adapter #53

[Inference] Use huggingface_hub inference client for TGI adapter #53

Conversation

hanouticelina commented Sep 5, 2024 • edited Loading

Why this PR? What does it do?

ashwinb commented Sep 5, 2024

hanouticelina commented Sep 6, 2024

ashwinb left a comment

Choose a reason for hiding this comment

hanouticelina commented Sep 9, 2024 • edited Loading

ashwinb left a comment

Choose a reason for hiding this comment

ashwinb Sep 11, 2024

Choose a reason for hiding this comment

hanouticelina Sep 12, 2024

Choose a reason for hiding this comment

hanouticelina commented Sep 5, 2024 •

edited

Loading

hanouticelina commented Sep 9, 2024 •

edited

Loading