
ChatHuggingFace + HuggingFaceEndpoint does not properly implement max_new_tokens #23586

Closed · 5 tasks done
BobMerkus opened this issue Jun 27, 2024 · 9 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@BobMerkus (Contributor) commented Jun 27, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from transformers import AutoTokenizer
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

import requests

sample = requests.get(
    "https://raw.githubusercontent.com/huggingface/blog/main/langchain.md"
).text


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")


def n_tokens(text):
    return len(tokenizer(text)["input_ids"])


print(f"The number of tokens in the sample is {n_tokens(sample)}")

llm_10 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=10,
    cache=False,
    seed=123,
)
llm_4096 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=4096,
    cache=False,
    seed=123,
)

messages = [
    (
        "system",
        "You are a smart AI that has to describe a given text in to at least 1000 characters.",
    ),
    ("user", f"Summarize the following text:\n\n{sample}\n"),
]

# native endpoint
response_10_native = llm_10.invoke(messages)
print(f"Native response 10: {n_tokens(response_10_native)} tokens")
response_4096_native = llm_4096.invoke(messages)
print(f"Native response 4096: {n_tokens(response_4096_native)} tokens")

# make sure the native responses are different lengths
assert len(response_10_native) < len(
    response_4096_native
), f"Native response 10 should be shorter than native response 4096, 10 `max_new_tokens`: {n_tokens(response_10_native)}, 4096 `max_new_tokens`: {n_tokens(response_4096_native)}"

# chat implementation from langchain_huggingface
chat_model_10 = ChatHuggingFace(llm=llm_10)
chat_model_4096 = ChatHuggingFace(llm=llm_4096)

# chat implementation for 10 tokens
response_10 = chat_model_10.invoke(messages)
print(f"Response 10: {n_tokens(response_10.content)} tokens")
actual_response_tokens_10 = response_10.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 10: {actual_response_tokens_10} tokens (always 100 for some reason!)"
)

# chat implementation for 4096 tokens
response_4096 = chat_model_4096.invoke(messages)
print(f"Response 4096: {n_tokens(response_4096.content)} tokens")
actual_response_tokens_4096 = response_4096.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 4096: {actual_response_tokens_4096} tokens (always 100 for some reason!)"
)


# assert that the responses are different lengths, which fails because the token usage is always 100
print("-" * 20)
print(f"Output for 10 tokens: {response_10.content}")
print("-" * 20)
print(f"Output for 4096 tokens: {response_4096.content}")
print("-" * 20)
assert len(response_10.content) < len(
    response_4096.content
), f"Response 10 should be shorter than response 4096, 10 `max_new_tokens`: {n_tokens(response_10.content)}, 4096 `max_new_tokens`: {n_tokens(response_4096.content)}"

This is the output from the script:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The number of tokens in the sample is 1809
Native response 10: 11 tokens
Native response 4096: 445 tokens
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Response 10: 101 tokens
Actual response 10: 100 tokens (always 100 for some reason!)
Response 4096: 101 tokens
Actual response 4096: 100 tokens (always 100 for some reason!)

--------------------
Output for 10 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------
Output for 4096 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------

Error Message and Stack Trace (if applicable)

AssertionError: Response 10 should be shorter than response 4096, 10 max_new_tokens: 101, 4096 max_new_tokens: 101

Description

There seems to be an issue when using langchain_huggingface.llms.huggingface_endpoint.HuggingFaceEndpoint together with the langchain_huggingface.chat_models.huggingface.ChatHuggingFace implementation.

When using HuggingFaceEndpoint on its own, the max_new_tokens parameter is respected, but it is not when the endpoint is wrapped in ChatHuggingFace(llm=...). The latter always returns a response of 100 completion tokens, and I have not been able to get it to behave correctly after searching the docs and source code.

I have created a reproducible example using meta-llama/Meta-Llama-3-70B-Instruct (as this model is also supported for serverless).

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
Python Version: 3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.10
langchain: 0.2.6
langchain_community: 0.2.5
langsmith: 0.1.82
langchain_anthropic: 0.1.15
langchain_aws: 0.1.7
langchain_huggingface: 0.0.3
langchain_openai: 0.1.9
langchain_text_splitters: 0.2.2
langchainhub: 0.1.20

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

dosubot bot added the 🤖:bug label on Jun 27, 2024
@keenborder786 (Contributor)

Okay, you are comparing two different things. The Hugging Face Inference Client returns a chat-completion output whose usage attribute is of type ChatCompletionOutputUsage.

The ChatCompletionOutputUsage has three types of token usage:

  1. completion_tokens: This is the number of tokens required to complete the prompt. In your case, this is always fixed because you are calling the same prompt to complete. Try something else, and it should change.
  2. prompt_tokens: The number of tokens in the prompt.
  3. total_tokens: The sum of completion_tokens and prompt_tokens.

So, you are implicitly comparing the total_tokens through the n_tokens function with completion_tokens, which is incorrect. You should compare the total_tokens attribute to make the correct comparison.

P.S. I double-checked the LangChain code and ensured that ChatHuggingFace is returning the correct ChatCompletionOutputUsage without any modification.
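
For reference, a minimal sketch (reusing the response_10 object from the example code above; this snippet is an editorial addition, not part of the original comment) of reading the usage fields directly from the response metadata instead of re-tokenizing the output text:

# The usage object surfaced in response_metadata is the ChatCompletionOutputUsage
# described above; all three fields can be read from it directly.
usage = response_10.response_metadata["token_usage"]
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)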

@BobMerkus (Contributor, Author)


I think you are misunderstanding the example code: the n_tokens() function is called on the content of the chat model's output, so completion_tokens == n_tokens(output) - 1. The extra token is the special end-of-sequence token (which is why the output says 101 tokens rather than 100). The problem is that ChatCompletionOutputUsage.completion_tokens should always be less than or equal to max_new_tokens, yet it is 100 regardless of the supplied max_new_tokens.
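
As an illustration of the invariant described here (reusing response_10 and llm_10 from the example code; this snippet is an addition, not part of the original report):

# The reported completion tokens should never exceed the configured limit, but with
# ChatHuggingFace the usage is stuck at 100 regardless of max_new_tokens.
usage_10 = response_10.response_metadata["token_usage"]
assert usage_10.completion_tokens <= llm_10.max_new_tokens, (
    f"{usage_10.completion_tokens} completion tokens with "
    f"max_new_tokens={llm_10.max_new_tokens}"
)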

@TiagoPinaC

I'm having the same problem... Did you find a solution?

@BobMerkus (Contributor, Author)


No, I have not. This issue makes the entire Hugging Face x LangChain integration unusable for me; I have been working around it by running an OpenAI-compatible web server via LlamaCpp or Ollama instead.

@michael-newsrx

I think this is a ChatHuggingFace bug.

You have to call bind on the chat object with the parameters you want changed before running invoke. They do not carry over from the HuggingFaceEndpoint object to the ChatHuggingFace object.

output_msg = chat_model.bind(max_tokens=8192, temperature=0.0).invoke(chat_sequence)
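
A slightly fuller sketch of this workaround, reusing llm_4096 and messages from the issue's example code; the assumption here is that max_tokens is the chat-completion counterpart of the endpoint's max_new_tokens:

# Bind the generation parameters on the chat model before invoking it; they are not
# inherited from the wrapped HuggingFaceEndpoint.
chat_model = ChatHuggingFace(llm=llm_4096)
bound_chat = chat_model.bind(max_tokens=4096, temperature=0.0)

response = bound_chat.invoke(messages)
print(response.response_metadata["token_usage"].completion_tokens)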

@b5y commented Oct 26, 2024

I'm also having the same problem, and the solution from @michael-newsrx doesn't work for me since the llm is invoked in create_history_aware_retriever and create_retrieval_chain to keep chat history and do RAG.

It works fine when using HuggingFaceEndpoint as the llm on its own, but if I pass the HuggingFaceEndpoint object to ChatHuggingFace as the llm argument, the response is cut off, and I need ChatHuggingFace to keep track of the chat conversation. Increasing max_new_tokens or passing max_tokens via model_args doesn't help.
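
A minimal, untested sketch of the setup described above (the model id comes from the issue's example; the token limit is a placeholder). Since bind() returns a new Runnable, one possible variant of the earlier workaround is to hand the bound chat model to the chain constructors instead of the bare model:

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=2048,  # placeholder limit
)
# bind() returns a Runnable, so the bound model could be passed to
# create_history_aware_retriever / create_stuff_documents_chain in place of
# the unbound model (untested assumption, not confirmed in this thread).
chat_model = ChatHuggingFace(llm=llm).bind(max_tokens=2048)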

Everything works great with ChatOllama, but I am planning to use ChatHuggingFace in production.

Do we have an estimated timeline for when this bug will be fixed?

CC: @baskaryan , @hwchase17 , @efriis , @eyurtsev @ccurme , @nfcampos

@SMAntony


Appreciate the workaround!

@BobMerkus (Contributor, Author)

I've created a bug fix proposal inside this PR that solves this issue: propagate HuggingFaceEndpoint config to ChatHuggingFace #27719


dosubot bot commented Jan 29, 2025

Hi, @BobMerkus. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug with the max_new_tokens parameter not functioning correctly in the ChatHuggingFace and HuggingFaceEndpoint classes.
  • User keenborder786 suggested a possible misunderstanding of token usage types, but you clarified the issue persists.
  • User michael-newsrx proposed a workaround using the bind method, but b5y noted limitations.
  • You submitted a pull request to ensure proper configuration propagation, indicating a resolution is in progress.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • Otherwise, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label on Jan 29, 2025
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 5, 2025
dosubot bot removed the stale label on Feb 5, 2025