
ChatHuggingFace + HuggingFaceEndpoint does not properly implement max_new_tokens #23586

Closed · 5 tasks done
BobMerkus opened this issue Jun 27, 2024 · 9 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@BobMerkus (Contributor) commented Jun 27, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from transformers import AutoTokenizer
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

import requests

sample = requests.get(
    "https://raw.githubusercontent.com/huggingface/blog/main/langchain.md"
).text


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")


def n_tokens(text):
    return len(tokenizer(text)["input_ids"])


print(f"The number of tokens in the sample is {n_tokens(sample)}")

llm_10 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=10,
    cache=False,
    seed=123,
)
llm_4096 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=4096,
    cache=False,
    seed=123,
)

messages = [
    (
        "system",
        "You are a smart AI that has to describe a given text in to at least 1000 characters.",
    ),
    ("user", f"Summarize the following text:\n\n{sample}\n"),
]

# native endpoint
response_10_native = llm_10.invoke(messages)
print(f"Native response 10: {n_tokens(response_10_native)} tokens")
response_4096_native = llm_4096.invoke(messages)
print(f"Native response 4096: {n_tokens(response_4096_native)} tokens")

# make sure the native responses are different lengths
assert len(response_10_native) < len(
    response_4096_native
), f"Native response 10 should be shorter than native response 4096, 10 `max_new_tokens`: {n_tokens(response_10_native)}, 4096 `max_new_tokens`: {n_tokens(response_4096_native)}"

# chat implementation from langchain_huggingface
chat_model_10 = ChatHuggingFace(llm=llm_10)
chat_model_4096 = ChatHuggingFace(llm=llm_4096)

# chat implementation for 10 tokens
response_10 = chat_model_10.invoke(messages)
print(f"Response 10: {n_tokens(response_10.content)} tokens")
actual_response_tokens_10 = response_10.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 10: {actual_response_tokens_10} tokens (always 100 for some reason!)"
)

# chat implementation for 4096 tokens
response_4096 = chat_model_4096.invoke(messages)
print(f"Response 4096: {n_tokens(response_4096.content)} tokens")
actual_response_tokens_4096 = response_4096.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 4096: {actual_response_tokens_4096} tokens (always 100 for some reason!)"
)


# assert that the responses are different lengths, which fails because the token usage is always 100
print("-" * 20)
print(f"Output for 10 tokens: {response_10.content}")
print("-" * 20)
print(f"Output for 4096 tokens: {response_4096.content}")
print("-" * 20)
assert len(response_10.content) < len(
    response_4096.content
), f"Response 10 should be shorter than response 4096, 10 `max_new_tokens`: {n_tokens(response_10.content)}, 4096 `max_new_tokens`: {n_tokens(response_4096.content)}"

This is the output from the script:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The number of tokens in the sample is 1809
Native response 10: 11 tokens
Native response 4096: 445 tokens
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Response 10: 101 tokens
Actual response 10: 100 tokens (always 100 for some reason!)
Response 4096: 101 tokens
Actual response 4096: 100 tokens (always 100 for some reason!)

--------------------
Output for 10 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------
Output for 4096 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------

Error Message and Stack Trace (if applicable)

AssertionError: Response 10 should be shorter than response 4096, 10 max_new_tokens: 101, 4096 max_new_tokens: 101

Description

There seems to be an issue when using langchain_huggingface.llms.huggingface_endpoint.HuggingFaceEndpoint together with the langchain_huggingface.chat_models.huggingface.ChatHuggingFace implementation.

When using HuggingFaceEndpoint on its own, the max_new_tokens parameter is respected, but it is not when the endpoint is wrapped in ChatHuggingFace(llm=...). The latter always returns a response of 100 completion tokens, and I have not been able to get it to behave correctly after searching the docs and source code.

I have created a reproducible example using meta-llama/Meta-Llama-3-70B-Instruct (as this model is also supported for serverless).

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
Python Version: 3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.10
langchain: 0.2.6
langchain_community: 0.2.5
langsmith: 0.1.82
langchain_anthropic: 0.1.15
langchain_aws: 0.1.7
langchain_huggingface: 0.0.3
langchain_openai: 0.1.9
langchain_text_splitters: 0.2.2
langchainhub: 0.1.20

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

dosubot bot added the 🤖:bug label on Jun 27, 2024
@keenborder786 (Contributor)

Okay, you are comparing two different things. The Hugging Face Inference Client returns a chat-completion output whose usage attribute is of type ChatCompletionOutputUsage.

The ChatCompletionOutputUsage has three types of token usage:

  1. completion_tokens: This is the number of tokens required to complete the prompt. In your case, this is always fixed because you are calling the same prompt to complete. Try something else, and it should change.
  2. prompt_tokens: The number of tokens in the prompt.
  3. total_tokens: The sum of completion_tokens and prompt_tokens.

So, you are implicitly comparing the total_tokens through the n_tokens function with completion_tokens, which is incorrect. You should compare the total_tokens attribute to make the correct comparison.

P.S. I double-checked the LangChain code and ensured that ChatHuggingFace is returning the correct ChatCompletionOutputUsage without any modification.
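
For reference, a minimal sketch (reusing the response_10 object from the example code above; this snippet is an editorial addition, not part of the original comment) of reading the usage fields directly from the response metadata instead of re-tokenizing the output text:

# The usage object surfaced in response_metadata is the ChatCompletionOutputUsage
# described above; all three fields can be read from it directly.
usage = response_10.response_metadata["token_usage"]
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)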

@BobMerkus (Contributor, Author)


I think you are misunderstanding the example code: the n_tokens() function is called on the content of the chat model's output, so completion_tokens == n_tokens(output) - 1. The extra token is the special end-of-sequence token (which is why the output says 101 tokens rather than 100). The problem is that ChatCompletionOutputUsage.completion_tokens should always be less than or equal to max_new_tokens, yet it is 100 regardless of the supplied max_new_tokens.
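
As an illustration of the invariant described here (reusing response_10 and llm_10 from the example code; this snippet is an addition, not part of the original report):

# The reported completion tokens should never exceed the configured limit, but with
# ChatHuggingFace the usage is stuck at 100 regardless of max_new_tokens.
usage_10 = response_10.response_metadata["token_usage"]
assert usage_10.completion_tokens <= llm_10.max_new_tokens, (
    f"{usage_10.completion_tokens} completion tokens with "
    f"max_new_tokens={llm_10.max_new_tokens}"
)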

@TiagoPinaC

I'm having the same problem... Did you find a solution?

@BobMerkus (Contributor, Author)


No, I have not. This issue makes the entire Hugging Face x LangChain integration unusable for me; I have been working around it by running an OpenAI-compatible web server via LlamaCpp or Ollama instead.

@michael-newsrx

I think this is a ChatHuggingFace bug.

You have to call bind on the chat object with the parameters you want changed before running invoke. They do not carry over from the HuggingFaceEndpoint object to the ChatHuggingFace object.

output_msg = chat_model.bind(max_tokens=8192, temperature=0.0).invoke(chat_sequence)
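
A slightly fuller sketch of this workaround, reusing llm_4096 and messages from the issue's example code; the assumption here is that max_tokens is the chat-completion counterpart of the endpoint's max_new_tokens:

# Bind the generation parameters on the chat model before invoking it; they are not
# inherited from the wrapped HuggingFaceEndpoint.
chat_model = ChatHuggingFace(llm=llm_4096)
bound_chat = chat_model.bind(max_tokens=4096, temperature=0.0)

response = bound_chat.invoke(messages)
print(response.response_metadata["token_usage"].completion_tokens)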

@b5y commented Oct 26, 2024

I'm also having the same problem, and the solution from @michael-newsrx doesn't work for me since the llm is invoked in create_history_aware_retriever and create_retrieval_chain to keep chat history and do RAG.

It works fine when using HuggingFaceEndpoint as the llm on its own, but if I pass the HuggingFaceEndpoint object to ChatHuggingFace as the llm argument, the response is cut off, and I need ChatHuggingFace to keep track of the chat conversation. Increasing max_new_tokens or passing max_tokens via model_args doesn't help.
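
A minimal, untested sketch of the setup described above (the model id comes from the issue's example; the token limit is a placeholder). Since bind() returns a new Runnable, one possible variant of the earlier workaround is to hand the bound chat model to the chain constructors instead of the bare model:

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=2048,  # placeholder limit
)
# bind() returns a Runnable, so the bound model could be passed to
# create_history_aware_retriever / create_stuff_documents_chain in place of
# the unbound model (untested assumption, not confirmed in this thread).
chat_model = ChatHuggingFace(llm=llm).bind(max_tokens=2048)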

Everything works great with ChatOllama, but I am planning to use ChatHuggingFace in production.

Do we have an estimated timeline for when this bug will be fixed?

CC: @baskaryan , @hwchase17 , @efriis , @eyurtsev @ccurme , @nfcampos

@SMAntony


Appreciate the workaround!

@BobMerkus (Contributor, Author)

I've created a bug fix proposal inside this PR that solves this issue: propagate HuggingFaceEndpoint config to ChatHuggingFace #27719


dosubot bot commented Jan 29, 2025

Hi, @BobMerkus. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug with the max_new_tokens parameter not functioning correctly in the ChatHuggingFace and HuggingFaceEndpoint classes.
  • User keenborder786 suggested a possible misunderstanding of token usage types, but you clarified the issue persists.
  • User michael-newsrx proposed a workaround using the bind method, but b5y noted limitations.
  • You submitted a pull request to ensure proper configuration propagation, indicating a resolution is in progress.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • Otherwise, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the stale label on Jan 29, 2025
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 5, 2025
dosubot bot removed the stale label on Feb 5, 2025