
Unable to query Query Docs mode when using sagemaker #1367

Closed
LvffY opened this issue Dec 5, 2023 · 7 comments

Comments

@LvffY

LvffY commented Dec 5, 2023

Context

Hi everyone,

I have sent a message through Discord, but thought it would be easier to manage the issue here.

What I'm trying to achieve is to run privateGPT in a production-grade environment. To do so, I've tried to run something like:

  • Create a Qdrant database in Qdrant cloud
  • Run LLM model and embedding model through Sagemaker

So far I have the following setups:

LLM       | Embedding | Qdrant       | State
Local     | Local     | Local        | Success
Local     | Local     | Qdrant cloud | Success
Sagemaker | Local     | Local        | Failed
Sagemaker | Sagemaker | Qdrant cloud | Failed

As we can see, whenever I try to use Sagemaker, it seems to fail to use the RAG architecture.

How to reproduce

I've tried to set up the simplest possible reproduction; if you want me to test anything else, do not hesitate to ask.

  • Create a new sagemaker profile with the following settings-sagemaker.yaml:
server:
  env_name: ${APP_ENV:prod}
  port: ${PORT:8001}

ui:
  enabled: true
  path: /

llm:
  mode: sagemaker

embedding:
  # Should be matching the value above in most cases
  mode: local
  ingest_mode: simple

sagemaker:
  llm_endpoint_name: TheBloke-Mistral-7B-Instruct-v0-1-GPTQ # This should be first deployed in your sagemaker instance
  embedding_endpoint_name: BAAI-bge-large-en-v1-5 # This should be first deployed in your sagemaker instance
  • Ingest a file through the UI
  • Run PGPT_PROFILES=sagemaker make run
  • Ask a question related to your file. (Weirdly, if you ask a question completely unrelated, it may work...) A sketch for calling the SageMaker endpoint directly, outside privateGPT, follows this list.
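As a sanity check, the SageMaker endpoint can also be invoked directly with boto3, outside of privateGPT. The snippet below is only a sketch: the payload shape assumes the model was deployed with the HuggingFace TGI container, so adjust it to your deployment.

import json

import boto3

# Sketch only: the payload format assumes a HuggingFace TGI deployment of the
# TheBloke-Mistral-7B-Instruct-v0-1-GPTQ endpoint configured above.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="TheBloke-Mistral-7B-Instruct-v0-1-GPTQ",
    ContentType="application/json",
    Body=json.dumps(
        {
            "inputs": "[INST] What is in the ingested document? [/INST]",
            "parameters": {"max_new_tokens": 256},
        }
    ),
)
print(response["Body"].read().decode("utf-8"))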

Actual behavior

  • The UI is sending a weird output (see the attached screenshot)
  • We see a weird warning related to llama_index (see the attached screenshot)

Expected behavior

To be able to use the Query Docs mode even when using sagemaker.

@logan-markewich

Oof, that's a rough one to debug.

The error is somewhere in this block in LlamaIndex (see the attached screenshot).

I don't really know where token is coming from... nothing uses/mentions a variable by that name 🤔

is_function(), put_in_queue(), and memory.put() are also all very simple 1-2 line functions (and none of them mention token either)

I'm not 100% sure how privateGPT implements Sagemaker LLMs, but it might be related to that. Something to do with how the LLM is streaming is my guess.

@logan-markewich

Probably this line of code here

https://github.com/imartinez/privateGPT/blob/9302620eaca56d00818cb4db87ea1e8a8aa170f9/private_gpt/components/llm/custom/sagemaker.py#L256

@LvffY
Author

LvffY commented Dec 11, 2023

Thanks @logan-markewich, I'll try to dig further in that direction.

As I'm not an expert, this could take a while, so if anyone has a solution or wants to try, be my guest :)

Also, I've tried with the latest 0.2.0 release and I see the same behavior.

@LvffY
Author

LvffY commented Dec 11, 2023

Probably this line of code here

https://github.com/imartinez/privateGPT/blob/9302620eaca56d00818cb4db87ea1e8a8aa170f9/private_gpt/components/llm/custom/sagemaker.py#L256

Based on this message, I've dug a little more, and it seems that the prompt message is not passed to the endpoint.

And then, because the prompt is empty, the endpoint sends back an uncaught exception in the message body (but the endpoint clearly returns a 200, so this is not really an exception here).

It seems this is the line that returns an empty prompt.

https://github.com/imartinez/privateGPT/blob/e8ac51bba4b698c8a66dfd02bda5020f4a08f0cd/private_gpt/components/llm/custom/sagemaker.py#L273

I've tried to add some prints and I got the output shown in the attached screenshot.
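For reference, the tracing was essentially of this shape. This is only a rough sketch: SagemakerLLM.stream_complete is an assumption about which method sits around the linked line, so adjust it to whatever sagemaker.py actually defines there.

import functools

from private_gpt.components.llm.custom.sagemaker import SagemakerLLM

# Hypothetical tracing sketch: wrap the method that builds the SageMaker request
# and log what it receives before the request goes out.
_original_stream_complete = SagemakerLLM.stream_complete

@functools.wraps(_original_stream_complete)
def traced_stream_complete(self, prompt, **kwargs):
    print(f"[trace] prompt={prompt!r}")   # this is the value that comes out empty
    print(f"[trace] kwargs={kwargs!r}")
    return _original_stream_complete(self, prompt, **kwargs)

SagemakerLLM.stream_complete = traced_stream_complete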

I'm not used to the Field object, so I don't know where to start. If you have any ideas @logan-markewich (or @pabloogc, since you started this code).

I'll try to run more tests; don't hesitate to ask if you have questions.

@LvffY
Author

LvffY commented Dec 15, 2023

I think I have a solution to my problem. Thanks to the help of @logan-markewich (in this Discord thread), I've dug quite a bit through the code, and it appears to be an issue with the chat memory.

The "cause" was that the retrieved context was so long that the LLM ended up receiving only the context and not my question.

I fixed it with two different solutions (and kept only the second one):

  • You can change the _chat_engine method in the chat_service module by passing ChatMemoryBuffer.from_defaults(token_limit=3900) to increase the token limit. (Of course, 3900 is arbitrary.) A sketch of this is included after the code below.

  • Digging a bit more into the ChatMemoryBuffer.from_defaults method, it appeared to me that this limit can also be controlled through the context_window parameter, which was already configured for the LlamaCPP LLM. Hence extracting this parameter into the settings seemed like a good idea. Based on that, in the LLM settings:

class LLMSettings(BaseModel):
    mode: Literal["local", "openai", "sagemaker", "mock"]
    max_new_tokens: int = Field(
        256,
        description="The maximum number of token that the LLM is authorized to generate in one completion.",
    )
    ## This is new
    context_window_size: int = Field(
        3900,
        description=(
            "Size of the context window.\n"
            "llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room.\n"
            "You may need to increase this context window, see https://github.com/imartinez/privateGPT/issues/1367."
        ),
    )
            case "local":
                from llama_index.llms import LlamaCPP

                prompt_style = get_prompt_style(settings.local.prompt_style)

                self.llm = LlamaCPP(
                    model_path=str(models_path / settings.local.llm_hf_model_file),
                    temperature=0.1,
                    max_new_tokens=settings.llm.max_new_tokens,
                    context_window=settings.llm.context_window_size, # This line is changed
                    generate_kwargs={},
                    # All to GPU
                    model_kwargs={"n_gpu_layers": -1},
                    # transform inputs into Llama2 format
                    messages_to_prompt=prompt_style.messages_to_prompt,
                    completion_to_prompt=prompt_style.completion_to_prompt,
                    verbose=True,
                )

            case "sagemaker":
                from private_gpt.components.llm.custom.sagemaker import SagemakerLLM

                self.llm = SagemakerLLM(
                    endpoint_name=settings.sagemaker.llm_endpoint_name,
                    max_new_tokens=settings.llm.max_new_tokens,
                    context_window=settings.llm.context_window_size, # This line is added
                )
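
For completeness, the first workaround (the ChatMemoryBuffer change mentioned above) was roughly as follows. This is only a sketch: it assumes the chat engine built in the chat_service module accepts a memory argument, and the exact wiring in _chat_engine may differ.

from llama_index.memory import ChatMemoryBuffer

# Larger token limit so the retrieved context does not crowd out the question.
# 3900 is arbitrary; pick whatever fits your model's context window.
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

# Then pass memory=memory when building the chat engine inside _chat_engine
# (for example, as an extra argument to the ContextChatEngine construction there).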

If you think it's useful, I could open a PR.

@imartinez
Collaborator

I'm working on a PR that makes context_window and max_new_tokens customizable in settings.yaml

Using the right context_window is the way to go imo.

Thanks for the support and the documentation!

@LvffY
Author

LvffY commented Dec 26, 2023

Closed by #1437

LvffY closed this as completed Dec 26, 2023