
Unable to query Query Docs mode when using sagemaker #1367

Closed
LvffY opened this issue Dec 5, 2023 · 7 comments

Comments

@LvffY

LvffY commented Dec 5, 2023

Context

Hi everyone,

I have sent a message through Discord, but thought it would be easier to manage the issue here.

What I'm trying to achieve is to run privateGPT in a production-grade environment. To do so, I've tried to run something like:

  • Create a Qdrant database in Qdrant cloud
  • Run LLM model and embedding model through Sagemaker

So far I have the following setups:

LLM       | Embedding | Qdrant       | State
Local     | Local     | Local        | Success
Local     | Local     | Qdrant cloud | Success
Sagemaker | Local     | Local        | Failed
Sagemaker | Sagemaker | Qdrant cloud | Failed

As we can see, whenever I try to use Sagemaker, it seems to fail to use the RAG architecture.

How to reproduce

I've tried to set up the simplest possible reproduction; if you want me to test anything else, do not hesitate to ask.

  • Create a new sagemaker profile with the following settings-sagemaker.yaml:
server:
  env_name: ${APP_ENV:prod}
  port: ${PORT:8001}

ui:
  enabled: true
  path: /

llm:
  mode: sagemaker

embedding:
  # Should be matching the value above in most cases
  mode: local
  ingest_mode: simple

sagemaker:
  llm_endpoint_name: TheBloke-Mistral-7B-Instruct-v0-1-GPTQ # This should be first deployed in your sagemaker instance
  embedding_endpoint_name: BAAI-bge-large-en-v1-5 # This should be first deployed in your sagemaker instance
  • Ingest a file through the UI
  • Run PGPT_PROFILES=sagemaker make run
  • Ask a question related to your file. (Weirdly, if you ask a question completely unrelated, it may work...) A sketch for calling the SageMaker endpoint directly, outside privateGPT, follows this list.
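As a sanity check, the SageMaker endpoint can also be invoked directly with boto3, outside of privateGPT. The snippet below is only a sketch: the payload shape assumes the model was deployed with the HuggingFace TGI container, so adjust it to your deployment.

import json

import boto3

# Sketch only: the payload format assumes a HuggingFace TGI deployment of the
# TheBloke-Mistral-7B-Instruct-v0-1-GPTQ endpoint configured above.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="TheBloke-Mistral-7B-Instruct-v0-1-GPTQ",
    ContentType="application/json",
    Body=json.dumps(
        {
            "inputs": "[INST] What is in the ingested document? [/INST]",
            "parameters": {"max_new_tokens": 256},
        }
    ),
)
print(response["Body"].read().decode("utf-8"))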

Actual behavior

  • The UI is sending a weird output (see the attached screenshot)
  • We see a weird warning related to llama_index (see the attached screenshot)

Expected behavior

To be able to use the Query Docs mode even when using sagemaker.

@logan-markewich

Oof, that's a rough one to debug.

The error is somewhere in this block in LlamaIndex (see the attached screenshot).

I don't really know where token is coming from... nothing uses/mentions a variable by that name 🤔

is_function(), put_in_queue(), and memory.put() are also all very simple 1-2 line functions (and none of them mention token either)

I'm not 100% sure how privateGPT implements Sagemaker LLMs, but it might be related to that. Something to do with how the LLM is streaming is my guess.

@logan-markewich

Probably this line of code here

https://github.com/imartinez/privateGPT/blob/9302620eaca56d00818cb4db87ea1e8a8aa170f9/private_gpt/components/llm/custom/sagemaker.py#L256

@LvffY
Author

LvffY commented Dec 11, 2023

Thanks @logan-markewich, I'll try to dig further in that direction.

As I'm not an expert, this could take a while, so if anyone has a solution or wants to try, be my guest :)

Also, I've tried with the latest 0.2.0 release and I see the same behavior.

@LvffY
Author

LvffY commented Dec 11, 2023

Probably this line of code here

https://github.com/imartinez/privateGPT/blob/9302620eaca56d00818cb4db87ea1e8a8aa170f9/private_gpt/components/llm/custom/sagemaker.py#L256

Based on this message, I've dug a little more, and it seems that the prompt message is not passed to the endpoint.

And then, because the prompt is empty, the endpoint sends back an uncaught exception in the message body (but the endpoint clearly returns a 200, so this is not really an exception here).

It seems this is the line that returns an empty prompt.

https://github.com/imartinez/privateGPT/blob/e8ac51bba4b698c8a66dfd02bda5020f4a08f0cd/private_gpt/components/llm/custom/sagemaker.py#L273

I've tried to add some prints and I got the output shown in the attached screenshot.
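For reference, the tracing was essentially of this shape. This is only a rough sketch: SagemakerLLM.stream_complete is an assumption about which method sits around the linked line, so adjust it to whatever sagemaker.py actually defines there.

import functools

from private_gpt.components.llm.custom.sagemaker import SagemakerLLM

# Hypothetical tracing sketch: wrap the method that builds the SageMaker request
# and log what it receives before the request goes out.
_original_stream_complete = SagemakerLLM.stream_complete

@functools.wraps(_original_stream_complete)
def traced_stream_complete(self, prompt, **kwargs):
    print(f"[trace] prompt={prompt!r}")   # this is the value that comes out empty
    print(f"[trace] kwargs={kwargs!r}")
    return _original_stream_complete(self, prompt, **kwargs)

SagemakerLLM.stream_complete = traced_stream_complete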

I'm not used to the Field object, so I don't know where to start. If you have any ideas @logan-markewich (or @pabloogc, since you started this code).

I'll try to run more tests; don't hesitate to ask if you have questions.

@LvffY
Author

LvffY commented Dec 15, 2023

I think I have a solution to my problem. Thanks to the help of @logan-markewich (in this Discord thread), I've dug quite a bit through the code, and it appears to be an issue with the chat memory.

The "cause" was that the retrieved context was so long that the LLM ended up receiving only the context and not my question.

I fixed it with two different solutions (and kept only the second one):

  • You can change the _chat_engine method in the chat_service module by passing ChatMemoryBuffer.from_defaults(token_limit=3900) to increase the token limit. (Of course, 3900 is arbitrary.) A sketch of this is included after the code below.

  • Digging a bit more into the ChatMemoryBuffer.from_defaults method, it appeared to me that this limit can also be controlled through the context_window parameter, which was already configured for the LlamaCPP LLM. Hence extracting this parameter into the settings seemed like a good idea. Based on that, in the LLM settings:

class LLMSettings(BaseModel):
    mode: Literal["local", "openai", "sagemaker", "mock"]
    max_new_tokens: int = Field(
        256,
        description="The maximum number of token that the LLM is authorized to generate in one completion.",
    )
    ## This is new
    context_window_size: int = Field(
        3900,
        description=(
            "Size of the context window.\n"
            "llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room.\n"
            "You may need to increase this context window, see https://github.com/imartinez/privateGPT/issues/1367."
        ),
    )
            case "local":
                from llama_index.llms import LlamaCPP

                prompt_style = get_prompt_style(settings.local.prompt_style)

                self.llm = LlamaCPP(
                    model_path=str(models_path / settings.local.llm_hf_model_file),
                    temperature=0.1,
                    max_new_tokens=settings.llm.max_new_tokens,
                    context_window=settings.llm.context_window_size, # This line is changed
                    generate_kwargs={},
                    # All to GPU
                    model_kwargs={"n_gpu_layers": -1},
                    # transform inputs into Llama2 format
                    messages_to_prompt=prompt_style.messages_to_prompt,
                    completion_to_prompt=prompt_style.completion_to_prompt,
                    verbose=True,
                )

            case "sagemaker":
                from private_gpt.components.llm.custom.sagemaker import SagemakerLLM

                self.llm = SagemakerLLM(
                    endpoint_name=settings.sagemaker.llm_endpoint_name,
                    max_new_tokens=settings.llm.max_new_tokens,
                    context_window=settings.llm.context_window_size, # This line is added
                )
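
For completeness, the first workaround (the ChatMemoryBuffer change mentioned above) was roughly as follows. This is only a sketch: it assumes the chat engine built in the chat_service module accepts a memory argument, and the exact wiring in _chat_engine may differ.

from llama_index.memory import ChatMemoryBuffer

# Larger token limit so the retrieved context does not crowd out the question.
# 3900 is arbitrary; pick whatever fits your model's context window.
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

# Then pass memory=memory when building the chat engine inside _chat_engine
# (for example, as an extra argument to the ContextChatEngine construction there).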

If you think it's useful, I could open a PR.

@imartinez
Collaborator

I'm working on a PR that makes context_window and max_new_tokens customizable in settings.yaml

Using the right context_window is the way to go imo.

Thanks for the support and the documentation!

@LvffY
Author

LvffY commented Dec 26, 2023

Closed by #1437

LvffY closed this as completed Dec 26, 2023