[Question]: About multiple users chat slow #8972

Closed
dinhan92 opened this issue Nov 17, 2023 · 7 comments
Labels
question Further information is requested

Comments

@dinhan92

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

I have tried using LlamaIndex (non-streaming) with Flask, published on Windows. I tested with two computers using Postman. I found that when I test from only one computer, it responds in about 1 minute, but with two computers it takes about 2 minutes.

So, do I have to switch the system to Linux to fix this, or is there something I missed in the LlamaIndex docs?

dinhan92 added the question label on Nov 17, 2023

dosubot bot commented Nov 17, 2023

🤖

Hello @dinhan92,

Great to see you again and thanks for reaching out. I'm currently looking into your issue and will get back to you with a thorough response shortly.


dosubot bot commented Nov 17, 2023

🤖

Hello @dinhan92,

Good to see you again! I hope you've been doing well since our last interaction. Let's dive right into your issue.

Based on the information you've provided, it seems like the response time of your LlamaIndex application increases when you test it on multiple computers. This could be due to a variety of factors, including network latency, the performance of the computers you're testing on, or the configuration of your LlamaIndex application.

Switching your system to Linux may not necessarily solve this issue, as the performance of LlamaIndex is not solely dependent on the operating system. It's also influenced by other factors such as the Python version, the versions of the required libraries, and the server configuration.

In a similar issue titled Reaction is very slow, is there any params to reduce reaction time?, the suggested solution was to use a newer model, gpt-3.5, which is reported to respond faster. You might want to consider this if you're not already using it.
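
As a rough sketch of what that can look like with the pre-0.10 llama_index API used later in this thread (the import paths and the ServiceContext wiring here are assumptions based on that API, not a confirmed fix):

from llama_index import ServiceContext
from llama_index.llms import OpenAI

# Use gpt-3.5-turbo as the LLM; pass this service_context to
# load_index_from_storage / as_query_engine so queries go through it.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)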

Also, ensure that your system meets the system requirements for running LlamaIndex optimally. This includes using the correct Python version (>=3.8.1 and <3.12) and the correct versions of the required libraries.

If you're still experiencing slow response times after checking these factors, it would be helpful if you could provide more information about your setup. For example, the specific versions of Python and the libraries you're using, the specifications of the computers you're testing on, and any error messages you're seeing. This will help in diagnosing the issue.



This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dinhan92
Author

dinhan92 commented Nov 17, 2023

I use Windows Server 2019 and Python 3.11.5. I have two approaches: the first uses Flask with wfastcgi, and the second uses Quart run as an NSSM Python service, with a C# API calling that service. Both respond quite slowly with two computers.
Here is my Python code:

@app.route('/api/chatbot', methods=['GET', 'POST'])
async def chatbot(**kwargs) -> str:
    try:
        data = await request.data
        input_text = data.decode("utf-8")

        report_indices = {}
        query_engine = {}
        query_engine_tools = []
        # Rebuild the storage context and reload every index from disk on each request
        for name_pdf in name_pdfs:
            name_pdf_get = name_pdf.Name
            storage_context = StorageContext.from_defaults(persist_dir=f"./storage_{name_pdf_get}")
            report_indices[name_pdf_get] = load_index_from_storage(storage_context = storage_context,
                                        # embed_model=embed_model, 
                                        service_context = service_context)
            query_engine[name_pdf_get] = report_indices[name_pdf_get].as_query_engine()
            query_engine_tools.append(QueryEngineTool(query_engine=query_engine[name_pdf_get], metadata = ToolMetadata(name=name_pdf_get, description=name_pdf.Description)))

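        # Build an OpenAI agent over all of the per-PDF query engine tools (re-created on every request)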
        agent = OpenAIAgent.from_tools(
            tools=query_engine_tools,
            llm=llm,
            chat_history=[
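                # Seed chat history (Vietnamese): greetings plus "What is the sweet potato plant?" / "The sweet potato is a food crop."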
                ChatMessage(role=MessageRole.USER, content="Chào BSNH AI, chúng ta sẽ bàn về nông nghiệp ngày hôm nay."),
                ChatMessage(role=MessageRole.ASSISTANT, content="Được bạn ơi, nghe có vẻ hấp dẫn."),
                ChatMessage(role=MessageRole.USER, content="Bạn có khỏe không?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Tôi rất khỏe, cám ơn."),
                ChatMessage(role=MessageRole.USER, content="Sức khỏe của bạn như thế nào?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Tôi cực kỳ khỏe luôn ấy, sức có thể vỡ núi."),
                ChatMessage(role=MessageRole.USER, content="Cây khoai lang là cây gì?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Cây khoai lang là một loại cây lương thực."),
            ],
            prefix_messages=[
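                # System prompt (Vietnamese): "You are a virtual assistant of TNH99, named BSNH AI, developed by TNH99; your main topic is agriculture. If you cannot answer, tell the user to contact TNH99. Please answer in Vietnamese."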
                ChatMessage(role=MessageRole.SYSTEM, content="Bạn là một trợ lý ảo của TNH99, có tên là BSNH AI, được phát triển bởi TNH99, chủ đề chính của bạn là nông nghiệp. Nếu bạn không trả lời được hãy bảo người dùng liên hệ với TNH99. Vui lòng trả lời bằng tiếng Việt."),
                ChatMessage(role=MessageRole.USER, content="Chào BSNH AI, chúng ta sẽ bàn về nông nghiệp ngày hôm nay."),
                ChatMessage(role=MessageRole.ASSISTANT, content="Được bạn ơi, nghe có vẻ hấp dẫn."),
                ChatMessage(role=MessageRole.USER, content="Bạn có khỏe không?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Tôi rất khỏe, cám ơn."),
                ChatMessage(role=MessageRole.USER, content="Sức khỏe của bạn như thế nào?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Tôi cực kỳ khỏe luôn ấy, sức có thể vỡ núi."),
                ChatMessage(role=MessageRole.USER, content="Cây khoai lang là cây gì?"),
                ChatMessage(role=MessageRole.ASSISTANT, content="Cây khoai lang là một loại cây lương thực."),
            ],
            verbose=True
        )

        agent_stream = await agent.achat(input_text)

        savedAnswer = dumps(ChatHistory(Question=input_text, Answer=agent_stream.response, IsDone=True, IdUser="").__dict__)

        return app.response_class(savedAnswer)
    except Exception as e:
        print(e)
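        # Fallback reply (Vietnamese): "Sorry, BSNH AI is busy."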
        exception = "Xin lỗi, BSNH AI đang bận."
        return app.response_class(exception)

@logan-markewich
Collaborator

@dinhan92 a lot of this depends on how you designed your app.

I see you are loading the indexes on every request. This is not ideal; maybe look into using a global query engine?

@dinhan92
Author

@logan-markewich what do you mean by using a global query engine?

@gich2009

Hi @dinhan92, each time your endpoint is hit there is overhead from loading the indices. A better approach is to load the indices outside of the controller (globally, instead of within the endpoint function) and construct the query engines and the query engine tools array there as well. You can go further and construct the OpenAIAgent outside the controller too. Try to minimise the amount of work done inside your controller. A good rule of thumb: if something will not change with every API call, do it outside of your controller. A sketch of that refactor follows this comment.

Of course there is a memory cost to this, but that's the tradeoff, and if you ask me it's well worth it, since the memory would be used anyway whenever the endpoint is hit.
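
For illustration, a minimal sketch of that refactor against the code posted above. It reuses names from that snippet (name_pdfs, service_context, llm, ChatHistory, the Quart app) and omits the Vietnamese chat_history/prefix_messages for brevity; treat it as a starting point rather than a drop-in replacement:

# Done once at import time: load every persisted index and build the
# query engine tools and the agent, instead of repeating this per request.
query_engine_tools = []
for name_pdf in name_pdfs:
    storage_context = StorageContext.from_defaults(persist_dir=f"./storage_{name_pdf.Name}")
    index = load_index_from_storage(storage_context=storage_context, service_context=service_context)
    query_engine_tools.append(
        QueryEngineTool(
            query_engine=index.as_query_engine(),
            metadata=ToolMetadata(name=name_pdf.Name, description=name_pdf.Description),
        )
    )

agent = OpenAIAgent.from_tools(tools=query_engine_tools, llm=llm, verbose=True)

@app.route('/api/chatbot', methods=['GET', 'POST'])
async def chatbot() -> str:
    # The controller now only decodes the request and calls the shared agent.
    input_text = (await request.data).decode("utf-8")
    try:
        result = await agent.achat(input_text)
        answer = dumps(ChatHistory(Question=input_text, Answer=result.response, IsDone=True, IdUser="").__dict__)
        return app.response_class(answer)
    except Exception as e:
        print(e)
        return app.response_class("Xin lỗi, BSNH AI đang bận.")  # "Sorry, BSNH AI is busy."

One caveat: a single module-level OpenAIAgent also shares its chat memory across all callers, so with several concurrent users you may want to keep the tools global but create an agent per user or session; loading the indices once is still the main win.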

@logan-markewich
Collaborator

@gich2009 is correct!
