Replies: 1 comment
I found some similar discussions and issues related to performance and latency in LlamaIndex workflows that might be helpful for your situation:
Given your setup, here are some specific suggestions to consider:
These strategies should help you address the latency issues you're experiencing with your current workflow setup.
Hi,
I am working on implementing an internal virtual assistant and have a few questions I am hoping you might be able to help me with.
My use case is answering questions over internal documents, combining Q&A with the ability to ask follow-up questions.
For this I am building on a custom CondensePlusContext chat engine, where I change the logic slightly so I can pass nodes directly to the chat method rather than always going through retrieval.
I was thinking of using workflows so that I stay flexible and can add more steps and route between them, but the deeper I dig, the more I am rethinking whether workflows actually make sense here.
My major issue right now is latency - going through the workflow takes ~15 seconds, which is a lot.
Below is sample code without all the steps, just a simplified case that still takes ~12 seconds, deployed with llama-deploy. I am not sure whether I am doing something wrong in how I use it, or whether workflows are simply not the right approach.
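In plain Python, the "pass nodes directly, skip retrieval" idea can be sketched like this (all names here are hypothetical stand-ins for illustration, not the actual LlamaIndex API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    text: str

def retrieve(query: str) -> list[Node]:
    # Hypothetical stand-in for the index retriever.
    return [Node(text=f"retrieved context for: {query}")]

def chat(query: str, nodes: Optional[list[Node]] = None) -> str:
    # If the caller already has nodes (e.g. carried over from a previous
    # turn), skip retrieval entirely; otherwise fall back to the retriever.
    context_nodes = nodes if nodes is not None else retrieve(query)
    context = "\n".join(n.text for n in context_nodes)
    return f"answer based on: {context}"

# First turn: no nodes supplied, so retrieval runs.
first = chat("What is our leave policy?")
# Follow-up turn: reuse nodes from the first turn, no second retrieval.
followup = chat("How do I request it?", nodes=[Node(text="leave policy text")])
```

Skipping the retrieval round-trip on follow-up turns is one of the few latency wins that is free, since the context is already in hand.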
Workflow.py
`
#Deploy Core
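Before deciding against workflows as a whole, it is worth timing each step to see where the ~15 seconds actually go. A framework-free sketch of that instrumentation (plain asyncio; the step names and sleep durations are made up for illustration):

```python
import asyncio
import time

async def condense_question(query: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for the condense LLM call
    return query

async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.02)  # stand-in for the vector-store lookup
    return [f"context for: {query}"]

async def synthesize(query: str, nodes: list[str]) -> str:
    await asyncio.sleep(0.05)  # stand-in for the final LLM call
    return f"answer to: {query}"

async def run_workflow(query: str) -> dict[str, float]:
    # Time each step individually so the slow one stands out.
    timings: dict[str, float] = {}

    t0 = time.perf_counter()
    q = await condense_question(query)
    timings["condense"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    nodes = await retrieve(q)
    timings["retrieve"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    await synthesize(q, nodes)
    timings["synthesize"] = time.perf_counter() - t0
    return timings

timings = asyncio.run(run_workflow("sample question"))
for step, secs in timings.items():
    print(f"{step}: {secs:.3f}s")
```

If each timed step is fast but the end-to-end call is still slow, the overhead is likely in the deployment layer (e.g. message passing in llama-deploy) rather than in the workflow logic itself.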