Context / Scenario
I am working on a RAG (Retrieval-Augmented Generation) application using LLaMA and a simple local vector database.
I have a singleton instance of a class ChatManager. ChatManager loads the model and builds a KernelMemoryBuilder.
Code:
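(The original code block did not survive the page capture. As a stand-in, here is a minimal sketch of the setup described above, assuming the LLamaSharp.KernelMemory integration package; the model path, context size, and class shape are placeholders, not the reporter's actual code.)

```csharp
using System.Threading.Tasks;
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// Hypothetical reconstruction of the setup described above.
public sealed class ChatManager
{
    // Single shared instance for the whole application.
    public static ChatManager Instance { get; } = new();

    private readonly IKernelMemory _memory;

    private ChatManager()
    {
        var config = new LLamaSharpConfig("models/model.gguf") // placeholder path
        {
            ContextSize = 4096,
        };

        _memory = new KernelMemoryBuilder()
            .WithLLamaSharpDefaults(config) // LLamaSharp for generation + embeddings
            .WithSimpleVectorDb()           // simple local vector store
            .Build<MemoryServerless>();
    }

    public Task<MemoryAnswer> AskAsync(string question)
        => _memory.AskAsync(question);
}
```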
What happened?
After performing 3-6 inferences, the AskAsync method throws an exception:

LLama.Exceptions.LLamaDecodeError: llama_decode failed: 'NoKvSlot'
   at LLama.InteractiveExecutor.InferInternal(IInferenceParams inferenceParams, InferStateArgs args)
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)
I'm not very knowledgeable about either the llama.cpp or the kernel-memory source code, so take this with a grain of salt:
From what I understand, the issue might be that LlamaSharpTextGenerator does not clear the previous KV cache after each inference. After a certain number of inferences the context's KV cache fills up, no free slot is left for the new tokens, and the NoKvSlot error is raised.
To support this theory, I have observed that each subsequent answer becomes less accurate and more incoherent: the responses contain more and more noise, less factual information, and the model seems to follow the prompts less and less closely.
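(If that theory holds, one workaround is to reset the context's KV cache between inferences. The helper below is only a sketch: `KvCacheClear` is the call LLamaSharp exposes on the native context handle, wrapping llama.cpp's `llama_kv_cache_clear`, but verify the name against your LLamaSharp version, and note the default builder wiring does not hand you the `LLamaContext` directly. Clearing the cache underneath a stateful executor can also desync its internal token history, so the stateless-executor approach sketched at the bottom of this page is cleaner.)

```csharp
using LLama;

// Hypothetical helper: wipe all KV-cache slots so the next inference starts
// from an empty cache. Verify the method name against your LLamaSharp version.
static void ResetKvCache(LLamaContext context)
{
    context.NativeHandle.KvCacheClear();
}
```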
Importance
edge case
Platform, Language, Versions
Language - C#
KernelMemory version - 0.70.240803.1+9b63662 (latest)
Relevant log output
No response
Since I'm a newbie to C#, I'm not entirely sure what caused the KV-cache problem, but I managed to replicate it in a few different projects. So, if anyone else runs into the same problem, here's a quick guide to fix it:
Install the LLamaSharp dependencies. I use version 0.15.0
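(The rest of the original guide was not captured. As an illustration of the kind of fix usually suggested for this error, here is a hedged sketch that wires LLamaSharp into Kernel Memory manually with a StatelessExecutor, which re-evaluates each prompt from scratch instead of accumulating state in a shared context. The constructor and extension names reflect my reading of LLamaSharp.KernelMemory 0.15.x and should be verified against your installed version.)

```csharp
using LLama;
using LLama.Common;
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

var modelPath = "models/model.gguf"; // placeholder

var parameters = new ModelParams(modelPath) { ContextSize = 4096 };
using var weights = LLamaWeights.LoadFromFile(parameters);
var context = weights.CreateContext(parameters); // kept alive by the generator

// A stateless executor re-processes each prompt independently, so stale
// KV-cache entries cannot pile up across AskAsync calls.
var executor = new StatelessExecutor(weights, parameters);

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(new LlamaSharpTextGenerator(weights, context, executor))
    .WithLLamaSharpTextEmbeddingGeneration(new LLamaSharpConfig(modelPath))
    .WithSimpleVectorDb()
    .Build<MemoryServerless>();
```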