
LLama.Exceptions.LLamaDecodeError: llama_decode failed: 'NoKvSlot' #746

Closed
RostislavFisher opened this issue Aug 19, 2024 · 3 comments
Labels
bug (Something isn't working), triage

Comments

@RostislavFisher

Context / Scenario

I am working on a RAG (Retrieval-Augmented Generation) application using LLaMa and a simple local vector database.

I have a singleton instance of a class ChatManager. ChatManager loads the model and builds the kernel memory instance with KernelMemoryBuilder.

Code:

public class ChatManager
{
    private readonly LlamaSharpConfig m_configKernel;
    private readonly LLamaSharpConfig m_LLamaSharpConfig;
    private readonly SearchClientConfig m_SearchClientConfig;
    private readonly TextPartitioningOptions m_TextPartitioningOptions;
    private readonly IKernelMemory kernelMemoryBuilder;
    private readonly DocumentManager m_documentManager;

    public ChatManager(IOptions<ElasticConnectionConfiguration> elasticConfiguration)
    {
        string modelPath = @".....";

        InferenceParams infParams = new() { AntiPrompts = ["\n\n"]};
        LLamaSharpConfig lsConfig = new(modelPath) { DefaultInferenceParams = infParams };
        SearchClientConfig searchClientConfig = new() { MaxMatchesCount = 3, AnswerTokens = 200, EmptyAnswer = "ENDS", StopSequences = new List<string> { "ENDS", "Question:" } };
        m_LLamaSharpConfig = new LLamaSharpConfig(modelPath);
        m_configKernel = new LlamaSharpConfig { ModelPath = modelPath };
        
        m_TextPartitioningOptions = new TextPartitioningOptions
        {
            MaxTokensPerParagraph = 256,
            MaxTokensPerLine = 256,
            OverlappingTokens = 50
        };

        kernelMemoryBuilder = new KernelMemoryBuilder()
            .WithSimpleVectorDb(SimpleVectorDbConfig.Persistent)
            .WithCustomPromptProvider(new CustomPrompt())
            .WithSearchClientConfig(searchClientConfig)
            .WithLlamaTextGeneration(m_configKernel)
            .WithLLamaSharpTextEmbeddingGeneration(m_LLamaSharpConfig)
            .WithCustomTextPartitioningOptions(m_TextPartitioningOptions)
            .With(m_TextPartitioningOptions)
            .Build();
    }

    public async Task ImportDocument(string documentText, string index)
    {
        await kernelMemoryBuilder.ImportTextAsync(documentText, index: index);
    }

    public async Task<QuestionResponse> GenerateAnswerAsync(QuestionRequest request)
    {
        var answer = await kernelMemoryBuilder.AskAsync(request.message, minRelevance: 0.4f, index: request.documentId.ToString());
        var resultJSON = answer.ToJson();
        var resultString = answer.ToString();

        return new QuestionResponse
        {
            message = JObject.Parse(resultJSON)["text"]?.ToString()
        };
    }

    public async Task UpdateDocument(string documentText, string index)
    {
        await kernelMemoryBuilder.DeleteDocumentAsync(index);

        await kernelMemoryBuilder.ImportTextAsync(documentText, index: index, steps: Microsoft.KernelMemory.Constants.PipelineWithoutSummary);
    }

    public async Task DeleteDocument(string index)
    {
        await kernelMemoryBuilder.DeleteIndexAsync(index);
    }
}

What happened?

After performing 3-6 inferences, the AskAsync method throws an exception:

LLama.Exceptions.LLamaDecodeError: llama_decode failed: 'NoKvSlot'
   at LLama.InteractiveExecutor.InferInternal(IInferenceParams inferenceParams, InferStateArgs args)
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)

I'm not very knowledgeable about either the llama.cpp or kernel-memory source code, so take this with a grain of salt: from what I understand, the issue might be that the LlamaSharpTextGenerator does not clear the previous KV cache after each inference. After a certain number of inferences the cache fills up completely, llama_decode can no longer find a free slot, and the NoKvSlot error is thrown.

To support this theory, I have observed that each subsequent answer becomes less accurate and more incoherent: the responses contain increasing amounts of noise, factual content decreases, and the model follows the prompts less and less closely.
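
If that is what is happening, one quick way to test the theory is to give the model a larger context so the KV cache has more headroom; the error should then appear after more inferences rather than fewer. A minimal sketch, assuming the LLamaSharpConfig from LLamaSharp.KernelMemory exposes a ContextSize setting (check your package version):

// Assumption: ContextSize is available on LLamaSharpConfig in your LLamaSharp.KernelMemory version.
// This only delays KV-cache exhaustion; it does not fix the underlying accumulation.
LLamaSharpConfig lsConfig = new(modelPath)
{
    ContextSize = 4096,
    DefaultInferenceParams = new InferenceParams { AntiPrompts = ["\n\n"] }
};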

Importance

edge case

Platform, Language, Versions

Language - C#
KernelMemory version - 0.70.240803.1+9b63662 (latest)

Relevant log output

No response

RostislavFisher added the bug and triage labels on Aug 19, 2024
@dluc
Collaborator

dluc commented Aug 19, 2024

hi @RostislavFisher - since this is an internal exception thrown by LLamaSharp, I would report it as an issue here: https://github.com/SciSharp/LLamaSharp/issues. I would also check SciSharp/LLamaSharp#660 in case it's relevant.

To unblock, you may want to try using Ollama or LM Studio; both work with the OpenAI connector, see the example here:
https://github.com/microsoft/kernel-memory/tree/main/examples/208-dotnet-lmstudio
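
For reference, wiring Kernel Memory to an OpenAI-compatible local server looks roughly like the sketch below; the endpoint, API key and model name are placeholders for an LM Studio-style setup, so adjust them to your environment:

// Rough sketch: route text generation through an OpenAI-compatible local server
// (LM Studio / Ollama) instead of LLamaSharp. All values below are placeholders.
var openAIConfig = new OpenAIConfig
{
    Endpoint = "http://localhost:1234/v1/",   // typical LM Studio local endpoint (assumption)
    APIKey = "lm-studio",                     // local servers usually accept any non-empty key
    TextModel = "your-local-model",
    TextModelMaxTokenTotal = 4096
};

var memory = new KernelMemoryBuilder()
    .WithOpenAITextGeneration(openAIConfig)
    .WithLLamaSharpTextEmbeddingGeneration(new LLamaSharpConfig(modelPath))
    .Build();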

@RostislavFisher
Author

Thanks! I managed to fix it.

Since I'm a newbie to C#, I'm not entirely sure what caused the problem with the KV cache, but I managed to replicate it in a few different projects. So, if anyone else runs into the same problem, here's a quick guide to fix it (a sketch follows the list):

  1. Install the LLamaSharp dependencies. I used version 0.15.0.
  2. Create your own CustomTextGenerator that implements ITextGenerator. You can use LlamaSharpTextGenerator or the Kernel Memory custom LLM example as a reference.
  3. Register your CustomTextGenerator in the kernel memory builder. Check out this example.
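
For anyone following these steps, here is a minimal sketch of what such a generator can look like. It assumes LLamaSharp 0.15.x and the Microsoft.KernelMemory.AI.ITextGenerator interface as it was in KM versions of that time (streaming plain strings); CustomTextGenerator and its members are illustrative names rather than my exact code. The key idea is to run every request through a StatelessExecutor, whose context does not keep KV-cache entries between calls:

using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using LLama;
using LLama.Common;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI;

public sealed class CustomTextGenerator : ITextGenerator
{
    private readonly LLamaWeights _weights;
    private readonly ModelParams _modelParams;

    public CustomTextGenerator(string modelPath, int contextSize = 2048)
    {
        _modelParams = new ModelParams(modelPath) { ContextSize = (uint)contextSize };
        _weights = LLamaWeights.LoadFromFile(_modelParams);
        MaxTokenTotal = contextSize;
    }

    public int MaxTokenTotal { get; }

    // Rough heuristics; use the model's real tokenizer if you need exact counts.
    public int CountTokens(string text) => text.Length / 4;
    public IReadOnlyList<string> GetTokens(string text) => text.Split(' ');

    public async IAsyncEnumerable<string> GenerateTextAsync(
        string prompt,
        TextGenerationOptions options,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // A StatelessExecutor evaluates each prompt in a fresh context, so the KV cache
        // cannot fill up across successive AskAsync calls.
        var executor = new StatelessExecutor(_weights, _modelParams);
        var inferenceParams = new InferenceParams { MaxTokens = options.MaxTokens ?? 200 };

        await foreach (var token in executor.InferAsync(prompt, inferenceParams, cancellationToken))
        {
            yield return token;
        }
    }
}

Registration then replaces the WithLlamaTextGeneration call from the snippet above:

// Illustrative registration; keep your other .With* calls as before.
kernelMemoryBuilder = new KernelMemoryBuilder()
    .WithCustomTextGeneration(new CustomTextGenerator(modelPath))
    .WithLLamaSharpTextEmbeddingGeneration(m_LLamaSharpConfig)
    .Build();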

@dluc
Copy link
Collaborator

dluc commented Aug 21, 2024

FYI I've also upgraded KM to use the latest LLS packages, see v0.71. Cheers

dluc closed this as completed on Aug 21, 2024