
LLama.Exceptions.LLamaDecodeError: llama_decode failed: 'NoKvSlot' #746

Closed
RostislavFisher opened this issue Aug 19, 2024 · 3 comments
Labels
bug (Something isn't working), triage

Comments

@RostislavFisher

Context / Scenario

I am working on a RAG (Retrieval-Augmented Generation) application using LLaMa and a simple local vector database.

I have a singleton instance of a class ChatManager. ChatManager loads the model and builds the kernel memory instance with KernelMemoryBuilder.

Code:

public class ChatManager
{
    private readonly LlamaSharpConfig m_configKernel;
    private readonly LLamaSharpConfig m_LLamaSharpConfig;
    private readonly SearchClientConfig m_SearchClientConfig;
    private readonly TextPartitioningOptions m_TextPartitioningOptions;
    private readonly IKernelMemory kernelMemoryBuilder;
    private readonly DocumentManager m_documentManager;

    public ChatManager(IOptions<ElasticConnectionConfiguration> elasticConfiguration)
    {
        string modelPath = @".....";

        InferenceParams infParams = new() { AntiPrompts = ["\n\n"]};
        LLamaSharpConfig lsConfig = new(modelPath) { DefaultInferenceParams = infParams };
        SearchClientConfig searchClientConfig = new() { MaxMatchesCount = 3, AnswerTokens = 200, EmptyAnswer = "ENDS", StopSequences = new List<string> { "ENDS", "Question:" } };
        m_LLamaSharpConfig = new LLamaSharpConfig(modelPath);
        m_configKernel = new LlamaSharpConfig { ModelPath = modelPath };
        
        m_TextPartitioningOptions = new TextPartitioningOptions
        {
            MaxTokensPerParagraph = 256,
            MaxTokensPerLine = 256,
            OverlappingTokens = 50
        };

        kernelMemoryBuilder = new KernelMemoryBuilder()
            .WithSimpleVectorDb(SimpleVectorDbConfig.Persistent)
            .WithCustomPromptProvider(new CustomPrompt())
            .WithSearchClientConfig(searchClientConfig)
            .WithLlamaTextGeneration(m_configKernel)
            .WithLLamaSharpTextEmbeddingGeneration(m_LLamaSharpConfig)
            .WithCustomTextPartitioningOptions(m_TextPartitioningOptions)
            .With(m_TextPartitioningOptions)
            .Build();
    }

    public async Task ImportDocument(string documentText, string index)
    {
        await kernelMemoryBuilder.ImportTextAsync(documentText, index: index);
    }

    public async Task<QuestionResponse> GenerateAnswerAsync(QuestionRequest request)
    {
        var answer = await kernelMemoryBuilder.AskAsync(request.message, minRelevance: 0.4f, index: request.documentId.ToString());
        var resultJSON = answer.ToJson();
        var resultString = answer.ToString();

        return new QuestionResponse
        {
            message = JObject.Parse(resultJSON)["text"]?.ToString()
        };
    }

    public async Task UpdateDocument(string documentText, string index)
    {
        await kernelMemoryBuilder.DeleteDocumentAsync(index);

        await kernelMemoryBuilder.ImportTextAsync(documentText, index: index, steps: Microsoft.KernelMemory.Constants.PipelineWithoutSummary);
    }

    public async Task DeleteDocument(string index)
    {
        await kernelMemoryBuilder.DeleteIndexAsync(index);
    }
}

What happened?

After performing 3-6 inferences, the AskAsync method throws an exception:

LLama.Exceptions.LLamaDecodeError: llama_decode failed: 'NoKvSlot'
   at LLama.InteractiveExecutor.InferInternal(IInferenceParams inferenceParams, InferStateArgs args)
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
   at LLama.StatefulExecutorBase.InferAsync(String text, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)

I'm not very knowledgeable about either the llama.cpp or kernel-memory source code, so take this with a grain of salt: from what I understand, the issue might be that the LlamaSharpTextGenerator does not clear the previous KV cache after each inference. After a certain number of inferences the cache fills up completely, llama_decode can no longer find a free slot, and the NoKvSlot error is thrown.

To support this theory, I have observed that each subsequent answer becomes less accurate and more incoherent: the responses contain increasing amounts of noise, factual content decreases, and the model follows the prompts less and less closely.
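
If that is what is happening, one quick way to test the theory is to give the model a larger context so the KV cache has more headroom; the error should then appear after more inferences rather than fewer. A minimal sketch, assuming the LLamaSharpConfig from LLamaSharp.KernelMemory exposes a ContextSize setting (check your package version):

// Assumption: ContextSize is available on LLamaSharpConfig in your LLamaSharp.KernelMemory version.
// This only delays KV-cache exhaustion; it does not fix the underlying accumulation.
LLamaSharpConfig lsConfig = new(modelPath)
{
    ContextSize = 4096,
    DefaultInferenceParams = new InferenceParams { AntiPrompts = ["\n\n"] }
};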

Importance

edge case

Platform, Language, Versions

Language - C#
KernelMemory version - 0.70.240803.1+9b63662 (latest)

Relevant log output

No response

RostislavFisher added the bug and triage labels on Aug 19, 2024
@dluc
Collaborator

dluc commented Aug 19, 2024

hi @RostislavFisher - since this is an internal exception thrown by LLamaSharp, I would report it as an issue here: https://github.com/SciSharp/LLamaSharp/issues. I would also check SciSharp/LLamaSharp#660 in case it's relevant.

To unblock, you may want to try using Ollama or LM Studio; both work with the OpenAI connector, see the example here:
https://github.com/microsoft/kernel-memory/tree/main/examples/208-dotnet-lmstudio
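
For reference, wiring Kernel Memory to an OpenAI-compatible local server looks roughly like the sketch below; the endpoint, API key and model name are placeholders for an LM Studio-style setup, so adjust them to your environment:

// Rough sketch: route text generation through an OpenAI-compatible local server
// (LM Studio / Ollama) instead of LLamaSharp. All values below are placeholders.
var openAIConfig = new OpenAIConfig
{
    Endpoint = "http://localhost:1234/v1/",   // typical LM Studio local endpoint (assumption)
    APIKey = "lm-studio",                     // local servers usually accept any non-empty key
    TextModel = "your-local-model",
    TextModelMaxTokenTotal = 4096
};

var memory = new KernelMemoryBuilder()
    .WithOpenAITextGeneration(openAIConfig)
    .WithLLamaSharpTextEmbeddingGeneration(new LLamaSharpConfig(modelPath))
    .Build();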

@RostislavFisher
Author

Thanks! I managed to fix it.

Since I'm a newbie to C#, I'm not entirely sure what caused the problem with the KV cache, but I managed to replicate it in a few different projects. So, if anyone else runs into the same problem, here's a quick guide to fix it (a sketch follows the list):

  1. Install the LLamaSharp dependencies. I used version 0.15.0.
  2. Create your own CustomTextGenerator that implements ITextGenerator. You can use LlamaSharpTextGenerator or the Kernel Memory custom LLM example as a reference.
  3. Register your CustomTextGenerator in the kernel memory builder. Check out this example.
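
For anyone following these steps, here is a minimal sketch of what such a generator can look like. It assumes LLamaSharp 0.15.x and the Microsoft.KernelMemory.AI.ITextGenerator interface as it was in KM versions of that time (streaming plain strings); CustomTextGenerator and its members are illustrative names rather than my exact code. The key idea is to run every request through a StatelessExecutor, whose context does not keep KV-cache entries between calls:

using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using LLama;
using LLama.Common;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI;

public sealed class CustomTextGenerator : ITextGenerator
{
    private readonly LLamaWeights _weights;
    private readonly ModelParams _modelParams;

    public CustomTextGenerator(string modelPath, int contextSize = 2048)
    {
        _modelParams = new ModelParams(modelPath) { ContextSize = (uint)contextSize };
        _weights = LLamaWeights.LoadFromFile(_modelParams);
        MaxTokenTotal = contextSize;
    }

    public int MaxTokenTotal { get; }

    // Rough heuristics; use the model's real tokenizer if you need exact counts.
    public int CountTokens(string text) => text.Length / 4;
    public IReadOnlyList<string> GetTokens(string text) => text.Split(' ');

    public async IAsyncEnumerable<string> GenerateTextAsync(
        string prompt,
        TextGenerationOptions options,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // A StatelessExecutor evaluates each prompt in a fresh context, so the KV cache
        // cannot fill up across successive AskAsync calls.
        var executor = new StatelessExecutor(_weights, _modelParams);
        var inferenceParams = new InferenceParams { MaxTokens = options.MaxTokens ?? 200 };

        await foreach (var token in executor.InferAsync(prompt, inferenceParams, cancellationToken))
        {
            yield return token;
        }
    }
}

Registration then replaces the WithLlamaTextGeneration call from the snippet above:

// Illustrative registration; keep your other .With* calls as before.
kernelMemoryBuilder = new KernelMemoryBuilder()
    .WithCustomTextGeneration(new CustomTextGenerator(modelPath))
    .WithLLamaSharpTextEmbeddingGeneration(m_LLamaSharpConfig)
    .Build();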

@dluc
Copy link
Collaborator

dluc commented Aug 21, 2024

FYI I've also upgraded KM to use the latest LLS packages, see v0.71. Cheers

dluc closed this as completed on Aug 21, 2024