Microsoft.KernelMemory version 0.68+ compatibility fix #862

SpaceAntelope · 2024-07-22T09:06:58Z

fixes #859

Issue details

The latest version of Microsoft.KernelMemory (0.68.240716.1 in my case) adds IReadOnlyList<string> GetTokens(string) to the interface Microsoft.KernelMemory.AI.ITextTokenizer.

This breaks any project that references the latest packages of LLamaSharp.kernel-memory and Microsoft.KernelMemory.Core together, mostly affecting developers who are just getting started with LLamaSharp.

How it's solved in this commit

This commit provides a tentative implementation using LLamaContext.Tokenizer to get the tokens in embedding form and StreamingTokenDecoder to turn them back into (parts of) words and return them.

My assumptions for the overall expected behavior are based on the implementation of CountTokens in LLamaSharpTextEmbeddingGenerator and LLamaSharpTextGenerator. This means that it fails on null input and returns an empty token that corresponds to the BOS embedding. Unit tests also check that the result of CountTokens matches the actual count of the tokens returned from GetTokens.
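The contract described above can be sketched in a language-neutral way. The following Python snippet is only an illustration, not the LLamaSharp code: `ToyTokenizer`, `BOS_ID`, `count_tokens` and `get_tokens` are hypothetical names, and the whitespace-based splitting merely stands in for a real model tokenizer. It shows the three assumed properties: null input fails, the first returned piece is an empty string for BOS, and the number of returned pieces equals the token count.

```python
import re
from typing import Dict, List

BOS_ID = 0  # hypothetical id of the beginning-of-sequence token


class ToyTokenizer:
    """Hypothetical stand-in for a model tokenizer (splits on whitespace)."""

    def __init__(self) -> None:
        self._id_to_piece: List[str] = [""]  # id 0 is BOS and decodes to ""
        self._piece_to_id: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        if text is None:
            raise ValueError("text must not be None")
        ids = [BOS_ID]  # every encoding starts with BOS
        for piece in re.findall(r"\s*\S+", text):
            if piece not in self._piece_to_id:
                # Grow the toy vocabulary on first sight of a piece.
                self._piece_to_id[piece] = len(self._id_to_piece)
                self._id_to_piece.append(piece)
            ids.append(self._piece_to_id[piece])
        return ids

    def decode_one(self, token_id: int) -> str:
        return self._id_to_piece[token_id]


def count_tokens(tokenizer: ToyTokenizer, text: str) -> int:
    return len(tokenizer.encode(text))


def get_tokens(tokenizer: ToyTokenizer, text: str) -> List[str]:
    # One textual piece per numeric token, so the length matches count_tokens.
    return [tokenizer.decode_one(i) for i in tokenizer.encode(text)]


tokenizer = ToyTokenizer()
pieces = get_tokens(tokenizer, "The quick fox")
```

In this toy model the BOS piece is an empty string, so joining the pieces reproduces the input directly; with a real tokenizer the leading piece may need trimming before comparison.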

Other considerations

In the unit tests I trim the 'actual' result to match the 'expected' to account for the added empty space that corresponds to the BOS token. Issues such as #856 indicate that further clarity will emerge with respect to how this should be properly handled.

martindevans · 2024-07-22T12:55:24Z

I've submitted a few review comments. The one with the empty strings I'm not really sure how best to handle, and if you want to go ahead with the current implementation I'm happy with that as long as there's at least a test covering this weirdness and a comment explaining what's going on.

SpaceAntelope · 2024-07-24T20:14:54Z

@martindevans I pushed the relevant changes. I created a duplicate unit test with only the Unicode cases and added this comment (also referenced in the GetTokens implementations):

  /* This is exactly the same test as the non-unicode cases. However, there are reasons why this
   * should be made a special case and may deviate in the future:
   * 
   * As of now there appears to be no final word as to how characters that consist of more than one 
   * numeric token should correspond to textual tokens, and results vary according to different 
   * models' tokenizers. For example, given a character 'Z' that corresponds to the numeric tokens {1,2,3} 
   * some (llama-2) will pad the length of the total number of tokens by returning spaces as tokens 
   * (i.e. ' ', ' ', 'Z') while others (GPT4Tokenizer) will pad with the character itself (i.e. 'Z','Z','Z').
   * 
   * This is very evident when tokenizing ideograms and emojis, but can arise with various Unicode characters
   * as well. See pull request for more relevant discussion https://github.com/SciSharp/LLamaSharp/pull/862
   *
   * Currently the method will remain consistent with the output of ITextTokenizer.CountTokens, meaning
   * any redundant tokens will not be omitted as long as they are counted by CountTokens.
   * 
   * StreamingTokenDecoder, while sufficiently useful for this task, was not designed with producing
   * output for one numeric token at a time in mind, so ITextTokenizer.GetTokens should not be considered 
   * an example of proper use.
   * 
   * Note: if this message is removed, also remove references to it in LLamaSharpTextEmbeddingGenerator.GetTokens
   * and LLamaSharpTextGenerator.GetTokens
   */
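
One way such redundant pieces can arise is sketched below in Python, as a language-neutral illustration rather than the LLamaSharp implementation. It assumes a tokenizer that splits a single character across several byte-level tokens (the split shown is hypothetical) and a decoder that buffers incomplete UTF-8 sequences, emitting nothing until the character is complete. `decode_piecewise` is a hypothetical helper; only Python's standard `codecs` incremental decoder is used.

```python
import codecs
from typing import List


def decode_piecewise(token_bytes: List[bytes]) -> List[str]:
    """Decode one token at a time, keeping one output string per token."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # An incomplete multi-byte sequence is buffered and yields "" for that token.
    return [decoder.decode(chunk) for chunk in token_bytes]


# The emoji occupies four UTF-8 bytes; assume (hypothetically) that the
# tokenizer spreads them over three numeric tokens.
emoji_bytes = "😀".encode("utf-8")
token_bytes = [b"Hi", emoji_bytes[:1], emoji_bytes[1:3], emoji_bytes[3:]]

pieces = decode_piecewise(token_bytes)
```

Here the number of pieces still equals the number of tokens, which is the consistency with CountTokens that the comment describes, at the cost of empty placeholder entries before the completed character.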

SpaceAntelope added 4 commits July 22, 2024 11:27

added ITextTokenizer.GetTokens implementation to affected generators

a018ea4

updated LLama.KernelMemory to use Microsoft.KernelMemory.Abstractions 0.68

a2ff5fa

updated LLama.Unittest with reference to LLama.KernelMemory

578bfa7

added some unit tests for ITextTokenizer.GetTokens implementation

4a9b822

SpaceAntelope mentioned this pull request Jul 22, 2024

[BUG]: Method 'GetTokens' in type 'LLamaSharp.KernelMemory.LLamaSharpTextEmbeddingGenerator' from assembly 'LLamaSharp.KernelMemory, Version=0.14.0.0, Culture=neutral, PublicKeyToken=null' does not have an implementation. #859

Closed

martindevans reviewed Jul 22, 2024

LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs (resolved)

martindevans reviewed Jul 22, 2024

LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs (outdated, resolved)

martindevans reviewed Jul 22, 2024

LLama.Unittest/KernelMemory/ITextTokenizerTests.cs (resolved)

SpaceAntelope added 3 commits July 24, 2024 13:48

removed redundant .AsReadOnly, cleaned up usings

2532afd

changed misleading variable name

dd5ffa1

spun off unicode test cases and added short explanation of the issue of redundant tokens resulting from multi-token characters with ref to PR #862

63b50f5

fixed spelling errors in comments

939d2b1

martindevans merged commit d8f5172 into SciSharp:master Jul 24, 2024
6 checks passed

SpaceAntelope deleted the kernel-memory-68-compatibility-fix branch July 25, 2024 18:44
