Microsoft.KernelMemory version 0.68+ compatibility fix #862

Merged
martindevans merged 8 commits into SciSharp:master from kernel-memory-68-compatibility-fix
Jul 24, 2024


SpaceAntelope
Contributor

@SpaceAntelope SpaceAntelope commented Jul 22, 2024

fixes #859

Issue details

The latest version of Microsoft.KernelMemory (0.68.240716.1 in my case) adds IReadOnlyList<string> GetTokens(string) to the interface Microsoft.KernelMemory.AI.ITextTokenizer.

This breaks any project that references the latest packages of LLamaSharp.kernel-memory and Microsoft.KernelMemory.Core together, because the LLamaSharp classes that implement ITextTokenizer do not yet provide the new method. It mostly affects developers who are just getting started with LLamaSharp.

How it's solved in this commit

This commit provides a tentative implementation using LLamaContext.Tokenizer to get the tokens in embedding form and StreamingTokenDecoder to turn them back into (parts of) words and return them.
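The two-step idea described above (turn the text into numeric tokens, then decode them back one at a time) can be sketched without a model. The following is only an illustration under stated assumptions: `ToyByteTokenizer` is a hypothetical class that treats each UTF-8 byte as one numeric token, and `System.Text.Decoder` stands in for StreamingTokenDecoder. It is not the code in this PR and does not reflect how LLamaSharp tokenizes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical toy tokenizer: every UTF-8 byte is one "numeric token".
// This only mimics the shape of the problem, not a real model vocabulary.
public static class ToyByteTokenizer
{
    // Step 1: text -> numeric tokens.
    public static byte[] Tokenize(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        return Encoding.UTF8.GetBytes(text);
    }

    // Step 2: feed the numeric tokens to a stateful decoder one at a time
    // and record whatever text it can emit after each token.
    public static IReadOnlyList<string> GetTokens(string text)
    {
        byte[] numericTokens = Tokenize(text);
        Decoder decoder = Encoding.UTF8.GetDecoder(); // buffers incomplete characters
        var pieces = new List<string>(numericTokens.Length);
        var buffer = new char[2]; // a single byte completes at most one surrogate pair

        for (int i = 0; i < numericTokens.Length; i++)
        {
            int written = decoder.GetChars(numericTokens, i, 1, buffer, 0, flush: false);
            // An empty string is recorded while a multi-byte character is still incomplete.
            pieces.Add(new string(buffer, 0, written));
        }

        return pieces;
    }
}
```

Calling `ToyByteTokenizer.GetTokens("a\u4e2d")` yields four pieces, one per numeric token. The character U+4E2D needs three bytes, so the first two of its pieces are empty and only the third carries the character. This is the same kind of "empty string" output that the review discussion refers to.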

My assumptions about the overall expected behavior are based on the implementation of CountTokens in LLamaSharpTextEmbeddingGenerator and LLamaSharpTextGenerator. This means that it fails on null input and returns an empty token that corresponds to the BOS embedding. The unit tests also check that the result of CountTokens matches the actual count of the tokens returned from GetTokens.
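The behavior described above can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for the example: `IToyTextTokenizer` is a local stand-in that mirrors only the two members under discussion (the real interface lives in Microsoft.KernelMemory.AI), and `ToyBosTokenizer` is a hypothetical word-level tokenizer, not one of the LLamaSharp classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in that mirrors the two members discussed in this PR.
public interface IToyTextTokenizer
{
    int CountTokens(string text);
    IReadOnlyList<string> GetTokens(string text);
}

// Hypothetical tokenizer: a leading BOS token followed by one token per word.
public sealed class ToyBosTokenizer : IToyTextTokenizer
{
    private const int Bos = 0; // assumed id for the beginning-of-sequence token

    // Both public methods share this helper, so their results cannot drift apart.
    private static List<(int Id, string Text)> Tokenize(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        // The BOS token has no visible text, so it decodes to an empty string.
        var tokens = new List<(int Id, string Text)> { (Bos, string.Empty) };
        int nextId = 1;
        foreach (var word in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
            tokens.Add((nextId++, word));
        return tokens;
    }

    public int CountTokens(string text) => Tokenize(text).Count;

    public IReadOnlyList<string> GetTokens(string text) =>
        Tokenize(text).Select(t => t.Text).ToList();
}
```

With this toy, `GetTokens("hello world")` returns three entries, the first of which is the empty BOS entry, and `CountTokens("hello world")` returns the same number. Passing null to either method throws ArgumentNullException. Deriving both methods from one tokenization routine is one simple way to keep the count consistent.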

Other considerations

In the unit tests I trim the 'actual' result to match the 'expected' to account for the added empty space that corresponds to the BOS token. Issues such as #856 indicate that further clarity will emerge with respect to how this should be properly handled.

@martindevans
Member

I've submitted a few review comments. The one with the empty strings I'm not really sure how best to handle, and if you want to go ahead with the current implementation I'm happy with that as long as there's at least a test covering this weirdness and a comment explaining what's going on.

@SpaceAntelope
Contributor Author

SpaceAntelope commented Jul 24, 2024

@martindevans I pushed the relevant changes. I created a duplicate unit test with only the Unicode cases and added this comment (also referenced in the GetTokens implementations):

  /* This is exactly the same test as the non-unicode cases. However, there are reasons why this
   * should be made a special case and may deviate in the future:
   * 
   * As of now there appears to be no final word as to how characters that consist of more than one 
   * numeric token should correspond to textual tokens, and results vary according to different 
   * models' tokenizers. For example, given a character 'Z' that corresponds to the numeric tokens {1,2,3} 
   * some (llama-2) will pad the length of the total number of tokens by returning spaces as tokens 
   * (i.e. ' ', ' ', 'Z') while others (GPT4Tokenizer) will pad with the character itself (i.e. 'Z','Z','Z').
   * 
   * This is very evident when tokenizing ideograms and emojis, but can arise with various unicode characters 
   * as well. See pull request for more relevant discussion https://github.com/SciSharp/LLamaSharp/pull/862
   *
   * Currently the method will remain consistent with the output of ITextTokenizer.CountTokens, meaning
   * any redundant tokens will not be omitted as long as they are counted by CountTokens.
   * 
   * StreamingTokenDecoder, while sufficiently useful for this task, was not designed with producing
   * output for one numeric token at a time in mind, so ITextTokenizer.GetTokens should not be considered 
   * an example of proper use.
   * 
   * Note: if this message is removed, also remove references to it in LLamaSharpTextEmbeddingGenerator.GetTokens
   * and LLamaSharpTextGenerator.GetTokens
   */
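To make the two padding conventions from the comment concrete, here is a small sketch. The names `PaddingStyle`, `Blank`, `Repeat` and `ToyPadding` are invented for this illustration, and the code only reproduces the shapes quoted in the comment (' ', ' ', 'Z' versus 'Z', 'Z', 'Z'). It does not call any real tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names for the two conventions described in the comment.
public enum PaddingStyle
{
    Blank,  // pad with spaces, then the character: ' ', ' ', 'Z'
    Repeat  // pad with the character itself: 'Z', 'Z', 'Z'
}

public static class ToyPadding
{
    // Textual tokens for ONE character that a model encodes as
    // `numericTokenCount` numeric tokens.
    public static IReadOnlyList<string> TextTokensFor(
        string character, int numericTokenCount, PaddingStyle style)
    {
        if (character is null) throw new ArgumentNullException(nameof(character));
        if (numericTokenCount < 1) throw new ArgumentOutOfRangeException(nameof(numericTokenCount));

        string filler = style == PaddingStyle.Blank ? " " : character;

        // (count - 1) filler entries followed by the character itself, so the
        // list length always equals the number of numeric tokens.
        return Enumerable.Repeat(filler, numericTokenCount - 1)
                         .Append(character)
                         .ToList();
    }
}
```

For a character "Z" that maps to three numeric tokens, `TextTokensFor("Z", 3, PaddingStyle.Blank)` gives " ", " ", "Z" and `TextTokensFor("Z", 3, PaddingStyle.Repeat)` gives "Z", "Z", "Z". Both lists have three entries, which is the property the comment says GetTokens keeps in line with CountTokens.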

@martindevans martindevans merged commit d8f5172 into SciSharp:master Jul 24, 2024
6 checks passed
@SpaceAntelope SpaceAntelope deleted the kernel-memory-68-compatibility-fix branch July 25, 2024 18:44