Skip to content

Latest commit

 

History

History
234 lines (172 loc) · 14.6 KB

libraries.md

File metadata and controls

234 lines (172 loc) · 14.6 KB

.NET Libraries in .NET 9 Preview 3 - Release Notes

.NET 9 Preview 3 includes several new libraries features. We focused on the following areas:

  • Enhancements to the Tokenizers Library
  • TimeSpan.From overloads
  • Added PersistableAssemblyBuilder type in System.Reflection.Emit

Libraries updates in .NET 9 Preview 3:

.NET 9 Preview 3:

Enhancements to the Tokenizers Library

Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens.

When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted / local models, tokens are the inputs provided to those models.

A couple of years ago, we introduced Microsoft.ML.Tokenizers, an open-source, cross-platform tokenization library. At the time, the library was scoped to the Byte-Pair Encoding (BPE) tokenization strategy to satisfy the language set of scenarios in ML.NET.

Over the past few months, we've been making enhancements to the library in the following ways:

  • Refined APIs and existing functionality
  • Added Tiktoken support
  • Added LlamaTokenizer support
  • Worked closely with the DeepDev TokenizerLib and SharpToken communities to cover scenarios covered by those libraries. If you're using DeepDev or SharpToken, we recommend migrating to Microsoft.ML.Tokenizers. For more details, see the migration guide.

The following samples demonstrate the utilization of these tokenizers for text tokenization.

Using Tiktoken tokenizer

using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
string text = "Hello, World!";

// Encode to Ids
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}"); // encodedIds = {9906, 11, 4435, 0}

// Decode Ids to text
string decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}"); // decodedText = Hello, World!

// Get token count
int idsCount = tokenizer.CountTokens(text); 
Console.WriteLine($"idsCount = {idsCount}"); // idsCount = 4

// Full encoding
EncodingResult result = tokenizer.Encode(text);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}"); // result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}");   // result.Offsets = {(0, 5), (5, 1), (6, 6), (12, 1)}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}");           // result.Ids = {9906, 11, 4435, 0}

// Encode up to number of tokens limit
int index1 = tokenizer.IndexOfTokenCount(text, maxTokenCount: 1, out string processedText1, out int tokenCount1); // Encode up to one token
Console.WriteLine($"processedText1 = {processedText1}"); // processedText1 = Hello, World!
Console.WriteLine($"tokenCount1 = {tokenCount1}");       // tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");                 // index1 = 5

int index2 = tokenizer.LastIndexOfTokenCount(text, maxTokenCount: 1, out string processedText2, out int tokenCount2); // Encode from end up to one token
Console.WriteLine($"processedText2 = {processedText2}"); // processedText2 = Hello, World!
Console.WriteLine($"tokenCount2 = {tokenCount2}");       // tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");                 // index2 = 12

Using Llama tokenizer

using Microsoft.ML.Tokenizers;
using System.Net.Http;

// Create the Tokenizer
HttpClient httpClient = new HttpClient();
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
Tokenizer tokenizer = Tokenizer.CreateLlama(remoteStream);

string text = "Hello, World!";

// Encode to Ids
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}"); // encodedIds = {1, 15043, 29892, 2787, 29991}

// Decode Ids to text
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}"); // decodedText = Hello, World!

// Get token count
int idsCount = tokenizer.CountTokens(text); 
Console.WriteLine($"idsCount = {idsCount}"); idsCount = 5

// Full encoding
EncodingResult result = tokenizer.Encode(text);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}"); // result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}");   // result.Offsets = {(0, 0), (0, 6), (6, 1), (7, 6), (13, 1)}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}");           // result.Ids = {1, 15043, 29892, 2787, 29991}

// Encode up to number of tokens limit
int index1 = tokenizer.IndexOfTokenCount(text, maxTokenCount: 2, out string processedText1, out int tokenCount1); // Encode up to two token
Console.WriteLine($"processedText1 = {processedText1}"); // processedText1 =  ▁Hello,▁World!
Console.WriteLine($"tokenCount1 = {tokenCount1}");       // tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");                 // index1 = 6

int index2 = tokenizer.LastIndexOfTokenCount(text, maxTokenCount: 1, out string processedText2, out int tokenCount2); // Encode from end up to one token
Console.WriteLine($"processedText2 = {processedText2}"); // processedText2 =  ▁Hello,▁World!
Console.WriteLine($"tokenCount2 = {tokenCount2}");       // tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");                 // index2 = 13

Adding TimeSpan.From overloads

The TimeSpan class offers several From methods enabling users to create a TimeSpan using a double. However, since double is a binary-based floating-point format, inherent imprecision may lead to errors. For instance, TimeSpan.FromSeconds(101.832) may not precisely represent 101 seconds, 832 milliseconds, but rather approximately 101 seconds, 831.9999999999936335370875895023345947265625 milliseconds. This discrepancy has often caused user confusion and API surface bugs over time, requiring users to address and rationalize them. Moreover, it's not the most efficient way to represent such data and can pose challenges for users expecting specific behavior. To address this, new overloads have been introduced allowing users to pass integers, ensuring they achieve the desired and intended behavior.

Many thanks to Tommy Sørbråten for contributing the implementation of the added overloads.

Sample

TimeSpan timeSpan1 = TimeSpan.FromSeconds(value: 101.832);
Console.WriteLine($"timeSpan1 = {timeSpan1}"); // timeSpan1 = 00:01:41.8319999

TimeSpan timeSpan2 = TimeSpan.FromSeconds(seconds: 101, milliseconds: 832);
Console.WriteLine($"timeSpan2 = {timeSpan2}"); // timeSpan2 = 00:01:41.8320000

Added Overloads

public partial struct TimeSpan
{
    public static TimeSpan FromDays(int days);
    public static TimeSpan FromDays(int days, int hours = 0, long minutes = 0, long seconds = 0, long milliseconds = 0, long  microseconds = 0);
    public static TimeSpan FromHours(int hours);
    public static TimeSpan FromHours(int hours, long minutes = 0, long seconds = 0, long milliseconds = 0, long microseconds = 0);
    public static TimeSpan FromMinutes(long minutes);
    public static TimeSpan FromMinutes(long minutes, long seconds = 0, long milliseconds = 0, long microseconds = 0);
    public static TimeSpan FromSeconds(long seconds);
    public static TimeSpan FromSeconds(long seconds, long milliseconds = 0, long microseconds = 0);
    public static TimeSpan FromMilliseconds(long milliseconds, long microseconds = 0);
    public static TimeSpan FromMicroseconds(long microseconds);
}

Added PersistableAssemblyBuilder type in System.Reflection.Emit

We added a persisted AssemblyBuilder implementation and related APIs in .NET 9 preview 1. Further, we needed to add more APIs for setting EntryPoint and other properties of the final binary. There was no efficient way to add an API that allows setting all available options for an assembly. We decided to let the users handle their assembly building process by themselves using the parameters of PEHeaderBuilder and ManagedPEBuilder constructors. These constructor options covers all options that existed in .NET framework, (but not exactly same way), plus provide many other options that are available in .NET Core.

In order to achieve this, we provide all metadata information produced with Reflection.Emit APIs with a new MetadataBuilder GenerateMetadata(out BlobBuilder ilStream, out BlobBuilder mappedFieldData) method so that user could embed them into the corresponding section of PEBuidler. But because the MetadataBuilder and BlobBuilder types are not accessible from within CoreLib we made the PersistedAssemblyBuilder type public and moved the new APIs from the base AssemblyBuilder type into this new type.

API updates

public abstract partial class AssemblyBuilder
{
    // These APIs moved from AssemblyBuilder into PersistedAssemblyBuilder type
-   public static AssemblyBuilder DefinePersistedAssembly(AssemblyName name, Assembly coreAssembly, IEnumerable<CustomAttributeBuilder>? assemblyAttributes = null);

-   public void Save(Stream stream);
-   public void Save(string assemblyFileName);
-   protected abstract void SaveCore(Stream stream);
}

+public sealed class PersistedAssemblyBuilder : AssemblyBuilder
{
+   public PersistedAssemblyBuilder(AssemblyName name, Assembly coreAssembly, IEnumerable<CustomAttributeBuilder>? assemblyAttributes = null);

+   public void Save(Stream stream);
+   public void Save(string assemblyFileName);
    // New method that can be used for generating custom assembly with entry point and other options 
+   public MetadataBuilder GenerateMetadata(out BlobBuilder ilStream, out BlobBuilder mappedFieldData);
}

Sample usage

Because of the above changes to create a persisted AssemblyBuilder instance use PersistedAssemblyBuilder(AssemblyName name, Assembly coreAssembly, IEnumerable<CustomAttributeBuilder>? assemblyAttributes = null) constuctor instead of AssemblyBuilder.DefinePersistedAssembly(AssemblyName name, Assembly coreAssembly, IEnumerable<CustomAttributeBuilder>? assemblyAttributes = null). Then after emitting all members you can call the Save method to save the assembly with default settings.

PersistedAssemblyBuilder ab = new PersistedAssemblyBuilder(new AssemblyName("MyAssembly"), typeof(object).Assembly);
TypeBuilder tb = ab.DefineDynamicModule("MyModule").DefineType("MyType", TypeAttributes.Public | TypeAttributes.Class);
// ...
MethodBuilder entryPoint = tb.DefineMethod("Main", MethodAttributes.Public | MethodAttributes.Static);
ILGenerator il2 = entryPoint.GetILGenerator();
// ...
tb.CreateType();

ab.Save("MyAssembly.dll")

In case you want to set entry point and/or other options you can call public MetadataBuilder GenerateMetadata(out BlobBuilder ilStream, out BlobBuilder mappedFieldData) method and use the produced metadata for saving assembly as needed. Below example shows how to set entry point for an assembly and save it as executable:

PersistedAssemblyBuilder ab = new PersistedAssemblyBuilder(new AssemblyName("MyAssembly"), typeof(object).Assembly);
TypeBuilder tb = ab.DefineDynamicModule("MyModule").DefineType("MyType", TypeAttributes.Public | TypeAttributes.Class);
// ...
MethodBuilder entryPoint = tb.DefineMethod("Main", MethodAttributes.Public | MethodAttributes.Static);
ILGenerator il2 = entryPoint.GetILGenerator();
// ...
tb.CreateType();

MetadataBuilder metadataBuilder = ab.GenerateMetadata(out BlobBuilder ilStream, out BlobBuilder fieldData);
PEHeaderBuilder peHeaderBuilder = new PEHeaderBuilder(
                imageCharacteristics: Characteristics.ExecutableImage);

ManagedPEBuilder peBuilder = new ManagedPEBuilder(
                header: peHeaderBuilder,
                metadataRootBuilder: new MetadataRootBuilder(metadataBuilder),
                ilStream: ilStream,
                mappedFieldData: fieldData,
                entryPoint: MetadataTokens.MethodDefinitionHandle(entryPoint.MetadataToken));

BlobBuilder peBlob = new BlobBuilder();
peBuilder.Serialize(peBlob);
using var fileStream = new FileStream("MyAssembly.exe", FileMode.Create, FileAccess.Write);
peBlob.WriteContentTo(fileStream);

The constructor resolution for ActivatorUtilities.CreateInstance() with the attribute [ActivatorUtilitiesConstructor] has changed to always use the attribute. Previously, a constructor without the attribute but with more parameters was selected but only if it was declared after the constructor with the attribute.

The change was made to allow full, unambiguous control over which constructor is used.

Mostly due to these changes, performance has also increased; the more constructors there are, the greater the performance impact. For a simple case of 3 constructors, CreateInstance() is now twice is fast. If performance is a concern here, and only one constructor should ever be called, this may be a reason to add [ActivatorUtilitiesConstructor] to your classes.