This guide outlines a systematic approach for quantizing your models using iterative refinement of the importance matrix (I-matrix) and evaluating quantization quality through KL-divergence metrics. By following these steps, you can optimize your model quantization to achieve minimal performance loss across diverse data subsets.
Before quantizing your model, you need a dataset to generate the I-matrix and to evaluate the quantized models. The `imatrix_dataset.py` tool helps collect and preprocess this data. Because of the way the `datasets` library works, it's best to download a larger dataset upfront and then split it; this avoids the overhead of repeatedly accessing the library, which can be inefficient. The tool can skip samples so more can be added later as needed, so it is often convenient to reserve the first part of the downloaded data for testing and leave the later part for the I-matrix itself.
To optimize the process:

- Estimate the largest I-matrix dataset size you'll need.
- Add the size of your test dataset to this estimate.
- Download a dataset that meets this combined size.
- Split the downloaded dataset into two parts:
  - The earlier portion for quantization testing.
  - The later portion as the pool for creating and expanding the I-matrix dataset.
This approach ensures you have all the necessary data while minimizing redundant downloads and improving overall efficiency.
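As a concrete illustration, the split can be a few lines of Python; the file names and the 10,000-line test size below are placeholders:

```python
# Split a downloaded dataset into a test portion (earlier lines)
# and an I-matrix pool (later lines). Paths and sizes are placeholders.
TEST_LINES = 10_000

with open("full_dataset.txt", encoding="utf-8") as src, \
     open("test_dataset.txt", "w", encoding="utf-8") as test, \
     open("imatrix_pool.txt", "w", encoding="utf-8") as pool:
    for i, line in enumerate(src):
        (test if i < TEST_LINES else pool).write(line)
```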
### 1. Generate the Dataset Using `imatrix_dataset.py`

The `imatrix_dataset.py` script lets you create a dataset tailored to your needs. It relies on a flexible plugin system to handle different data sources, so you can draw from Hugging Face datasets, the OSCAR corpus, or your own custom sources (the plugin system is covered in detail later in this guide).

Example command:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/hf_dataset.py \
    --plugin_class HFDatasetPlugin \
    --langs en \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
- The `--datasource_plugin` parameter specifies the path to the data source plugin script.
- The `--plugin_class` parameter specifies the class name within the plugin script.
- Adjust the `--langs` parameter to include the languages relevant to your dataset (e.g., `en`, `es`, `de`).
- The `--num_samples` parameter specifies the number of samples to use.
- The `--output` parameter specifies the path where the generated dataset will be saved.
### 2. Run `generate_logits.py` on the Baseline Model

Begin by generating logits from your unquantized (baseline) model over the dataset you prepared. Use the `generate_logits.py` script to create a reusable HDF5 file containing these logits. Generating the baseline logits once saves time and storage in subsequent comparisons.

Example command:
```
❯ uv run src/generate_logits.py \
    --model /path/to/baseline_model.gguf \
    --context-size <size, e.g., 2048> \
    --dataset /path/to/dataset.txt \
    --output baseline_logits.hdf5
```
- Replace `/path/to/baseline_model.gguf` with the path to your unquantized model in GGUF format, and `<size>` with your model's context size.
- The `--dataset` parameter uses the dataset generated by `imatrix_dataset.py` in the previous step.
- The `--output` parameter specifies the name of the output HDF5 file.
Notes:

- If you are working with a large dataset, you can use the `--from` and/or `--to` arguments to process it in resumable chunks, generating logits gradually without starting over each time.
- At present, the llama-cpp-python library defaults to a context size of 512 rather than reading it from the underlying model, so `--context-size` was deliberately made required.
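For very long runs you can script the chunking. The sketch below drives `generate_logits.py` in fixed-size chunks via `subprocess`; the totals are placeholders, and it assumes `--from`/`--to` take sample offsets into the dataset:

```python
import subprocess

TOTAL_SAMPLES = 100_000  # placeholder: size of your dataset
CHUNK = 10_000           # placeholder: samples per resumable run

for start in range(0, TOTAL_SAMPLES, CHUNK):
    end = min(start + CHUNK, TOTAL_SAMPLES)
    # Each run is assumed to resume the same HDF5 file where the
    # previous chunk ended, so an interrupted job restarts cheaply.
    subprocess.run(
        ["uv", "run", "src/generate_logits.py",
         "--model", "/path/to/baseline_model.gguf",
         "--context-size", "2048",
         "--dataset", "/path/to/dataset.txt",
         "--output", "baseline_logits.hdf5",
         "--from", str(start),
         "--to", str(end)],
        check=True,
    )
```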
### 3. Save the Baseline Logits for Reference

Store the generated `baseline_logits.hdf5` file in a secure and accessible location. This file will serve as the reference point for comparing outputs from quantized models in later steps.
### 4. Quantize the Model Without an I-Matrix Using `quantize.py`

Create an initial quantized version of your model without using an I-matrix. The `quantize.py` script streamlines the process and allows for consistent quantization settings.

Example command:
```
❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --output-dir /path/to/output_dir
```
- Replace `my_model` with a name for your model.
- Replace `/path/to/baseline_model.gguf` with the path to your unquantized model.
- The `--quantizations` parameter specifies the quantization type(s); here, `q4_0` is used.
- The `--output-dir` parameter specifies where the quantized model will be saved.
### 5. Run `kl_d_bench.py` for an Initial Comparison

Use the `kl_d_bench.py` script to compare the logits of the quantized model against the baseline logits. The script processes the stored baseline logits and computes KL-divergence metrics efficiently.

Example command:
```
❯ uv run src/kl_d_bench.py \
    --baseline-logits baseline_logits.hdf5 \
    --target-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/dataset.txt \
    --output-file initial_comparison.hdf5
```
- Replace `/path/to/output_dir/my_model-q4_0.gguf` with the path to your quantized model.
- The `--output-file` parameter specifies where to save the comparison results.
### 6. Collect and Evaluate Metrics

After running `kl_d_bench.py`, review the KL-divergence metrics, including the median and the 90th, 95th, and 99th percentiles for each data chunk. This initial assessment serves as the reference point for evaluating improvements from I-matrix calibration in subsequent iterations.
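If you want to inspect the results programmatically rather than through the script's output, something like the following works. The HDF5 layout is an assumption: the sketch presumes a flat `kl_divergence` array of per-token values, so check the actual keys in your file first:

```python
import h5py
import numpy as np

# "kl_divergence" is a hypothetical dataset name; list f.keys() to
# discover the real layout produced by kl_d_bench.py.
with h5py.File("initial_comparison.hdf5", "r") as f:
    kld = np.asarray(f["kl_divergence"])

print(f"median: {np.median(kld):.6f}")
for p in (90, 95, 99):
    print(f"p{p}: {np.percentile(kld, p):.6f}")
```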
### 7. Generate I-Matrices with Incrementally Larger Dataset Subsets

Begin refining the I-matrix by generating it from a small subset of your dataset, then increase the subset size on each iteration to improve the I-matrix. If you haven't already, use the `imatrix_dataset.py` script to create an I-matrix dataset tailored to your data.

Example command:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/hf_dataset.py \
    --plugin_class HFDatasetPlugin \
    --langs en \
    --num_samples 50000 \
    --output imatrix_en_50k.bin
```
- Increase `--num_samples` on each iteration (e.g., 50,000, 100,000, 200,000).
- The `--output` parameter specifies the name of the I-matrix file for each iteration.

Note: `imatrix_dataset.py` uses a plugin system to handle various data sources; the details are covered later in this guide.
### 8. Quantize the Model with the New I-Matrix Using `quantize.py`

Use the updated I-matrix in the quantization process.

Example command:
```
❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --imatrix-path imatrix_en_50k.bin \
    --output-dir /path/to/output_dir
```
- The `--imatrix-path` parameter specifies the path to the I-matrix file generated in the current iteration.
### 9. Use `kl_d_bench.py` for Comparison

For each quantized model with an updated I-matrix:
1. **Run the comparison script:**

   ```
   ❯ uv run src/kl_d_bench.py \
       --baseline-logits baseline_logits.hdf5 \
       --target-model /path/to/output_dir/my_model-q4_0.gguf \
       --dataset /path/to/dataset.txt \
       --output-file comparison_iteration_n.hdf5
   ```
   - Update `comparison_iteration_n.hdf5` to reflect the iteration number (e.g., `comparison_iteration_1.hdf5`).

2. **Analyze the KL-divergence metrics.**

   Focus on key metrics, especially the high-percentile KL-divergence values, to assess the effectiveness of the quantization with each updated I-matrix.
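Steps 7 through 9 repeat with growing sample counts, so they are natural candidates for a driver script. Here is a minimal sketch using the commands shown above; paths, model names, and the size schedule are placeholders:

```python
import subprocess

def run(cmd):
    # Abort the iteration loop if any stage fails.
    subprocess.run(cmd, check=True)

# Placeholder schedule; extend it until the metrics plateau.
for i, n in enumerate([50_000, 100_000, 200_000], start=1):
    imatrix = f"imatrix_en_{n // 1000}k.bin"

    # Step 7: build a larger I-matrix dataset.
    run(["uv", "run", "src/imatrix_dataset.py",
         "--datasource_plugin", "src/imatrix_dataset/hf_dataset.py",
         "--plugin_class", "HFDatasetPlugin",
         "--langs", "en",
         "--num_samples", str(n),
         "--output", imatrix])

    # Step 8: quantize with the new I-matrix.
    run(["uv", "run", "src/quantize.py", "quantize",
         "--model-name", "my_model",
         "--base-model", "/path/to/baseline_model.gguf",
         "--quantizations", "q4_0",
         "--imatrix-path", imatrix,
         "--output-dir", "/path/to/output_dir"])

    # Step 9: compare against the stored baseline logits.
    run(["uv", "run", "src/kl_d_bench.py",
         "--baseline-logits", "baseline_logits.hdf5",
         "--target-model", "/path/to/output_dir/my_model-q4_0.gguf",
         "--dataset", "/path/to/dataset.txt",
         "--output-file", f"comparison_iteration_{i}.hdf5"])
```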
### 10. Evaluate Metrics to Determine When to Stop

Monitor the KL-divergence metrics across iterations, paying special attention to the 90th, 95th, and 99th percentiles. When successive iterations show only marginal improvements (diminishing returns), you can consider the I-matrix sufficiently refined for your application.

To balance overall performance with outlier minimization, we suggest using a composite metric that combines the median KL-divergence and the higher-percentile values.
Suggested composite metric:

$$\text{Score} = \text{Median}^{1/3} \cdot \left( KLD_{99} + 4\,KLD_{95} + 5\,KLD_{90} \right)^{2/3}$$
Explanation:

- **Median KL-divergence** ($\text{Median}$): represents typical performance across the dataset.
- **High-percentile KL-divergence** ($KLD_{90}$, $KLD_{95}$, $KLD_{99}$): capture the worst-case divergences, indicating how well the model handles outlier cases.
- **Weighting factors**: the weights (1 for $KLD_{99}$, 4 for $KLD_{95}$, 5 for $KLD_{90}$) emphasize reducing higher divergences, with greater weight on the percentiles covering more data.
- **Exponents**: the exponents (1/3 for the median, 2/3 for the weighted sum) balance the influence of typical performance and outlier cases in the overall score.
By minimizing this composite score, you ensure that the quantized model maintains strong overall performance while mitigating significant divergences in less common scenarios.
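As a sketch, the score can be computed directly from the statistics collected in step 6 (the numbers in the usage example are made up):

```python
def composite_score(median: float, kld90: float, kld95: float, kld99: float) -> float:
    """Composite KL-divergence score from the formula above; lower is better."""
    weighted_tail = kld99 + 4 * kld95 + 5 * kld90
    return median ** (1 / 3) * weighted_tail ** (2 / 3)

# Hypothetical values for two iterations; keep the lower-scoring I-matrix.
score_1 = composite_score(median=0.012, kld90=0.045, kld95=0.070, kld99=0.150)
score_2 = composite_score(median=0.011, kld90=0.044, kld95=0.068, kld99=0.162)
print("iteration 2 improves on iteration 1" if score_2 < score_1 else "no improvement")
```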
Selecting an appropriate dataset size and coverage is crucial for effective I-matrix calibration. We recommend:
- Starting Small: Use a representative subset that includes key languages or domains relevant to your application.
- Gradual Expansion: Increase the dataset size in each iteration to include more diversity and complexity.
- Balancing Diversity and Size: Ensure that the dataset remains manageable while covering the necessary range of tokens and contexts.
For detailed insights into dataset selection and initial sizing, refer to `on_quantization.md`. This document provides guidance on balancing dataset diversity and size to optimize the I-matrix calibration process.
### Optimizing Batch Sizes with `best_bub.py`

Before generating logits or quantizing large models, you may want to optimize the batch (`--batch`) and micro-batch (`--ubatch`) sizes to maximize performance given your hardware constraints.

Example command:
```
❯ uv run src/best_bub.py --model /path/to/baseline_model.gguf --context-size 2048
```
- Adjust `--context-size` to match your model's maximum context size.
- The script will suggest optimal `--batch-size` and `--ubatch-size` settings.
### Measuring Perplexity with `quantize.py`

After quantization, you can measure the perplexity of your quantized model to assess its performance.

Example command:
```
❯ uv run src/quantize.py perplexity \
    --model-name my_model \
    --base-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/perplexity_dataset.txt
```
- Replace `/path/to/perplexity_dataset.txt` with a dataset suitable for perplexity measurement.
As mentioned earlier, the `imatrix_dataset.py` script uses a plugin system to support various data sources flexibly. Here's how to use it:
### Selecting a Data Source Plugin

Choose an existing plugin or create a new one, depending on your data source.

- **Existing plugins**: `oscar_plugin.py` is an example Hugging Face dataset plugin, for the OSCAR corpus.
- **Custom plugins**: create a plugin tailored to your data source.
### Specifying the Plugin in the Command

When running `imatrix_dataset.py`, use the following arguments to specify the plugin:

- `--datasource_plugin`: path to the plugin script.
- `--plugin_class`: the class name of the plugin within the script.

Example command with an existing plugin:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en es de \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
### Creating a Custom Data Source Plugin

If your data source isn't covered by an existing plugin, you can create your own.

Steps to create a custom plugin:

1. **Create a new Python file.** Save it with a descriptive name, e.g., `my_custom_plugin.py`.

2. **Import the base class:**

   ```python
   from plugin_base import DataSourcePluginBase
   ```

3. **Define your plugin class**, inheriting from `DataSourcePluginBase`:

   ```python
   class MyCustomPlugin(DataSourcePluginBase):
       def __init__(self, name="my_dataset", **kwargs):
           super().__init__(name, **kwargs)
           self.schema = {'content': 'path.to.text.field'}
   ```

4. **Implement the `load_data` method:**

   ```python
   def load_data(self, lang, num_samples=200, skip_samples=0):
       # Implement your data loading logic here
       data_samples = []
       # Fetch data samples based on lang, num_samples, skip_samples
       return data_samples
   ```

5. **Optionally override the `get_content` method.** If your data records have a different structure, override this method to extract the text content:

   ```python
   def get_content(self, record):
       # Extract and return the text content from the record
       return record['desired_text_field']
   ```

Using your custom plugin:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin /path/to/my_custom_plugin.py \
    --plugin_class MyCustomPlugin \
    --langs en \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
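Putting these pieces together, here is a minimal, self-contained sketch of a complete plugin that serves newline-delimited text from a local file. The file path, record layout, and any behaviour of `DataSourcePluginBase` beyond the interface described above are assumptions:

```python
from plugin_base import DataSourcePluginBase

class LocalTextPlugin(DataSourcePluginBase):
    """Hypothetical plugin that serves lines from a local text file."""

    def __init__(self, name="local_text", path="/path/to/corpus.txt", **kwargs):
        super().__init__(name, **kwargs)
        self.path = path  # assumed constructor parameter, not part of the base class
        self.schema = {"content": "text"}

    def load_data(self, lang, num_samples=200, skip_samples=0):
        # A real plugin would filter by `lang`; a flat local file has no
        # language metadata, so the argument is ignored here.
        records = []
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i < skip_samples:
                    continue
                if len(records) >= num_samples:
                    break
                records.append({"text": line.strip()})
        return records

    def get_content(self, record):
        # Each record stores its text under the "text" key used above.
        return record["text"]
```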
### Understanding the Plugin Base Class

The `DataSourcePluginBase` class defines the interface that all plugins must implement:

- **Initialization**: set up any necessary configurations or parameters.
- **`load_data` method**: must return a list of data records for the specified language and sample counts.
- **`get_content` method**: extracts the textual content from a data record; used when building the combined dataset.
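For orientation, the base class likely looks something like the sketch below; this is an illustration of the interface just described, not the actual contents of `plugin_base.py`:

```python
from abc import ABC, abstractmethod

class DataSourcePluginBase(ABC):
    """Illustrative sketch of the plugin interface; details may differ
    from the real base class in plugin_base.py."""

    def __init__(self, name, **kwargs):
        self.name = name
        self.schema = {}  # subclasses map a 'content' key to their text field

    @abstractmethod
    def load_data(self, lang, num_samples=200, skip_samples=0):
        """Return a list of data records for the given language and counts."""

    def get_content(self, record):
        # Default extraction; subclasses override for other record layouts.
        return record["content"]
```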
### Example: Using the OSCAR Plugin

The OSCAR dataset is a large multilingual corpus. Here's how to use `oscar_plugin.py`:

```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en fr es \
    --num_samples 50000 \
    --output /path/to/dataset.txt
```
- This command generates a dataset using 50,000 samples from each of the specified languages.
This set of tools is designed to simplify each stage of model quantization, from setting up datasets and generating I-matrices to quantizing models and evaluating performance. By following these steps, you can make targeted, data-driven adjustments at every stage, helping you achieve quantization results that preserve model quality while accommodating diverse data requirements.