This guide outlines a systematic approach for quantizing your models using iterative refinement of the importance matrix (I-matrix) and evaluating quantization quality through KL-divergence metrics. By following these steps, you can optimize your model quantization to achieve minimal performance loss across diverse data subsets.
Before quantizing your model, you need a dataset to generate the I-matrix and to evaluate the quantized models. The `imatrix_dataset.py` tool helps collect and preprocess this data. Because of the way the `datasets` library works, it's best to download a larger dataset upfront and then split it; this avoids the overhead of repeatedly accessing the library, which can be inefficient. The tool can skip samples so more can be added later as needed, so it is often convenient to reserve the first part of the downloaded data for testing and leave the later part for the I-matrix itself.
To optimize the process:

- Estimate the largest I-matrix dataset size you'll need.
- Add the size of your test dataset to this estimate.
- Download a dataset that meets this combined size.
- Split the downloaded dataset into two parts:
  - The earlier portion for quantization testing.
  - The later portion as the pool for creating and expanding the I-matrix dataset.
This approach ensures you have all the necessary data while minimizing redundant downloads and improving overall efficiency.
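As a concrete illustration, the split can be a few lines of Python; the file names and the 10,000-line test size below are placeholders:

```python
# Split a downloaded dataset into a test portion (earlier lines)
# and an I-matrix pool (later lines). Paths and sizes are placeholders.
TEST_LINES = 10_000

with open("full_dataset.txt", encoding="utf-8") as src, \
     open("test_dataset.txt", "w", encoding="utf-8") as test, \
     open("imatrix_pool.txt", "w", encoding="utf-8") as pool:
    for i, line in enumerate(src):
        (test if i < TEST_LINES else pool).write(line)
```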
### 1. Generate the Dataset Using `imatrix_dataset.py`

The `imatrix_dataset.py` script lets you create a dataset tailored to your needs. It relies on a flexible plugin system to handle different data sources, so you can draw from Hugging Face datasets, the OSCAR corpus, or your own custom sources (the plugin system is covered in detail later in this guide).

Example command:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/hf_dataset.py \
    --plugin_class HFDatasetPlugin \
    --langs en \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
- The `--datasource_plugin` parameter specifies the path to the data source plugin script.
- The `--plugin_class` parameter specifies the class name within the plugin script.
- Adjust the `--langs` parameter to include the languages relevant to your dataset (e.g., `en`, `es`, `de`).
- The `--num_samples` parameter specifies the number of samples to use.
- The `--output` parameter specifies the path where the generated dataset will be saved.
### 2. Run `generate_logits.py` on the Baseline Model

Begin by generating logits from your unquantized (baseline) model over the dataset you prepared. Use the `generate_logits.py` script to create a reusable HDF5 file containing these logits. Generating the baseline logits once saves time and storage in subsequent comparisons.

Example command:
```
❯ uv run src/generate_logits.py \
    --model /path/to/baseline_model.gguf \
    --context-size <size, e.g., 2048> \
    --dataset /path/to/dataset.txt \
    --output baseline_logits.hdf5
```
- Replace `/path/to/baseline_model.gguf` with the path to your unquantized model in GGUF format, and `<size>` with your model's context size.
- The `--dataset` parameter uses the dataset generated by `imatrix_dataset.py` in the previous step.
- The `--output` parameter specifies the name of the output HDF5 file.
Notes:

- If you are working with a large dataset, you can use the `--from` and/or `--to` arguments to process it in resumable chunks, generating logits gradually without starting over each time.
- At present, the llama-cpp-python library defaults to a context size of 512 rather than reading it from the underlying model, so `--context-size` was deliberately made required.
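For very long runs you can script the chunking. The sketch below drives `generate_logits.py` in fixed-size chunks via `subprocess`; the totals are placeholders, and it assumes `--from`/`--to` take sample offsets into the dataset:

```python
import subprocess

TOTAL_SAMPLES = 100_000  # placeholder: size of your dataset
CHUNK = 10_000           # placeholder: samples per resumable run

for start in range(0, TOTAL_SAMPLES, CHUNK):
    end = min(start + CHUNK, TOTAL_SAMPLES)
    # Each run is assumed to resume the same HDF5 file where the
    # previous chunk ended, so an interrupted job restarts cheaply.
    subprocess.run(
        ["uv", "run", "src/generate_logits.py",
         "--model", "/path/to/baseline_model.gguf",
         "--context-size", "2048",
         "--dataset", "/path/to/dataset.txt",
         "--output", "baseline_logits.hdf5",
         "--from", str(start),
         "--to", str(end)],
        check=True,
    )
```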
### 3. Save the Baseline Logits for Reference

Store the generated `baseline_logits.hdf5` file in a secure and accessible location. This file will serve as the reference point for comparing outputs from quantized models in later steps.
### 4. Quantize the Model Without an I-Matrix Using `quantize.py`

Create an initial quantized version of your model without using an I-matrix. The `quantize.py` script streamlines the process and allows for consistent quantization settings.

Example command:
```
❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --output-dir /path/to/output_dir
```
- Replace `my_model` with a name for your model.
- Replace `/path/to/baseline_model.gguf` with the path to your unquantized model.
- The `--quantizations` parameter specifies the quantization type(s); here, `q4_0` is used.
- The `--output-dir` parameter specifies where the quantized model will be saved.
### 5. Run `kl_d_bench.py` for an Initial Comparison

Use the `kl_d_bench.py` script to compare the logits of the quantized model against the baseline logits. The script processes the stored baseline logits and computes KL-divergence metrics efficiently.

Example command:
```
❯ uv run src/kl_d_bench.py \
    --baseline-logits baseline_logits.hdf5 \
    --target-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/dataset.txt \
    --output-file initial_comparison.hdf5
```
- Replace `/path/to/output_dir/my_model-q4_0.gguf` with the path to your quantized model.
- The `--output-file` parameter specifies where to save the comparison results.
### 6. Collect and Evaluate Metrics

After running `kl_d_bench.py`, review the KL-divergence metrics, including the median and the 90th, 95th, and 99th percentiles for each data chunk. This initial assessment serves as the reference point for evaluating improvements from I-matrix calibration in subsequent iterations.
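If you want to inspect the results programmatically rather than through the script's output, something like the following works. The HDF5 layout is an assumption: the sketch presumes a flat `kl_divergence` array of per-token values, so check the actual keys in your file first:

```python
import h5py
import numpy as np

# "kl_divergence" is a hypothetical dataset name; list f.keys() to
# discover the real layout produced by kl_d_bench.py.
with h5py.File("initial_comparison.hdf5", "r") as f:
    kld = np.asarray(f["kl_divergence"])

print(f"median: {np.median(kld):.6f}")
for p in (90, 95, 99):
    print(f"p{p}: {np.percentile(kld, p):.6f}")
```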
### 7. Generate I-Matrices with Incrementally Larger Dataset Subsets

Begin refining the I-matrix by generating it from a small subset of your dataset, then increase the subset size on each iteration to improve the I-matrix. If you haven't already, use the `imatrix_dataset.py` script to create an I-matrix dataset tailored to your data.

Example command:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/hf_dataset.py \
    --plugin_class HFDatasetPlugin \
    --langs en \
    --num_samples 50000 \
    --output imatrix_en_50k.bin
```
- Increase `--num_samples` on each iteration (e.g., 50,000, 100,000, 200,000).
- The `--output` parameter specifies the name of the I-matrix file for each iteration.

Note: `imatrix_dataset.py` uses a plugin system to handle various data sources; the details are covered later in this guide.
### 8. Quantize the Model with the New I-Matrix Using `quantize.py`

Use the updated I-matrix in the quantization process.

Example command:
```
❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --imatrix-path imatrix_en_50k.bin \
    --output-dir /path/to/output_dir
```
- The `--imatrix-path` parameter specifies the path to the I-matrix file generated in the current iteration.
### 9. Use `kl_d_bench.py` for Comparison

For each quantized model with an updated I-matrix:
1. **Run the comparison script:**

   ```
   ❯ uv run src/kl_d_bench.py \
       --baseline-logits baseline_logits.hdf5 \
       --target-model /path/to/output_dir/my_model-q4_0.gguf \
       --dataset /path/to/dataset.txt \
       --output-file comparison_iteration_n.hdf5
   ```
   - Update `comparison_iteration_n.hdf5` to reflect the iteration number (e.g., `comparison_iteration_1.hdf5`).

2. **Analyze the KL-divergence metrics.**

   Focus on key metrics, especially the high-percentile KL-divergence values, to assess the effectiveness of the quantization with each updated I-matrix.
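Steps 7 through 9 repeat with growing sample counts, so they are natural candidates for a driver script. Here is a minimal sketch using the commands shown above; paths, model names, and the size schedule are placeholders:

```python
import subprocess

def run(cmd):
    # Abort the iteration loop if any stage fails.
    subprocess.run(cmd, check=True)

# Placeholder schedule; extend it until the metrics plateau.
for i, n in enumerate([50_000, 100_000, 200_000], start=1):
    imatrix = f"imatrix_en_{n // 1000}k.bin"

    # Step 7: build a larger I-matrix dataset.
    run(["uv", "run", "src/imatrix_dataset.py",
         "--datasource_plugin", "src/imatrix_dataset/hf_dataset.py",
         "--plugin_class", "HFDatasetPlugin",
         "--langs", "en",
         "--num_samples", str(n),
         "--output", imatrix])

    # Step 8: quantize with the new I-matrix.
    run(["uv", "run", "src/quantize.py", "quantize",
         "--model-name", "my_model",
         "--base-model", "/path/to/baseline_model.gguf",
         "--quantizations", "q4_0",
         "--imatrix-path", imatrix,
         "--output-dir", "/path/to/output_dir"])

    # Step 9: compare against the stored baseline logits.
    run(["uv", "run", "src/kl_d_bench.py",
         "--baseline-logits", "baseline_logits.hdf5",
         "--target-model", "/path/to/output_dir/my_model-q4_0.gguf",
         "--dataset", "/path/to/dataset.txt",
         "--output-file", f"comparison_iteration_{i}.hdf5"])
```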
### 10. Evaluate Metrics to Determine When to Stop

Monitor the KL-divergence metrics across iterations, paying special attention to the 90th, 95th, and 99th percentiles. When successive iterations show only marginal improvements (diminishing returns), you can consider the I-matrix sufficiently refined for your application.

To balance overall performance with outlier minimization, we suggest using a composite metric that combines the median KL-divergence and the higher-percentile values.
Suggested composite metric:

$$\text{Score} = \text{Median}^{1/3} \cdot \left( KLD_{99} + 4\,KLD_{95} + 5\,KLD_{90} \right)^{2/3}$$
Explanation:

- **Median KL-divergence** ($\text{Median}$): represents typical performance across the dataset.
- **High-percentile KL-divergence** ($KLD_{90}$, $KLD_{95}$, $KLD_{99}$): capture the worst-case divergences, indicating how well the model handles outlier cases.
- **Weighting factors**: the weights (1 for $KLD_{99}$, 4 for $KLD_{95}$, 5 for $KLD_{90}$) emphasize reducing higher divergences, with greater weight on the percentiles covering more data.
- **Exponents**: the exponents (1/3 for the median, 2/3 for the weighted sum) balance the influence of typical performance and outlier cases in the overall score.
By minimizing this composite score, you ensure that the quantized model maintains strong overall performance while mitigating significant divergences in less common scenarios.
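As a sketch, the score can be computed directly from the statistics collected in step 6 (the numbers in the usage example are made up):

```python
def composite_score(median: float, kld90: float, kld95: float, kld99: float) -> float:
    """Composite KL-divergence score from the formula above; lower is better."""
    weighted_tail = kld99 + 4 * kld95 + 5 * kld90
    return median ** (1 / 3) * weighted_tail ** (2 / 3)

# Hypothetical values for two iterations; keep the lower-scoring I-matrix.
score_1 = composite_score(median=0.012, kld90=0.045, kld95=0.070, kld99=0.150)
score_2 = composite_score(median=0.011, kld90=0.044, kld95=0.068, kld99=0.162)
print("iteration 2 improves on iteration 1" if score_2 < score_1 else "no improvement")
```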
Selecting an appropriate dataset size and coverage is crucial for effective I-matrix calibration. We recommend:
- Starting Small: Use a representative subset that includes key languages or domains relevant to your application.
- Gradual Expansion: Increase the dataset size in each iteration to include more diversity and complexity.
- Balancing Diversity and Size: Ensure that the dataset remains manageable while covering the necessary range of tokens and contexts.
For detailed insights into dataset selection and initial sizing, refer to `on_quantization.md`. This document provides guidance on balancing dataset diversity and size to optimize the I-matrix calibration process.
### Optimizing Batch Sizes with `best_bub.py`

Before generating logits or quantizing large models, you may want to optimize the batch (`--batch`) and micro-batch (`--ubatch`) sizes to maximize performance given your hardware constraints.

Example command:
```
❯ uv run src/best_bub.py --model /path/to/baseline_model.gguf --context-size 2048
```
- Adjust `--context-size` to match your model's maximum context size.
- The script will suggest optimal `--batch-size` and `--ubatch-size` settings.
### Measuring Perplexity with `quantize.py`

After quantization, you can measure the perplexity of your quantized model to assess its performance.

Example command:
```
❯ uv run src/quantize.py perplexity \
    --model-name my_model \
    --base-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/perplexity_dataset.txt
```
- Replace `/path/to/perplexity_dataset.txt` with a dataset suitable for perplexity measurement.
As mentioned earlier, the `imatrix_dataset.py` script uses a plugin system to support various data sources flexibly. Here's how to use it:
### Selecting a Data Source Plugin

Choose an existing plugin or create a new one, depending on your data source.

- **Existing plugins**: `oscar_plugin.py` is an example Hugging Face dataset plugin, for the OSCAR corpus.
- **Custom plugins**: create a plugin tailored to your data source.
### Specifying the Plugin in the Command

When running `imatrix_dataset.py`, use the following arguments to specify the plugin:

- `--datasource_plugin`: path to the plugin script.
- `--plugin_class`: the class name of the plugin within the script.

Example command with an existing plugin:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en es de \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
### Creating a Custom Data Source Plugin

If your data source isn't covered by an existing plugin, you can create your own.

Steps to create a custom plugin:

1. **Create a new Python file.** Save it with a descriptive name, e.g., `my_custom_plugin.py`.

2. **Import the base class:**

   ```python
   from plugin_base import DataSourcePluginBase
   ```

3. **Define your plugin class**, inheriting from `DataSourcePluginBase`:

   ```python
   class MyCustomPlugin(DataSourcePluginBase):
       def __init__(self, name="my_dataset", **kwargs):
           super().__init__(name, **kwargs)
           self.schema = {'content': 'path.to.text.field'}
   ```

4. **Implement the `load_data` method:**

   ```python
   def load_data(self, lang, num_samples=200, skip_samples=0):
       # Implement your data loading logic here
       data_samples = []
       # Fetch data samples based on lang, num_samples, skip_samples
       return data_samples
   ```

5. **Optionally override the `get_content` method.** If your data records have a different structure, override this method to extract the text content:

   ```python
   def get_content(self, record):
       # Extract and return the text content from the record
       return record['desired_text_field']
   ```

Using your custom plugin:
```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin /path/to/my_custom_plugin.py \
    --plugin_class MyCustomPlugin \
    --langs en \
    --num_samples 100000 \
    --output /path/to/dataset.txt
```
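Putting these pieces together, here is a minimal, self-contained sketch of a complete plugin that serves newline-delimited text from a local file. The file path, record layout, and any behaviour of `DataSourcePluginBase` beyond the interface described above are assumptions:

```python
from plugin_base import DataSourcePluginBase

class LocalTextPlugin(DataSourcePluginBase):
    """Hypothetical plugin that serves lines from a local text file."""

    def __init__(self, name="local_text", path="/path/to/corpus.txt", **kwargs):
        super().__init__(name, **kwargs)
        self.path = path  # assumed constructor parameter, not part of the base class
        self.schema = {"content": "text"}

    def load_data(self, lang, num_samples=200, skip_samples=0):
        # A real plugin would filter by `lang`; a flat local file has no
        # language metadata, so the argument is ignored here.
        records = []
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i < skip_samples:
                    continue
                if len(records) >= num_samples:
                    break
                records.append({"text": line.strip()})
        return records

    def get_content(self, record):
        # Each record stores its text under the "text" key used above.
        return record["text"]
```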
### Understanding the Plugin Base Class

The `DataSourcePluginBase` class defines the interface that all plugins must implement:

- **Initialization**: set up any necessary configurations or parameters.
- **`load_data` method**: must return a list of data records for the specified language and sample counts.
- **`get_content` method**: extracts the textual content from a data record; used when building the combined dataset.
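For orientation, the base class likely looks something like the sketch below; this is an illustration of the interface just described, not the actual contents of `plugin_base.py`:

```python
from abc import ABC, abstractmethod

class DataSourcePluginBase(ABC):
    """Illustrative sketch of the plugin interface; details may differ
    from the real base class in plugin_base.py."""

    def __init__(self, name, **kwargs):
        self.name = name
        self.schema = {}  # subclasses map a 'content' key to their text field

    @abstractmethod
    def load_data(self, lang, num_samples=200, skip_samples=0):
        """Return a list of data records for the given language and counts."""

    def get_content(self, record):
        # Default extraction; subclasses override for other record layouts.
        return record["content"]
```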
### Example: Using the OSCAR Plugin

The OSCAR dataset is a large multilingual corpus. Here's how to use `oscar_plugin.py`:

```
❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en fr es \
    --num_samples 50000 \
    --output /path/to/dataset.txt
```
- This command generates a dataset using 50,000 samples from each of the specified languages.
This set of tools is designed to simplify each stage of model quantization, from setting up datasets and generating I-matrices to quantizing models and evaluating performance. By following these steps, you can make targeted, data-driven adjustments at every stage, helping you achieve quantization results that preserve model quality while accommodating diverse data requirements.