This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Refactor Compression Doc #3371

Merged
merged 23 commits into from Feb 25, 2021
Changes from 16 commits
52 changes: 47 additions & 5 deletions docs/en_US/Compression/CompressionReference.rst
@@ -1,16 +1,58 @@
Python API Reference of Compression Utilities
=============================================
Compression Reference
=====================

Contributor: Model Compression API Reference

Contributor (Author): changed


.. contents::

Sensitivity Utilities
Compressors
-----------

Compressor
^^^^^^^^^^
QuanluZhang marked this conversation as resolved.

.. autoclass:: nni.compression.pytorch.compressor.Compressor
:members:


.. autoclass:: nni.compression.pytorch.compressor.Pruner
:members:

.. autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.OneshotPruner
:members:

.. autoclass:: nni.compression.pytorch.compressor.Quantizer
:members:


Module Wrapper
^^^^^^^^^^^^^^

.. autoclass:: nni.compression.pytorch.compressor.PrunerModuleWrapper
:members:


.. autoclass:: nni.compression.pytorch.compressor.QuantizerModuleWrapper
:members:

Weight Masker
^^^^^^^^^^^^^
.. autoclass:: nni.algorithms.compression.pytorch.pruning.weight_masker.WeightMasker
:members:

.. autoclass:: nni.algorithms.compression.pytorch.pruning.structured_pruning.StructuredWeightMasker
:members:


Compression Utilities
---------------------

Sensitivity Utilities
^^^^^^^^^^^^^^^^^^^^^

.. autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
:members:

Topology Utilities
------------------
^^^^^^^^^^^^^^^^^^

.. autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
:members:
@@ -28,6 +70,6 @@ Topology Utilities
:members:

Model FLOPs/Parameters Counter
------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
12 changes: 4 additions & 8 deletions docs/en_US/Compression/Overview.rst
@@ -87,11 +87,6 @@ Quantization algorithms compress the original network by reducing the number of
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__


Automatic Model Compression
---------------------------

Given a targeted compression ratio, it is hard to obtain the best compression result in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. The detailed document can be found `here <./AutoPruningUsingTuners.rst>`__.

Model Speedup
-------------

@@ -102,10 +97,11 @@ Compression Utilities

Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, and can easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
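
For example, below is a minimal sketch of the FLOPs/parameters counter (assuming ``model`` is a ``torch.nn.Module`` and an input shape of ``(1, 3, 224, 224)``; the exact return values may vary slightly between NNI versions):

.. code-block:: python

   from nni.compression.pytorch.utils.counter import count_flops_params

   # count FLOPs and parameter size for a given dummy input shape
   flops, params = count_flops_params(model, (1, 3, 224, 224))
   print(f'FLOPs: {flops}, #Params: {params}')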

Customize Your Own Compression Algorithms
-----------------------------------------
Advanced Usage
--------------

NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. Users can learn more about our compression framework and customize a new compression algorithm (pruning algorithm or quantization algorithm) based on it. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Please refer to `here <./advanced.rst>`__ for more details.

NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found `here <./Framework.rst>`__.

Reference and Feedback
----------------------
252 changes: 59 additions & 193 deletions docs/en_US/Compression/QuickStart.rst

Large diffs are not rendered by default.

190 changes: 190 additions & 0 deletions docs/en_US/Compression/Tutorial.rst
@@ -0,0 +1,190 @@
Tutorial
========

.. contents::

In this tutorial, we explain the usage of model compression in NNI in more detail.

Setup compression goal
----------------------

Specify the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^

Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, specify different ratios for different types of operations, exclude certain types of operations, or compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.

The ``dict``\ s in the ``list`` are applied one by one; that is, the configurations in a latter ``dict`` will overwrite the configurations in former ones for the operations that are within the scope of both of them.

There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:

* **op_types**\ : This specifies which types of operations to compress. 'default' means following the algorithm's default setting.
* **op_names**\ : This specifies which operations to compress, by name. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.

Some other keys are often specific to a certain algorithm; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.

Contributor: a certain algorithms -> a certain algorithm?

Contributor (Author): fix it


A simple example of configuration is shown below:

.. code-block:: python

[
{
'sparsity': 0.8,
'op_types': ['default']
},
{
'sparsity': 0.6,
'op_names': ['op_name1', 'op_name2']
},
{
'exclude': True,
'op_names': ['op_name3']
}
]

It means: follow the algorithm's default setting for compressed operations with sparsity 0.8, but use sparsity 0.6 for ``op_name1`` and ``op_name2``, and do not compress ``op_name3``.
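
Below is a minimal sketch of how such a ``config_list`` could be passed to a pruner (``LevelPruner`` is used here purely for illustration; ``model`` is assumed to be an existing ``torch.nn.Module`` and the operation names are placeholders):

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   config_list = [
       {'sparsity': 0.8, 'op_types': ['default']},
       {'sparsity': 0.6, 'op_names': ['op_name1', 'op_name2']},
       {'exclude': True, 'op_names': ['op_name3']}
   ]

   # wrap the model; the pruner applies the masks produced from this configuration
   pruner = LevelPruner(model, config_list)
   model = pruner.compress()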

Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides the keys explained above, if you use quantization algorithms, you need to specify more keys in ``config_list``\ , which are explained below.

* **quant_types** : list of string.

  Types of quantization you want to apply; currently 'weight', 'input', and 'output' are supported. 'weight' means applying the quantization operation
  to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.


* **quant_bits** : int or dict of {str : int}

  Bit width of quantization; the key is the quantization type and the value is the quantization bit width, e.g.

.. code-block:: python

   {
       'quant_bits': {
           'weight': 8,
           'output': 4,
       },
   }

When the value is of int type, all quantization types share the same bit width, e.g.

.. code-block:: python

   {
       'quant_bits': 8,  # both weight and output quantization use 8 bits
   }

The following example shows a more complete ``config_list``\ ; it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.

.. code-block:: python

config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_names': ['conv1']
}, {
'quant_types': ['weight'],
'quant_bits': 4,
'quant_start_step': 0,
'op_names': ['conv2']
}, {
'quant_types': ['weight'],
'quant_bits': 3,
'op_names': ['fc1']
},
{
'quant_types': ['weight'],
'quant_bits': 2,
'op_names': ['fc2']
}
]

In this example, ``op_names`` specifies the names of the layers, and the four layers will be quantized with different ``quant_bits``.
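
As a sketch of how this configuration could be applied (assuming ``model`` and ``optimizer`` are already defined, and using the QAT quantizer purely as an example):

.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   # wrap the model so that quantization is simulated during fine-tuning
   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()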


Export compression result
-------------------------

Export the pruned model
^^^^^^^^^^^^^^^^^^^^^^^

Contributor: pruend?

If you are pruning your model, you can easily export the pruned model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. Note that the exported ``model.pth`` has the same parameters as the original model, except that the masked weights are zero. ``mask_dict`` stores the binary masks produced by the pruning algorithm, which can be further used to speed up the model.

.. code-block:: python

# export model weights and mask
pruner.export_model(model_path='model.pth', mask_path='mask.pth')

# apply mask to model
from nni.compression.pytorch import apply_compression_results

apply_compression_results(model, mask_file, device)


Export the model in ``onnx`` format (``input_shape`` needs to be specified):

.. code-block:: python

pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])


Export the quantized model
^^^^^^^^^^^^^^^^^^^^^^^^^^

You can export the quantized model directly by using the ``torch.save`` API, and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving and loading a quantized model and getting related parameters in QAT.

.. code-block:: python

# Save quantized model which is generated by using NNI QAT algorithm
torch.save(model.state_dict(), "quantized_model.pkt")

Contributor: just curious, what does .pkt mean, why not use .pth?

Contributor: .pkt means 'Packet Tracer Network Simulation Model'. The reason I use '.pkt' here is that I think the model generated from QAT is a simulated model, not the real model. Of course we can use .pth here.

Contributor: .pth is better

Contributor (Author): fix it

# Simulate model loading procedure
# Have to init new model and compress it before loading
qmodel_load = Mnist()
optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
quantizer = QAT_Quantizer(qmodel_load, config_list, optimizer)
quantizer.compress()

# Load quantized model
qmodel_load.load_state_dict(torch.load("quantized_model.pkt"))

# Get scale, zero_point and weight of conv1 in loaded model
conv1 = qmodel_load.conv1
scale = conv1.module.scale
zero_point = conv1.module.zero_point
weight = conv1.module.weight


Speed up the model
------------------

Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After speedup, your model becomes a smaller one with shorter inference latency.

.. code-block:: python

from nni.compression.pytorch import apply_compression_results, ModelSpeedup

dummy_input = torch.randn(config['input_shape']).to(device)
m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
m_speedup.speedup_model()


Please refer to `here <ModelSpeedup.rst>`__ for a detailed description. The example code for model speedup can be found :githublink:`here <examples/model_compress/pruning/model_speedup.py>`.


Control the Fine-tuning process
-------------------------------

APIs to control the fine-tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some compression algorithms control the progress of compression during fine-tuning (e.g., `AGP <../Compression/Pruner.rst#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide two more APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.

``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
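
A minimal sketch of where the two calls fit in a typical PyTorch training loop (``pruner``, ``model``, ``optimizer``, ``criterion``, ``train_loader``, and ``num_epochs`` are assumed to be defined already):

.. code-block:: python

   for epoch in range(num_epochs):
       # tell the pruner which epoch we are in (e.g. for AGP's sparsity schedule)
       pruner.update_epoch(epoch)
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = criterion(model(data), target)
           loss.backward()
           optimizer.step()
           # let the pruner do its per-minibatch work, if any
           pruner.step()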

Enhance the fine-tuning process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process by utilizing knowledge distillation to improve the performance of the compressed model. Example code can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`.
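
A minimal, self-contained sketch of distillation-enhanced fine-tuning is shown below. This is an illustration rather than the linked example verbatim; the teacher/student models, optimizer, and data loader are assumed to be defined by the user.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def train_with_distillation(student, teacher, optimizer, train_loader, T=4.0, alpha=0.9):
       """One epoch of fine-tuning the (compressed) student with a teacher's soft targets."""
       student.train()
       teacher.eval()
       for data, target in train_loader:
           optimizer.zero_grad()
           student_logits = student(data)
           with torch.no_grad():
               teacher_logits = teacher(data)
           # soften both distributions with temperature T and match them via KL divergence
           distill_loss = F.kl_div(
               F.log_softmax(student_logits / T, dim=1),
               F.softmax(teacher_logits / T, dim=1),
               reduction='batchmean') * (T * T)
           hard_loss = F.cross_entropy(student_logits, target)
           loss = alpha * distill_loss + (1 - alpha) * hard_loss
           loss.backward()
           optimizer.step()
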
9 changes: 9 additions & 0 deletions docs/en_US/Compression/advanced.rst
@@ -0,0 +1,9 @@
Advanced Usage
==============

.. toctree::
:maxdepth: 2

Framework <./Framework>
Customize a new algorithm <./CustomizeCompressor>
Automatic Model Compression <./AutoPruningUsingTuners>
4 changes: 2 additions & 2 deletions docs/en_US/model_compression.rst
@@ -28,5 +28,5 @@ For details, please refer to the following tutorials:
Pruning <Compression/pruning>
Quantization <Compression/quantization>
Utilities <Compression/CompressionUtils>
Framework <Compression/Framework>
Customize Model Compression Algorithms <Compression/CustomizeCompressor>
Advanced Usage <Compression/advanced>
API Reference <Compression/CompressionReference>
2 changes: 1 addition & 1 deletion docs/en_US/sdk_reference.rst
@@ -8,4 +8,4 @@ Python API Reference

Auto Tune <autotune_ref>
NAS <NAS/NasReference>
Compression Utilities <Compression/CompressionReference>
Compression <Compression/CompressionReference>
@@ -42,6 +42,7 @@ def __init__(self, model, pruner, preserve_round=1, dependency_aware=False):
def calc_mask(self, sparsity, wrapper, wrapper_idx=None, **depen_kwargs):
"""
calculate the mask for `wrapper`.

Parameters
----------
sparsity: float/list of float
@@ -292,6 +293,7 @@ def _dependency_calc_mask(self, sparsities, wrappers, wrappers_idx, channel_dset
def get_mask(self, base_mask, weight, num_prune, wrapper, wrapper_idx, channel_masks=None):
"""
Calculate the mask of given layer.

Parameters
----------
base_mask: dict
@@ -309,6 +311,7 @@ def get_mask(self, base_mask, weight, num_prune, wrapper, wrapper_idx, channel_m
mode, before calculating the masks for each layer, we will calculate a common
mask for all the layers in the dependency set. For the pruners that do not
support dependency-aware mode, they can just ignore this parameter.

Returns
-------
dict
4 changes: 2 additions & 2 deletions nni/compression/pytorch/compressor.py
@@ -422,8 +422,8 @@ def load_model_state_dict(self, model_state):
"""
Load the state dict saved from unwrapped model.

Parameters:
-----------
Parameters
----------
model_state : dict
state dict saved from unwrapped model
"""