Refactor Compression Doc #3371

Tutorial
========

.. contents::

In this tutorial, we explain the usage of model compression in NNI in more detail.

Setup compression goal
----------------------

Specify the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^

Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, specify different ratios for different types of operations, exclude certain types of operations, or compress only certain types of operations. The configuration specification is defined for expressing these kinds of requirements. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.

The ``dict``\ s in the ``list`` are applied one by one, that is, the configurations in a latter ``dict`` will overwrite the configurations in former ones for the operations that are within the scope of both of them.

There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:

* **op_types**\ : This specifies which types of operations are to be compressed. 'default' means following the algorithm's default setting.
* **op_names**\ : This specifies by name which operations are to be compressed. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.

Some other keys are often specific to a certain algorithm; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.

A simple example of configuration is shown below:

.. code-block:: python

   [
       {
           'sparsity': 0.8,
           'op_types': ['default']
       },
       {
           'sparsity': 0.6,
           'op_names': ['op_name1', 'op_name2']
       },
       {
           'exclude': True,
           'op_names': ['op_name3']
       }
   ]

This configuration means: operations are compressed with sparsity 0.8 following the algorithm's default setting, except that ``op_name1`` and ``op_name2`` use sparsity 0.6 and ``op_name3`` is not compressed.
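
To show how such a ``config_list`` is consumed, here is a minimal sketch of passing it to a pruner. The ``LevelPruner`` import path and the toy model are assumptions for illustration; the exact import path may differ between NNI versions.

.. code-block:: python

   import torch
   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   # a toy model, used only for illustration
   model = torch.nn.Sequential(
       torch.nn.Linear(64, 32),
       torch.nn.ReLU(),
       torch.nn.Linear(32, 10),
   )

   config_list = [{
       'sparsity': 0.8,
       'op_types': ['default']
   }]

   # the pruner wraps the model in place; compress() inserts the masking logic
   pruner = LevelPruner(model, config_list)
   model = pruner.compress()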

Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.

* **quant_types** : list of string.

  The types of quantization you want to apply; currently 'weight', 'input', and 'output' are supported. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of a module's forward method. 'output' means applying the quantization operation to the output of a module's forward method, which is often called 'activation' in some papers.

* **quant_bits** : int or dict of {str : int}

  Bit length of quantization. The key is the quantization type and the value is the quantization bit length, e.g.,

  .. code-block:: python

     {
         'quant_bits': {
             'weight': 8,
             'output': 4,
         },
     }

  When the value is an int, all quantization types share the same bit length, e.g.,

  .. code-block:: python

     {
         'quant_bits': 8,  # both weight and output are quantized to 8 bits
     }

The following example shows a more complete ``config_list``\ ; it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.

.. code-block:: python

   config_list = [{
           'quant_types': ['weight'],
           'quant_bits': 8,
           'op_names': ['conv1']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 4,
           'quant_start_step': 0,
           'op_names': ['conv2']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 3,
           'op_names': ['fc1']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 2,
           'op_names': ['fc2']
       }]

In this example, ``op_names`` gives the names of the layers, and the four layers will be quantized with different ``quant_bits``.
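
As a quick illustration of how such a quantization ``config_list`` is used, the sketch below passes it to the ``QAT_Quantizer`` that also appears later in this tutorial. The ``Mnist`` model and its layer names (``conv1``, ``conv2``, ``fc1``, ``fc2``) are hypothetical, and the import path is an assumption that may vary between NNI versions.

.. code-block:: python

   import torch
   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   # Mnist is a hypothetical model whose submodules are named conv1, conv2, fc1 and fc2
   model = Mnist()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

   # after compress(), forward passes simulate quantization for the configured layers
   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()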

Export compression result
-------------------------

Export the pruned model
^^^^^^^^^^^^^^^^^^^^^^^

If you are pruning your model, you can easily export the pruned model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. Note that the exported ``model.pth`` has the same parameters as the original model, except that the masked weights are zero. ``mask_dict`` stores the binary masks produced by the pruning algorithm, which can be further used to speed up the model.

.. code-block:: python

   # export model weights and mask
   pruner.export_model(model_path='model.pth', mask_path='mask.pth')

   # apply mask to model
   from nni.compression.pytorch import apply_compression_results

   apply_compression_results(model, mask_file, device)

Export the model in ``onnx`` format (\ ``input_shape`` needs to be specified):

.. code-block:: python

   pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])

Export the quantized model
^^^^^^^^^^^^^^^^^^^^^^^^^^

You can export the quantized model directly by using the ``torch.save`` API, and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving and loading a quantized model and getting its related parameters in QAT.

.. code-block:: python

   # Save the quantized model, which is generated by using the NNI QAT algorithm
   torch.save(model.state_dict(), "quantized_model.pkt")

   # Simulate model loading procedure
   # Have to init new model and compress it before loading
   qmodel_load = Mnist()
   optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
   quantizer = QAT_Quantizer(qmodel_load, config_list, optimizer)
   quantizer.compress()

   # Load quantized model
   qmodel_load.load_state_dict(torch.load("quantized_model.pkt"))

   # Get scale, zero_point and weight of conv1 in loaded model
   conv1 = qmodel_load.conv1
   scale = conv1.module.scale
   zero_point = conv1.module.zero_point
   weight = conv1.module.weight

Speed up the model
------------------

Masks do not provide real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model, as shown below. After invoking the speedup API on your model, your model becomes a smaller one with shorter inference latency.

.. code-block:: python

   from nni.compression.pytorch import apply_compression_results, ModelSpeedup

   dummy_input = torch.randn(config['input_shape']).to(device)
   m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
   m_speedup.speedup_model()

Please refer to `here <ModelSpeedup.rst>`__ for a detailed description. The example code for model speedup can be found :githublink:`here <examples/model_compress/pruning/model_speedup.py>`.

Control the Fine-tuning process
-------------------------------

APIs to control the fine-tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some compression algorithms control the progress of compression during fine-tuning (e.g. `AGP <../Compression/Pruner.rst#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.

``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs; for the algorithms that do not need them, calling them is allowed but has no effect. Please refer to each algorithm's document for details. A sketch of where the two calls fit in a training loop is shown below.
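
The following is a minimal sketch only, assuming ``model``, ``criterion``, ``optimizer``, ``train_loader``, and ``num_epochs`` are defined elsewhere, and placing ``update_epoch`` at the start of each epoch.

.. code-block:: python

   for epoch in range(num_epochs):
       # tell the pruner that a new epoch has started (e.g. AGP updates its target sparsity per epoch)
       pruner.update_epoch(epoch)
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = criterion(model(data), target)
           loss.backward()
           optimizer.step()
           # let the pruner react after every minibatch (a no-op for algorithms that do not need it)
           pruner.step()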

Enhance the fine-tuning process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process with knowledge distillation to improve the performance of the compressed model; a sketch of a typical distillation loss is shown below. Example code can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`.
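
As an illustration of the idea only (not the exact code in the linked example), a common distillation loss softens the teacher's and student's logits with a temperature and mixes the resulting KL term with the ordinary task loss:

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.9):
       # soft targets from the teacher, softened by temperature T
       soft_loss = F.kl_div(
           F.log_softmax(student_logits / T, dim=1),
           F.softmax(teacher_logits / T, dim=1),
           reduction='batchmean',
       ) * (T * T)
       # ordinary cross-entropy against the ground-truth labels
       hard_loss = F.cross_entropy(student_logits, target)
       return alpha * soft_loss + (1 - alpha) * hard_loss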

Advanced Usage
==============

.. toctree::
   :maxdepth: 2

   Framework <./Framework>
   Customize a new algorithm <./CustomizeCompressor>
   Automatic Model Compression <./AutoPruningUsingTuners>