
Frequently Asked Questions (FAQ)


On this page we answer frequently asked questions.

Table of Contents

- Pruning and Sparsity
- Quantization
- Miscellaneous

Pruning and Sparsity

Q1: I pruned my model using an element-wise pruner (also known as fine-grained pruning) and the size of the weight tensors is not reduced. What's going on?

A1: There are different types of sparsity patterns, and element-wise sparsity is the simplest case. When you perform fine-grained pruning, you produce tensors that are sparse at the element granularity. The weight tensors are not reduced in size because the zero coefficients are still present in the tensor. Some NN accelerators (ASICs) take advantage of fine-grained sparsity by using a compressed representation of sparse tensors. An Investigation of Sparse Tensor Formats for Tensor Libraries provides a review of some of these representations, such as the Compressed Sparse Row (CSR) format. When sparse weight tensors are stored in memory in such a compact format, the bandwidth and power required to fetch them into the neural processing unit are reduced. Once the compact tensor is read (in full or in part), it can be converted back to a dense format to perform the neural compute operation. A further acceleration is achieved if the hardware can instead perform the compute operation directly on the tensor in its compact representation.
The diagram below compares fine-grained sparsity to other sparsity patterns (source: Exploring the Regularity of Sparse Structure in Convolutional Neural Networks, which explores the different types of sparsity patterns).

(Diagram: sparsity patterns, from fine-grained to structured)
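To make the storage point concrete, here is a minimal sketch, not part of Distiller and assuming a recent PyTorch build with sparse CSR support, showing that an element-wise-pruned tensor keeps its dense size until it is converted to a compressed format:

```python
import torch

# A minimal sketch (not part of Distiller): after fine-grained pruning, the
# weight tensor is still stored densely, zeros included.
weight = torch.randn(64, 128)
weight[weight.abs() < 0.5] = 0.0                 # emulate element-wise pruning
dense_bytes = weight.nelement() * weight.element_size()
print(weight.shape, dense_bytes)                 # shape and size are unchanged

# Recent PyTorch versions can convert the tensor to a compressed (CSR)
# representation, which stores only the non-zero values plus index metadata.
weight_csr = weight.to_sparse_csr()
print(weight_csr.values().nelement())            # only the non-zeros are stored
```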


Q2: I pruned my model using an element-wise pruner and I don't see an improvement in the run-time. What's going on?

A2: The answer to the question above explains why specialized hardware is needed to see a performance gain from fine-grained weight sparsity. Currently, the PyTorch software stack does not support sparse tensor representations in the main NN operations (e.g. convolution and GEMM), so even with the best hardware you will only see a performance boost by exporting the PyTorch model to ONNX and executing the ONNX model on hardware with support for sparse representations.


Q3: I pruned my model using a block-structured pruner and I don't see an improvement in the run-time. What's going on?

A3: Block pruning refers to pruning 4-D structures of a specific shape. This is similar to filter/channel pruning, but it allows for non-regular shapes that accelerate inference on a specific hardware platform. If we want to introduce sparsity in order to reduce the compute load of a certain layer, we need to understand how the HW and SW perform the layer's operation and what vector shape is used. Then we can induce sparsity to match the vector shape. For example, Intel AVX-512 provides SIMD instructions that apply the same instruction (Single Instruction) to a vector of inputs (Multiple Data). The following single instruction performs an element-wise multiplication of two vectors of sixteen 32-bit elements each:

__m512i result = _mm512_mullo_epi32(vec_a, vec_b);

vec_a and vec_b may represent activations and weights, respectively. If either vec_a or vec_b is only partially sparse, we still need to perform the multiplication, and the sparsity does not reduce the compute latency. However, if either vec_a or vec_b contains only zeros, then we can eliminate the instruction entirely. In this case, we would like the sparsity to appear in blocks of 16 elements, matching the vector width. Things are a bit more complicated because we also need to understand how the software maps layer operations to hardware. For example, a 3x3 convolution can be computed as a direct convolution, as a matrix multiplication (GEMM), or as a Winograd matrix operation (to name a few ways of computation). These low-level operations are then mapped to SIMD instructions. Finally, the low-level SW needs to support a block-sparse storage format for the weight tensors, as explained in one of the answers above (see for example: http://www.netlib.org/linalg/html_templates/node90.html). The model is exported to ONNX for execution on a deployment HW-SW platform that can recognize the sparsity patterns embedded in the weight tensors and convert the tensors to their compact storage format.

In summary, different hardware platforms benefit from different sparsity patterns.
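As an illustration of matching the sparsity pattern to the vector width, here is a hedged sketch; it is not Distiller's pruner API, and the block size and the L1 ranking criterion are illustrative assumptions:

```python
import torch

# Hedged sketch: zero whole 16-element blocks of a 2-D weight tensor so that
# the sparsity matches the SIMD vector width discussed above.
BLOCK = 16

def prune_blocks(weight: torch.Tensor, fraction: float) -> torch.Tensor:
    out_f, in_f = weight.shape                    # in_f must be divisible by BLOCK
    blocks = weight.reshape(out_f, in_f // BLOCK, BLOCK)
    scores = blocks.abs().sum(dim=-1)             # L1 norm of each block
    k = max(1, int(fraction * scores.numel()))    # number of blocks to remove
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).unsqueeze(-1).to(weight.dtype)
    return (blocks * mask).reshape(out_f, in_f)

w = torch.randn(64, 128)
w_pruned = prune_blocks(w, fraction=0.5)          # ~50% of the 16-element blocks zeroed
```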


Q4: I pruned my model using a channel/filter pruner and the weight tensors are sparse, but their shapes and sizes are the same as before pruning. What's going on?

A4: To change the shape of the weight tensors after pruning channels/filters, you need to use 'thinning'. See an example here, which defines and uses a FilterRemover to remove zeroed filters from a model:

extensions:
  net_thinner:
      class: 'FilterRemover'               # thinning extension that removes zeroed filters
      thinning_func_str: remove_filters    # thinning function to apply
      arch: 'resnet20_cifar'               # model architecture
      dataset: 'cifar10'                   # dataset name

Quantization

Q1: I quantized my model, but it is not running faster than the FP32 model and the size of the parameter tensors is not reduced. What's going on?

A1: As currently implemented, Distiller only simulates post-training quantization. This allows us to study the effect of quantization on accuracy, but it does not, unfortunately, provide insights into the actual runtime of the quantized model. We did not implement low-level specialized operations that utilize the 8-bit capabilities of the CPU/GPU. Specifically, in the way post-training quantization is implemented, supported layers are wrapped with quantize/de-quantize operations, but the layers themselves are unchanged: they still operate on FP32 tensors, only these tensors are restricted to contain integer values. So, as you can see, we are only adding operations in order to simulate quantization, and it isn't surprising at all that you're getting slower runtime when quantizing.

We do not have plans to implement "native" low-precision operations within Distiller. We do, however, plan to support exporting quantized models using ONNX, once quantization is published as part of the standard. Then you'll be able to export a model quantized in Distiller and run it on a framework that supports actual 8-bit execution on CPU/GPU. In addition, 8-bit support in PyTorch itself is also in the works (it's already implemented in Caffe2). Once that's released, we'll see how we can integrate Distiller with it.
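To illustrate why simulation adds work rather than removing it, here is a minimal sketch of a fake-quantization wrapper. The class name and the simple symmetric scheme are assumptions for illustration only, not Distiller's implementation:

```python
import torch
import torch.nn as nn

# A minimal sketch of simulated ("fake") quantization. Extra quantize/de-quantize
# ops are added around the layer, while the layer itself still runs in FP32.
class FakeQuantWrapper(nn.Module):
    def __init__(self, wrapped: nn.Module, num_bits: int = 8):
        super().__init__()
        self.wrapped = wrapped
        self.num_bits = num_bits

    def _quant_dequant(self, x: torch.Tensor) -> torch.Tensor:
        # Map values to the integer grid and back; the result is still an FP32
        # tensor, only its values are restricted to the quantization levels.
        qmax = 2 ** (self.num_bits - 1) - 1          # e.g. 127 for 8 bits (symmetric)
        scale = x.abs().max() / qmax + 1e-8
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    def forward(self, x):
        x = self._quant_dequant(x)     # extra ops are added before the layer...
        y = self.wrapped(x)            # ...which still computes in FP32...
        return self._quant_dequant(y)  # ...and more ops are added after it

layer = FakeQuantWrapper(nn.Conv2d(3, 16, kernel_size=3))
out = layer(torch.randn(1, 3, 32, 32))
```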


Q2: It looks like some of the operations in my model were not quantized. What's going on?

A2: The model might need to be modified to make sure all of the quantizable operations are visible to Distiller. Please follow the steps outlined in the Distiller docs here.
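One of the typical modifications described there is replacing direct tensor operations inside forward() (e.g. a bare + or a torch.nn.functional call) with dedicated modules, so the quantizer can see and wrap them. A hedged sketch, assuming distiller.modules.EltwiseAdd is available as in recent Distiller versions:

```python
import torch
import torch.nn as nn
from distiller.modules import EltwiseAdd   # assumption: exported by distiller.modules

# Hedged sketch: make a residual add visible to Distiller by using a module
# instead of a bare `+` inside forward().
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.add = EltwiseAdd()            # replaces `out + identity`

    def forward(self, x):
        out = self.relu(self.conv(x))
        return self.add(out, x)            # now a leaf module the quantizer can wrap

block = ResidualBlock(16)
y = block(torch.randn(1, 16, 8, 8))
```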


Q3: Can I use Distiller to quantize a self-defined model, for example a MobileNet-v2 with a Deconv op or an upsample operation? How would I add a custom layer?

A3: We quantize a model by taking each leaf module we want to quantize, creating a wrapper around it, and replacing the leaf module with that wrapper.
For example, to quantize a model containing nn.Conv2d modules, we would create a wrapper class such as QuantizedConv2d and replace each original nn.Conv2d object with a QuantizedConv2d wrapper object.
Fortunately, we have already implemented this for most modules, as you can see in the implementation of PostTrainLinearQuantizer. There you can see that the quantizer holds a dictionary, replacement_factory, which maps a module type (the key) to a function (which takes the module itself, its name and several configuration arguments) that creates a wrapper around the module and replaces it in the original model. So quantizing nn.ConvTranspose2d would usually be done by adding

quantizer.replacement_factory[nn.ConvTranspose2d] = replace_param_layer

However, the RangeLinearQuantParamLayerWrapper class doesn't accept an nn.ConvTranspose2d layer yet (it will raise a ValueError). A quick workaround is to add nn.ConvTranspose2d, in your copy of the code, to the layer types that RangeLinearQuantParamLayerWrapper accepts; since a deconvolution is a linear op, you should be fine.

If you want to quantize using a YAML configuration, we suggest writing a custom quantizer class in the same way described above.

For a more elaborate discussion of preparing a model for quantization, see our documentation.
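To make the wrap-and-replace mechanism concrete, here is a Distiller-free sketch of the idea. The names QuantWrapper and replace_leaf_modules are illustrative assumptions; Distiller's quantizers implement this traversal for you:

```python
import torch
import torch.nn as nn

# A Distiller-free sketch of the wrap-and-replace mechanism described above.
class QuantWrapper(nn.Module):
    def __init__(self, wrapped: nn.Module):
        super().__init__()
        self.wrapped = wrapped

    def forward(self, x):
        # A real wrapper would quantize the inputs and parameters here.
        return self.wrapped(x)

def replace_leaf_modules(model: nn.Module, factory: dict):
    # Walk the module tree; replace each leaf whose type appears in the factory.
    for name, child in model.named_children():
        maker = factory.get(type(child))
        if maker is not None:
            setattr(model, name, maker(child))
        else:
            replace_leaf_modules(child, factory)

factory = {nn.Conv2d: QuantWrapper, nn.ConvTranspose2d: QuantWrapper}
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.ConvTranspose2d(8, 3, 3))
replace_leaf_modules(model, factory)
print(model)   # Conv2d and ConvTranspose2d are now wrapped
```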


Q4: Can you explain how Quantization-Aware Training (QAT) works?

A4: Quantization-Aware Training (QAT) in Distiller is described in several places:

The example classifier compression sample application uses YAML files as described in the documentation above. You need to learn how the quantization process is embedded in the training loop and how the quantizer is configured from the YAML. This is explained in the links above, which contain a lot of information. We currently do not have an example of QAT using the direct API (i.e. without the YAML).

Miscellaneous

Q1: How does SummaryGraph work?

A1: summary_graph.py uses the PyTorch JIT tracer and ONNX export functionality to convert the model to an ONNX IR representation of the graph. This representation is then queried to learn the details of the computation graph. We use ONNX graphs because we found that they represent the major computation blocks, and we don't care about many of the operation details that are present in a PyTorch JIT trace (e.g. padding). The limitation of using the ONNX IR is that not all PyTorch models can be exported to ONNX; see for example this PyTorch PR to export Mask R-CNN to ONNX. Another issue that can come up when using summary_graph.py on arbitrary graphs is that Distiller currently only supports memory and compute accounting for a small number of compute operations (e.g. convolutions and linear layers), so the code may break.
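A hedged usage sketch follows; the construction pattern mirrors how SummaryGraph is used inside Distiller, but the attribute queried at the end is an assumption, so check summary_graph.py in your Distiller version for the exact query API:

```python
import torch
import torchvision.models as models
import distiller

# Hedged sketch: build a SummaryGraph by tracing the model with a dummy input.
# Internally the model is exported to an ONNX IR representation, which is then queried.
model = models.resnet18()
dummy_input = torch.randn(1, 3, 224, 224)
sg = distiller.SummaryGraph(model, dummy_input)
print(len(sg.ops))   # assumption: `ops` holds the operations discovered in the graph
```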