Automatic Quantization

The enot.quantization package provides functionality for automatic quantization of user models. It is best suited for preparing user models for enot-lite int8 engines.

With the enot.quantization package, you can automatically convert your PyTorch model to our intermediate representation, which allows you to perform multiple kinds of quantization, including vector quantization for TensorRT and OpenVINO.

This package features automatic distillation for weight fine-tuning, automatic quantization threshold search as described in the Fast Adjustable Threshold paper, different methods of layer selection for distillation, and a number of fake-quantization algorithms.
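
A minimal end-to-end sketch of a typical workflow is shown below: wrap a float model, calibrate quantization thresholds on a few batches, and fine-tune with distillation. The import path and the model and dataloader objects are assumptions made for illustration; the class and function names are documented below.

>>> from enot.quantization import (  # import path is an assumption
...     DefaultQuantizationDistiller,
...     TrtFakeQuantizedModel,
...     calibrate_quantized_model,
... )
>>>
>>> # `model` is your float torch.nn.Module, `dataloader` yields representative batches.
>>> fq_model = TrtFakeQuantizedModel(model).cuda()
>>> calibrate_quantized_model(fq_model, dataloader, n_steps=10)  # initialize thresholds
>>> distiller = DefaultQuantizationDistiller(fq_model, dataloader, device='cuda:0', verbose=1)
>>> distiller.distill()  # fine-tune quantization parameters via distillation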

enot-lite quantization

class TrtFakeQuantizedModel(model, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True)

Bases: FakeQuantizedModel

Quantized TensorRT model class, which uses int8 convolutions and fully-connected layers.

This class is used for quantization aware training.

__init__(model, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True)
Parameters:
  • model (nn.Module) – Model from which TrtFakeQuantizedModel will be constructed.

  • leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.

  • quantization_scheme (str) – Specifies the GPU architecture for which quantization will be optimized. Pass pascal to optimize for the Pascal GPU architecture. Pass default to optimize for architectures newer than Pascal. You can also use optimal_quantization_scheme() to select the optimal quantization scheme automatically.

  • use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: this roughly doubles memory consumption compared to the off mode. False by default.

  • use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
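
For instance, a module that should not be traced into (e.g. a custom postprocessing block) can be passed via leaf_modules. This is a sketch; MyPostprocessingBlock and the import path are assumptions.

>>> from enot.quantization import TrtFakeQuantizedModel  # import path is an assumption
>>>
>>> fq_model = TrtFakeQuantizedModel(
...     model,  # your float torch.nn.Module
...     leaf_modules=[MyPostprocessingBlock],  # hypothetical module type kept as a single leaf
...     quantization_scheme='default',
... )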

enable_calibration_mode(mode=True)

Enables or disables calibration mode.

In calibration mode, the quantized model collects input data statistics, which are later used to initialize quantization parameters.

Parameters:

mode (bool, optional) – Whether to enable calibration mode. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel

enable_quantization_mode(mode=True)

Enables or disables fake quantization.

Fake quantization mode is enabled for all quantized layers. In this mode, these layers use fake quantization nodes to produce quantized weights and activations during the forward pass.

Parameters:

mode (bool, optional) – Whether to use fake quantization. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel
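
A minimal manual calibration pass using these two switches might look as follows (a sketch; fq_model and calibration_loader are assumptions, with the loader yielding (images, labels) pairs):

>>> import torch
>>>
>>> fq_model.enable_calibration_mode(True)
>>> with torch.no_grad():
...     for images, _ in calibration_loader:
...         fq_model(images.cuda())  # collect input statistics
...
>>> fq_model.enable_calibration_mode(False)
>>> fq_model.enable_quantization_mode(True)  # fake quantization is now active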

quantization_parameters()

Returns an iterator over model quantization parameters (quantization thresholds).

Returns:

An iterator over model quantization parameters.

Return type:

iterator over torch.nn.Parameter

Notes

Weights of quantized modules (like convolution weight tensor or linear layer weight matrix) are not quantization parameters.

regular_parameters()

Returns an iterator over model parameters excluding quantization parameters.

Returns:

An iterator over regular model parameters.

Return type:

iterator over torch.nn.Parameter
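
These two iterators make it easy to give quantization thresholds and regular weights different learning rates during fine-tuning (a sketch; the learning rates are arbitrary):

>>> import torch
>>>
>>> optimizer = torch.optim.Adam(
...     [
...         {'params': fq_model.quantization_parameters(), 'lr': 5e-3},
...         {'params': fq_model.regular_parameters(), 'lr': 5e-5},
...     ]
... )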

class OpenvinoFakeQuantizedModel(model, leaf_modules=None, apply_avx2_fix=True, use_weight_scale_factors=False, use_bias_scale_factors=True, **kwargs)

Bases: FakeQuantizedModel

Quantized OpenVINO model class, which uses int8 convolutions and fully-connected layers.

This class is used for quantization aware training.

__init__(model, leaf_modules=None, apply_avx2_fix=True, use_weight_scale_factors=False, use_bias_scale_factors=True, **kwargs)
Parameters:
  • model (nn.Module) – Model from which OpenvinoFakeQuantizedModel will be constructed.

  • leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.

  • apply_avx2_fix (bool) – Whether to fix quantization parameters for AVX2 kernels, or to skip the fix and maximize the metric for AVX512 kernels. Without the fix we cannot guarantee stable results, because OpenVINO can mix AVX512 and AVX2 kernels on a host with AVX512 instructions. True by default. Please do not change this option unless you know what you are doing.

  • use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: this roughly doubles memory consumption compared to the off mode. False by default.

  • use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
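
Construction mirrors TrtFakeQuantizedModel (a sketch; the import path and model are assumptions):

>>> from enot.quantization import OpenvinoFakeQuantizedModel  # import path is an assumption
>>>
>>> fq_model = OpenvinoFakeQuantizedModel(model)  # apply_avx2_fix=True by default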

enable_calibration_mode(mode=True)

Enables or disables calibration mode.

In calibration mode, the quantized model collects input data statistics, which are later used to initialize quantization parameters.

Parameters:

mode (bool, optional) – Whether to enable calibration mode. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel

enable_quantization_mode(mode=True)

Enables or disables fake quantization.

Fake quantization mode is enabled for all quantized layers. In this mode, these layers use fake quantization nodes to produce quantized weights and activations during the forward pass.

Parameters:

mode (bool, optional) – Whether to use fake quantization. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel

quantization_parameters()

Returns an iterator over model quantization parameters (quantization thresholds).

Returns:

An iterator over model quantization parameters.

Return type:

iterator over torch.nn.Parameter

Notes

Weights of quantized modules (like convolution weight tensor or linear layer weight matrix) are not quantization parameters.

regular_parameters()

Returns an iterator over model parameters excluding quantization parameters.

Returns:

An iterator over regular model parameters.

Return type:

iterator over torch.nn.Parameter

class FakeQuantizedModel(model, transform_patterns, activations_quantization_type, leaf_modules=None, **kwargs)

Bases: Module

Base FakeQuantized model class.

Inserts fake quantization nodes into the model and provides an interface for calibration and quantization aware training.

Use this class if you want to implement your own quantization scheme.

__init__(model, transform_patterns, activations_quantization_type, leaf_modules=None, **kwargs)
Parameters:
  • model (torch.nn.Module) – Model from which FakeQuantizedModel will be constructed.

  • transform_patterns (Sequence[Tuple[SubgraphTransformPattern, ...]]) – Sequence of groups of transformation patterns. Each group will be applied separately.

  • activations_quantization_type (QuantizationType) – Type of activation quantization. Activation quantization supports only layerwise (scalar) QuantizationGranularity and any QuantizationStrategy.

  • leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.

  • kwargs (kwargs) – Additional options.

enable_calibration_mode(mode=True)

Enables or disables calibration mode.

In calibration mode, the quantized model collects input data statistics, which are later used to initialize quantization parameters.

Parameters:

mode (bool, optional) – Whether to enable calibration mode. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel

enable_quantization_mode(mode=True)

Enables or disables fake quantization.

Fake quantization mode is enabled for all quantized layers. In this mode, these layers use fake quantization nodes to produce quantized weights and activations during the forward pass.

Parameters:

mode (bool, optional) – Whether to use fake quantization. Default value is True.

Returns:

self

Return type:

FakeQuantizedModel

quantization_parameters()

Returns an iterator over model quantization parameters (quantization thresholds).

Returns:

An iterator over model quantization parameters.

Return type:

iterator over torch.nn.Parameter

Notes

Weights of quantized modules (like convolution weight tensor or linear layer weight matrix) are not quantization parameters.

regular_parameters()

Returns an iterator over model parameters excluding quantization parameters.

Returns:

An iterator over regular model parameters.

Return type:

iterator over torch.nn.Parameter

class QuantizationGranularity(value)

Quantization granularity: layerwise or channelwise.

CHANNELWISE = 'CHANNELWISE'
LAYERWISE = 'LAYERWISE'

class QuantizationStrategy(value)

Quantization strategy: symmetric or asymmetric.

ASYMMETRIC = 'ASYMMETRIC'
SYMMETRIC = 'SYMMETRIC'

class CalibrationMethod(value)

Calibration method: how to calculate calibration thresholds.

MIN_MAX = 'MIN_MAX'

class RoundingFunction(value)

The function used to round values in quantization/dequantization transformations.

HALF_TO_EVEN = ('HALF_TO_EVEN', <rounding function>)
HALF_UP = ('HALF_UP', <rounding function>)

class QuantizationType(granularity, strategy, calibration_method=CalibrationMethod.MIN_MAX, rounding_function=RoundingFunction.HALF_UP, bitness=8, calibration_options=None, use_weight_scale_factors=False, use_bias_scale_factors=True)

Defines the type of quantization.

__init__(granularity, strategy, calibration_method=CalibrationMethod.MIN_MAX, rounding_function=RoundingFunction.HALF_UP, bitness=8, calibration_options=None, use_weight_scale_factors=False, use_bias_scale_factors=True)
Parameters:
  • granularity (QuantizationGranularity) – Quantization granularity: layerwise or channelwise.

  • strategy (QuantizationStrategy) – Quantization strategy: symmetric or asymmetric.

  • calibration_method (CalibrationMethod) – The method that is used to calculate thresholds.

  • rounding_function (RoundingFunction) – The function that is used to round values in quantization procedure.

  • bitness (int) – Bitness of quantization.

  • use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: this roughly doubles memory consumption compared to the off mode. False by default.

  • use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.

  • calibration_options (Optional[Dict[str, Any]]) –
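
For example, a channelwise symmetric 8-bit quantization type could be described as follows (a sketch; the import path is an assumption):

>>> from enot.quantization import (  # import path is an assumption
...     CalibrationMethod,
...     QuantizationGranularity,
...     QuantizationStrategy,
...     QuantizationType,
...     RoundingFunction,
... )
>>>
>>> qtype = QuantizationType(
...     granularity=QuantizationGranularity.CHANNELWISE,
...     strategy=QuantizationStrategy.SYMMETRIC,
...     calibration_method=CalibrationMethod.MIN_MAX,
...     rounding_function=RoundingFunction.HALF_TO_EVEN,
...     bitness=8,
... )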

float_model_from_quantized_model(quantized_model)

Creates a copy of the quantized model with fake quantization disabled.

Parameters:

quantized_model (Any) –

Return type:

Any
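
This is convenient for comparing the fake-quantized model against its float counterpart (a sketch; the import path and fq_model are assumptions):

>>> from enot.quantization import float_model_from_quantized_model  # import path is an assumption
>>>
>>> float_copy = float_model_from_quantized_model(fq_model)
>>> # `float_copy` runs without fake quantization, e.g. for accuracy comparison.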

optimal_quantization_scheme()

Returns the optimal quantization scheme for the locally installed GPU.

Return type:

str
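
Typical usage is to pass the result directly to TrtFakeQuantizedModel (a sketch; the import path and model are assumptions):

>>> from enot.quantization import TrtFakeQuantizedModel, optimal_quantization_scheme  # import path is an assumption
>>>
>>> fq_model = TrtFakeQuantizedModel(model, quantization_scheme=optimal_quantization_scheme())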

calibration

Functions and classes from the calibration module provide easy-to-use functionality to calibrate quantization thresholds in a fake-quantized model.

calibrate_quantized_model(quantized_model, dataloader, n_steps=None, epochs=1, sample_to_model_inputs=<function default_sample_to_model_inputs>, verbose=0)

Calibrates all quantization thresholds in a quantized model.

Parameters:
  • quantized_model (Any) – Model to calibrate quantization thresholds.

  • dataloader (torch.utils.data.DataLoader) – Dataloader which generates data that will be used to update model’s quantization thresholds.

  • n_steps (int or None, optional) – Number of total threshold calibration steps. Default value is None, which runs calibration on the whole dataloader for the number of epochs given by the epochs argument.

  • epochs (int, optional) – Number of total threshold calibration epochs. Ignored when the n_steps argument is not None. Default value is 1.

  • sample_to_model_inputs (Callable, optional) – Function to map dataloader samples to model input format. Default value is default_sample_to_model_inputs(). See more here.

  • verbose (int, optional) – Procedure verbosity level. 0 disables all messages, 1 enables tqdm progress bar logging, 2 gives additional information about calibration. Default value is 0.

Return type:

None

Notes

Before calling this function, your model should be prepared to be as close to its practical inference usage as possible. For example, it is your responsibility to call the eval method of your model if your inference requires it (e.g. when the model contains dropout layers).

Typically, it is better to calibrate quantization thresholds on validation-like data without augmentations (but with inference input preprocessing).
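
A typical calibration call might look like this (a sketch; the import path, model, and validation_dataloader are assumptions):

>>> from enot.quantization import TrtFakeQuantizedModel, calibrate_quantized_model  # import path is an assumption
>>>
>>> model.eval()  # prepare the model as it would be used for inference
>>> fq_model = TrtFakeQuantizedModel(model).cuda()
>>> calibrate_quantized_model(
...     quantized_model=fq_model,
...     dataloader=validation_dataloader,  # validation-like data without augmentations
...     n_steps=10,
...     verbose=1,
... )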

class calibration_context(quantized_model)

Bases: ContextDecorator

Context manager which enables and disables quantization threshold calibration procedure.

Within this context manager, the calibration procedure is enabled in the fake-quantized model. Exiting the context manager resets the calibration and quantization flags of all layers to their initial values.

__init__(quantized_model)
Parameters:

quantized_model (Any) – Fake-quantized model instance to calibrate.
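
The context manager can be used to drive calibration manually instead of calibrate_quantized_model() (a sketch; the import path, fq_model, and calibration_loader are assumptions, with the loader yielding (images, labels) pairs):

>>> import torch
>>> from enot.quantization import calibration_context  # import path is an assumption
>>>
>>> with torch.no_grad(), calibration_context(fq_model):
...     for images, _ in calibration_loader:
...         fq_model(images.cuda())  # statistics are collected while the context is active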

distillation

Functions and classes from the distillation module provide utilities and procedures for quantized model fine-tuning using knowledge distillation.

Helper functions for distillation:

class DistillationLayer

Bases: Module

Base class for all distillation marker layers.

The main purpose of this class is to mark tensors which one wants to distill. This layer’s forward function takes exactly one input tensor and returns it.

Examples

>>> import torch
>>> from torch import nn
>>> from enot.distillation.distillation_layer import DistillationLayer
>>>
>>> student_model = torch.hub.load('pytorch/vision:v0.10.0', 'mobilenet_v2', pretrained=True)
>>> features = student_model.features
>>> # Mark output of the first convolutional layer to distill.
>>> features[0] = nn.Sequential(*features[0], DistillationLayer())
>>> student_model.features = nn.Sequential(*features)

class QuantDistillationModule

Bases: DistillationLayer

Marker layer for quantization distillation.

To mark a tensor for distillation (i.e. to tell enot quantization procedures that they should use this tensor for distillation), insert this module into your model and wrap each tensor that you want to distill with this module's call.

Notes

You can call this module multiple times for different tensors, and your model can have multiple QuantDistillationModule instances.

Examples

>>> from torch import nn
>>> from enot.quantization.distillation.modules import QuantDistillationModule
>>>
>>> class MyModelLayer(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv1 = nn.Conv2d(4, 8, (3, 3))
...         self.conv2 = nn.Conv2d(8, 4, (3, 3))
...         self.distillation_module = QuantDistillationModule()
...     def forward(self, x):
...         conv1_out = self.conv1(x)
...         conv1_out = self.distillation_module(conv1_out)  # Marks conv1 output as distillation target.
...         conv2_out = self.conv2(conv1_out)
...         return conv2_out
...
>>> my_module = MyModelLayer()

class DistillationLayerSelectionStrategy(value)

Bases: Enum

Layer selection strategy for quantization distillation procedure.

This strategy tells which layers will be used for distillation process during enot quantization.

Layer selection for distillation has been shown to be important: some strategies are more robust in general, while others can provide better results in specific cases. For example, it is not a good idea to distill over detection network outputs, as this produces unstable gradients. However, distillation over classification network outputs is widely used and provides good results in multiple scenarios.

The four available regimes are the following:

  1. DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS

    • Finds the last quantized layers and distills over their outputs. This is the default option and is robust across different scenarios, including classification, segmentation, and detection.

  2. DistillationLayerSelectionStrategy.DISTILL_OUTPUTS

    • Finds all PyTorch tensors in the user model's outputs and distills over them. Useful for classification problems with cross-entropy loss (torch.nn.CrossEntropyLoss).

  3. DistillationLayerSelectionStrategy.DISTILL_ALL_QUANT_LAYERS

    • Finds all quantized layers and distills over their outputs. This is generally more robust to overfitting, but in practice converges worse for a small number of distillation epochs.

  4. DistillationLayerSelectionStrategy.DISTILL_USER_DEFINED_LAYERS

    • Distills over user-defined tensors in the model. The user should wrap such tensors with a QuantDistillationModule call; for more information, see its documentation.
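
The chosen strategy is usually passed to a distiller or to distillation_context (both documented below). A sketch, assuming QuantizationDistiller and the enum are importable from enot.quantization and fq_model and train_dataloader already exist:

>>> from enot.quantization import (  # import path is an assumption
...     DistillationLayerSelectionStrategy,
...     QuantizationDistiller,
... )
>>>
>>> distiller = QuantizationDistiller(
...     quantized_model=fq_model,
...     dataloader=train_dataloader,
...     distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_OUTPUTS,
... )
>>> distiller.distill()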

add_distillation_nodes_to_onnx_converted_model(traced_model, onnx_tensor_names)

Inserts distillation nodes (instances of QuantDistillationModule) after nodes defined by the user.

Parameters:
  • traced_model (torch.fx.GraphModule) – Model converted from ONNX to PyTorch using onnx2torch package. This model has to be converted by the convert() function with attach_onnx_mapping set to True.

  • onnx_tensor_names (list with str) – List of ONNX tensor names from the original ONNX model.

Return type:

None

class distillation_context(quantized_model, layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS)

Bases: ContextDecorator

Context manager which enables and disables distillation procedure.
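
A sketch of using the context manager around a custom fine-tuning loop (the import path and run_custom_fine_tuning are assumptions; the loop body depends on your training code and is omitted here):

>>> from enot.quantization import (  # import path is an assumption
...     DistillationLayerSelectionStrategy,
...     distillation_context,
... )
>>>
>>> with distillation_context(
...     fq_model,
...     layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_ALL_QUANT_LAYERS,
... ):
...     run_custom_fine_tuning(fq_model)  # hypothetical user-defined fine-tuning loop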

We are also open-sourcing our distillation procedures to let users customize them. All classes below are public, and you can view their source code by clicking on the [source] links.

class RMSELoss(eps=1e-06)[source]

Bases: Module

class DistillerInterface[source]

Distiller base interface.

class QuantizationDistiller(quantized_model, dataloader, optimizer=None, scheduler=None, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, distillation_criterion='RMSELoss', n_epochs=1, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, verbose=0)[source]

Bases: DistillerInterface

Quantized model distillation class with a simple distillation implementation.

__init__(quantized_model, dataloader, optimizer=None, scheduler=None, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, distillation_criterion='RMSELoss', n_epochs=1, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, verbose=0)[source]
Parameters:
  • quantized_model (FakeQuantizedModel) – Fake-quantized model.

  • dataloader (torch.utils.data.DataLoader) – Dataloader with model inputs for distillation.

  • optimizer (torch.optim.Optimizer or None, optional) – Optimizer instance.

  • scheduler (Scheduler or None, optional) – Scheduler instance.

  • distillation_layer_selection_strategy (DistillationLayerSelectionStrategy, optional) – Distillation layer selection strategy. Default value is DISTILL_LAST_QUANT_LAYERS.

  • distillation_criterion (Callable, optional) – Distillation criterion module. Default criterion is RMSE.

  • n_epochs (int, optional) – Number of epochs for distillation. Default value is 1.

  • device (str or torch.device, optional) – Device to use during distillation. Default value is “cuda:0”.

  • sample_to_model_inputs (Callable, optional) – Function to map dataloader samples to model input format. Default value is default_sample_to_model_inputs(). See more here.

  • logdir (str or Path or None, optional) – Save directory. Default value is None, which disables logging to directory.

  • save_every (int or None, optional) – Save checkpoint every n steps. Default value is None, which disables intermediate model checkpoints.

  • verbose (int, optional) – Verbosity level. Default value is 0.

distill()[source]

Launches distillation procedure.

Return type:

None

save(checkpoint_path)[source]

Saves quantized model state dict to the specified path.

Parameters:

checkpoint_path (Path) –

Return type:

None
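
Putting it together, a QuantizationDistiller run with an explicit optimizer might look like this (a sketch; the import paths, fq_model, and train_dataloader are assumptions):

>>> import torch
>>> from pathlib import Path
>>> from enot.quantization import QuantizationDistiller, RMSELoss  # import paths are assumptions
>>>
>>> optimizer = torch.optim.Adam(fq_model.quantization_parameters(), lr=5e-3)
>>> distiller = QuantizationDistiller(
...     quantized_model=fq_model,
...     dataloader=train_dataloader,
...     optimizer=optimizer,
...     distillation_criterion=RMSELoss(),
...     n_epochs=1,
...     device='cuda:0',
...     logdir='distillation_logs',
...     verbose=1,
... )
>>> distiller.distill()
>>> distiller.save(Path('quantized_model_checkpoint.pth'))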

class SequentialDistiller(*distillers)[source]

Bases: DistillerInterface

Compound distillation class which performs sequential distillation with multiple strategies.

__init__(*distillers)[source]
Parameters:

distillers (tuple with DistillerInterface) – Tuple with distiller instances.

distill()[source]

Launches distillation procedure.

Return type:

None
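
A sketch of sequential distillation with two pre-built distillers (distiller_a and distiller_b are assumed to be existing DistillerInterface instances; the import path is an assumption):

>>> from enot.quantization import SequentialDistiller  # import path is an assumption
>>>
>>> compound_distiller = SequentialDistiller(distiller_a, distiller_b)
>>> compound_distiller.distill()  # runs each wrapped distiller in order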

DefaultQuantizationDistiller

alias of ThresholdsAndScaleFactorsQuantizationDistiller

class ThresholdsAndScaleFactorsQuantizationDistiller(quantized_model, dataloader, learning_rate=0.005, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, n_batches_calibrate=10, tune_scale_factors=True, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, verbose=0, n_epochs=1)[source]

Bases: QuantizationDistiller

Quantization distiller for thresholds and scale factors with a default well-performing distillation configuration.

__init__(quantized_model, dataloader, learning_rate=0.005, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, n_batches_calibrate=10, tune_scale_factors=True, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, verbose=0, n_epochs=1)[source]
Parameters:
  • quantized_model (FakeQuantizedModel) – Fake-quantized model.

  • dataloader (torch.utils.data.DataLoader) – Dataloader with model inputs for distillation.

  • learning_rate (float, optional) – Learning rate. Default value is 5e-3.

  • device (str or torch.device, optional) – Device to use during distillation. Default value is “cuda:0”.

  • sample_to_model_inputs (Callable, optional) – Function to map dataloader samples to model input format. Default value is default_sample_to_model_inputs(). See more here.

  • logdir (str or Path or None, optional) – Save directory. Default value is None, which disables logging to directory.

  • save_every (int or None, optional) – Save checkpoint every n steps. Default value is None, which disables intermediate model checkpoints.

  • n_batches_calibrate (int, optional) – Number of batches used for calibration. Default is 10.

  • tune_scale_factors (bool, optional) – Whether to tune scale factors or not. True by default.

  • distillation_layer_selection_strategy (DistillationLayerSelectionStrategy, optional) – Distillation layer selection strategy. Default value is DISTILL_LAST_QUANT_LAYERS.

  • verbose (int, optional) – Verbosity level. Default value is 0.

  • n_epochs (int, optional) – Number of epochs for distillation.

distill()[source]

Launches distillation procedure.

Return type:

None

class ThresholdsQuantizationDistiller(quantized_model, dataloader, learning_rate=0.05, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, n_batches_calibrate=10, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, verbose=0, n_epochs=1)[source]

Bases: ThresholdsAndScaleFactorsQuantizationDistiller

Quantization distiller for thresholds with a default well-performing distillation configuration.

__init__(quantized_model, dataloader, learning_rate=0.05, device='cuda:0', sample_to_model_inputs=<function default_sample_to_model_inputs>, logdir=None, save_every=None, n_batches_calibrate=10, distillation_layer_selection_strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, verbose=0, n_epochs=1)[source]
Parameters:
  • quantized_model (FakeQuantizedModel) – Fake-quantized model.

  • dataloader (torch.utils.data.DataLoader) – Dataloader with model inputs for distillation.

  • learning_rate (float, optional) – Learning rate. Default value is 5e-2.

  • device (str or torch.device, optional) – Device to use during distillation. Default value is “cuda:0”.

  • sample_to_model_inputs (Callable, optional) – Function to map dataloader samples to model input format. Default value is default_sample_to_model_inputs(). See more here.

  • logdir (str or Path or None, optional) – Save directory. Default value is None, which disables logging to directory.

  • save_every (int or None, optional) – Save checkpoint every n steps. Default value is None, which disables intermediate model checkpoints.

  • n_batches_calibrate (int, optional) – Number of batches used for calibration. Default is 10.

  • distillation_layer_selection_strategy (DistillationLayerSelectionStrategy, optional) – Distillation layer selection strategy. Default value is DISTILL_LAST_QUANT_LAYERS.

  • verbose (int, optional) – Verbosity level. Default value is 0.

  • n_epochs (int, optional) – Number of epochs for distillation.
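
Usage mirrors ThresholdsAndScaleFactorsQuantizationDistiller; a minimal sketch (the import path, fq_model, and train_dataloader are assumptions):

>>> from enot.quantization import ThresholdsQuantizationDistiller  # import path is an assumption
>>>
>>> distiller = ThresholdsQuantizationDistiller(
...     quantized_model=fq_model,
...     dataloader=train_dataloader,
...     learning_rate=0.05,
...     device='cuda:0',
...     verbose=1,
... )
>>> distiller.distill()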