Automatic Quantization
The enot.quantization module contains functions for automatic quantization of user models. It is best suited for preparing user models for ENOT Lite INT8 engines.
With the enot.quantization package, you can automatically convert your PyTorch model to our intermediate representation, which allows you to perform multiple kinds of quantization, including vector quantization for TensorRT, OpenVINO and STM devices.
This package features automatic distillation for weight fine-tuning, automatic quantization threshold search as described in the Fast Adjustable Threshold paper, different methods of layer selection for distillation, and a number of fake-quantization algorithms.
The module provides fake quantized model classes (special model wrappers that implement quantization schemes) and two context managers, calibrate and distill.
The quantization procedure is as follows:
1. Wrap the float model in one of the fake quantized models listed above.
2. Write a calibration loop using the calibrate context decorator.
3. Write a distillation loop using the distill context decorator.
Example:
import itertools

import torch
from torch.optim import RAdam
from torch.optim.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm

from enot.quantization import TensorRTFakeQuantizedModel
from enot.quantization import calibrate
from enot.quantization import distill
from enot.quantization import RMSELoss

# wrap the float model into a fake quantized model
fq_model = TensorRTFakeQuantizedModel(model).cuda()

# calibration
with torch.no_grad(), calibrate(fq_model):
    for batch in itertools.islice(dataloader, 10):  # 10 batches for calibration
        batch = batch[0].cuda()
        fq_model(batch)

# distillation
n_epochs = 5
with distill(fq_model=fq_model, tune_weight_scale_factors=True) as (qdistill_model, params):
    optimizer = RAdam(params=params, lr=0.005, betas=(0.9, 0.95))
    scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=len(dataloader) * n_epochs)
    criterion = RMSELoss()

    for _ in range(n_epochs):
        for batch in (tqdm_it := tqdm(dataloader)):
            batch = batch[0].cuda()

            optimizer.zero_grad()
            loss: torch.Tensor = torch.tensor(0.0).cuda()
            for student_output, teacher_output in qdistill_model(batch):
                loss += criterion(student_output, teacher_output)

            loss.backward()
            optimizer.step()
            scheduler.step()
            tqdm_it.set_description(f'loss: {loss.item():.5f}')
Fake Quantized Models
- class TensorRTFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True, quantize_add=True, inplace=False, **options)
Bases: FakeQuantizedModel
Quantized TensorRT model class, which uses INT8 convolutions and fully-connected layers.
This class is used for quantization aware training.
- __init__(model, args=(), kwargs=None, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True, quantize_add=True, inplace=False, **options)
- Parameters:
model (nn.Module) – Model from which TensorRTFakeQuantizedModel will be constructed.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
quantization_scheme (str) – Specifies the GPU architecture for which quantization will be optimized. Pass pascal to optimize for the Pascal GPU architecture, or default to optimize for architectures newer than Pascal. You can also use optimal_quantization_scheme() to select the optimal quantization scheme automatically.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
quantize_add (bool) – Quantize add (+) or not. Default value is True.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
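For illustration, a minimal construction sketch; the import path of optimal_quantization_scheme() is assumed here, and model is a placeholder for your float network:
>>> from enot.quantization import TensorRTFakeQuantizedModel
>>> from enot.quantization import optimal_quantization_scheme  # assumed import path
>>>
>>> # pick the scheme matching the target GPU ('default' or 'pascal'), then wrap the model
>>> scheme = optimal_quantization_scheme()
>>> fq_model = TensorRTFakeQuantizedModel(model, quantization_scheme=scheme).cuda()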
- class OpenVINOFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, apply_avx2_fix=False, use_weight_scale_factors=False, use_bias_scale_factors=True, inplace=False, **options)
Bases: FakeQuantizedModel
Quantized OpenVINO model class, which uses INT8 convolutions and fully-connected layers.
This class is used for quantization aware training.
- __init__(model, args=(), kwargs=None, leaf_modules=None, apply_avx2_fix=False, use_weight_scale_factors=False, use_bias_scale_factors=True, inplace=False, **options)
- Parameters:
model (nn.Module) – Model from which OpenVINOFakeQuantizedModel will be constructed.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
apply_avx2_fix (bool) – Whether to fix quantization parameters for AVX2 kernels, or to skip the fix and maximize the metric for AVX-512 kernels. Without the fix we cannot guarantee stable results, because OpenVINO can mix AVX-512 and AVX2 kernels on a host with AVX-512 instructions. False by default. Please do not change this option unless you know what you are doing.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
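A minimal construction sketch (model is a placeholder; calibration and distillation then proceed exactly as in the TensorRT example at the top of this page):
>>> from enot.quantization import OpenVINOFakeQuantizedModel
>>>
>>> # wrap the float model for OpenVINO INT8 inference; apply_avx2_fix keeps its default (False)
>>> fq_model = OpenVINOFakeQuantizedModel(model).cuda()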
- class STMFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, use_weight_scale_factors=False, use_bias_scale_factors=False, inplace=False, **options)
Bases: FakeQuantizedModel
Fake quantization model for STM devices.
ONNXSim and ONNXRuntime (basic-level) optimizations MUST be applied to the exported ONNX model. Scale factors are disabled by default.
Examples
>>> import onnx
>>> import onnxruntime as rt
>>> import onnxsim
>>>
>>> fq_model = STMFakeQuantizedModel(...)
>>> ...
>>> torch.onnx.export(...)
>>>
>>> model, _ = onnxsim.simplify(model=ONNX_NAME)
>>> onnx.save(model, ONNX_NAME)
>>>
>>> sess_options = rt.SessionOptions()
>>> sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
>>> sess_options.optimized_model_filepath = ONNX_NAME
>>> session = rt.InferenceSession(ONNX_NAME, sess_options)
- __init__(model, args=(), kwargs=None, leaf_modules=None, use_weight_scale_factors=False, use_bias_scale_factors=False, inplace=False, **options)
- Parameters:
model (nn.Module) – Model for quantization.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. False by default.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
- class FakeQuantizedModel(model, args, kwargs, transform_patterns, activations_quantization_type, leaf_modules=None, inplace=False, **options)
Base FakeQuantized model class.
Inserts fake quantization nodes into the model and provides an interface for calibration and quantization-aware training.
- enable_calibration_mode(mode=True)
Enables or disables calibration mode.
In calibration mode, the quantized model collects input data statistics, which are used to initialize quantization parameters.
- Parameters:
mode (bool, optional) – Whether to enable calibration mode. Default value is True.
- Returns:
self
- Return type:
FakeQuantizedModel
- enable_quantization_mode(mode=True)
Enables or disables fake quantization.
Fake quantization mode is enabled for all quantized layers. In this regime, these layers use fake quantization nodes to produce quantized weights and activations during the forward pass.
- Parameters:
mode (bool, optional) – Whether to use fake quantization. Default value is True.
- Returns:
self
- Return type:
FakeQuantizedModel
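As a hedged illustration of these two switches (the calibrate context manager documented below is the recommended way and toggles the same flags for you):
>>> fq_model.enable_calibration_mode(True)   # collect input statistics
>>> with torch.no_grad():
>>>     for batch in itertools.islice(dataloader, 10):
>>>         fq_model(batch[0].cuda())
>>> fq_model.enable_calibration_mode(False)  # stop collecting statistics
>>> fq_model.enable_quantization_mode(True)  # fake-quantized forward passes from now on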
- quantization_parameters()
Returns an iterator over model quantization parameters (quantization thresholds).
- Returns:
An iterator over model quantization parameters.
- Return type:
iterator over torch.nn.Parameter
Notes
Weights of quantized modules (like convolution weight tensor or linear layer weight matrix) are not quantization parameters.
- regular_parameters()
Returns an iterator over model parameters excluding quantization parameters.
- Returns:
An iterator over regular model parameters.
- Return type:
iterator over torch.nn.Parameter
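For illustration, the two parameter iterators can be used to build separate optimizer parameter groups; the learning rates below are arbitrary assumptions:
>>> from torch.optim import RAdam
>>>
>>> optimizer = RAdam([
...     {'params': fq_model.regular_parameters(), 'lr': 1e-4},
...     {'params': fq_model.quantization_parameters(), 'lr': 1e-3},
... ])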
- float_model_from_quantized_model(quantized_model)
Creates a copy of the quantized model with fake quantization disabled.
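A minimal usage sketch, assuming the function is importable from enot.quantization:
>>> from enot.quantization import float_model_from_quantized_model
>>>
>>> # recover a plain float model, e.g. for accuracy comparison against the quantized one
>>> float_model = float_model_from_quantized_model(fq_model)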
Calibration
- class calibrate(fq_model)
Bases: ContextDecorator
Context manager which enables and disables quantization threshold calibration procedure.
Within this context manager, the calibration procedure is enabled in all FakeQuantizedModel objects. Exiting the context manager resets the calibration and quantization flags in all layers to their initial values.
Examples
>>> from enot.quantization import calibrate
>>>
>>> with torch.no_grad(), calibrate(fq_model):
>>>     for batch in itertools.islice(dataloader, 10):  # 10 batches for calibration
>>>         batch = batch[0].cuda()
>>>         fq_model(batch)
- __init__(fq_model)
- Parameters:
fq_model (FakeQuantizedModel) – Fake-quantized model instance to calibrate.
Distillation
The listed classes and functions provide utilities and procedures for quantized model fine-tuning using knowledge distillation.
- class distill(fq_model, model=None, strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, tune_weight_scale_factors=False)
Bases: ContextDecorator
Context manager for distillation of a FakeQuantizedModel.
Returns a pair consisting of a special module and the parameters for backpropagation. The module accepts data and returns (student_output, teacher_output) pairs, which should be passed to the loss function for distillation.
Examples
>>> from enot.quantization import distill
>>> from enot.quantization import RMSELoss
>>>
>>> with distill(fq_model=fq_model, tune_weight_scale_factors=True) as (qdistill_model, params):
>>>     optimizer = RAdam(params=params, lr=0.005, betas=(0.9, 0.95))
>>>     scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=len(dataloader) * n_epochs)
>>>     criterion = RMSELoss()
>>>
>>>     for _ in range(n_epochs):
>>>         for batch in dataloader:
>>>             batch = batch[0]
>>>
>>>             optimizer.zero_grad()
>>>             loss: torch.Tensor = torch.tensor(0.0)
>>>             for student_output, teacher_output in qdistill_model(batch):
>>>                 loss += criterion(student_output, teacher_output)
>>>
>>>             loss.backward()
>>>             optimizer.step()
>>>             scheduler.step()
- __init__(fq_model, model=None, strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, tune_weight_scale_factors=False)
- Parameters:
fq_model (FakeQuantizedModel) – Fake-quantized model.
model (Optional[Union[nn.Module, GraphModule]]) – Optional teacher model for distillation. If None, the float fq_model is used as the teacher model. Default value is None.
strategy (DistillationLayerSelectionStrategy) – Distillation layer selection strategy. Default value is DISTILL_LAST_QUANT_LAYERS.
tune_weight_scale_factors (bool) – Whether to tune weight scale factors. False by default. Note: enabling weight scale factors in a fake quantized model roughly doubles memory consumption compared to the disabled mode.
Helper functions for distillation:
- class QuantDistillationModule
Marker layer for quantization distillation.
To mark a tensor for distillation (i.e. to tell enot quantization procedures that they should use this tensor for distillation), insert this module into your model and wrap every tensor you want to distill in a call to this module.
Notes
You can call this module multiple times for different tensors, and your model can contain multiple QuantDistillationModule instances.
Examples
>>> from torch import nn
>>> from enot.quantization import QuantDistillationModule
>>>
>>> class MyModelLayer(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv1 = nn.Conv2d(4, 8, (3, 3))
...         self.conv2 = nn.Conv2d(8, 4, (3, 3))
...         self.distillation_module = QuantDistillationModule()
...     def forward(self, x):
...         conv1_out = self.conv1(x)
...         # mark conv1 output as distillation target:
...         conv1_out = self.distillation_module(conv1_out)
...         conv2_out = self.conv2(conv1_out)
...         return conv2_out
...
>>> my_module = MyModelLayer()
- class DistillationLayerSelectionStrategy(value)
Bases: Enum
Layer selection strategy for quantization distillation procedure.
This strategy tells which layers will be used for distillation process during quantization.
Layer selection for distillation has proven to be important: some strategies are more robust in general, while others can provide better results in specific cases. For example, it is not a good idea to distill over detection network outputs, as this produces unstable gradients. However, distillation over classification network outputs is widely used and provides good results in many scenarios.
The four available regimes are the following:
DISTILL_LAST_QUANT_LAYERS
Finds the last quantized layers and distills over their outputs. Default option; robust across different scenarios including classification, segmentation, and detection.
DISTILL_OUTPUTS
Finds all PyTorch tensors in the user model's outputs and distills over them. Useful for classification problems with cross-entropy loss (torch.nn.CrossEntropyLoss).
DISTILL_ALL_QUANT_LAYERS
Finds all quantized layers and distills over their outputs. This is generally more robust to overfitting, but in practice converges worse for a small number of distillation epochs.
DISTILL_USER_DEFINED_LAYERS
Distillation over user-defined tensors in the model. The user should wrap such tensors in a QuantDistillationModule call. For more information, see its documentation.
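For illustration, a strategy can be selected when entering the distillation context; the rest of the loop is the same as in the distill example above:
>>> from enot.quantization import DistillationLayerSelectionStrategy
>>> from enot.quantization import distill
>>>
>>> with distill(
>>>     fq_model=fq_model,
>>>     strategy=DistillationLayerSelectionStrategy.DISTILL_OUTPUTS,
>>> ) as (qdistill_model, params):
>>>     ...  # optimizer and distillation loop as shown above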
- class RMSELoss(eps=1e-06)
Root-mean-square error (RMSE) loss.
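RMSE is the square root of the mean squared error. A hypothetical, functionally equivalent sketch (the eps handling is an assumption mirroring the constructor signature above):
>>> import torch
>>> from torch import nn
>>>
>>> class RMSELossSketch(nn.Module):
...     # hypothetical reimplementation: sqrt(MSE + eps), with eps for numerical stability
...     def __init__(self, eps=1e-06):
...         super().__init__()
...         self.mse = nn.MSELoss()
...         self.eps = eps
...     def forward(self, prediction, target):
...         return torch.sqrt(self.mse(prediction, target) + self.eps)
...
>>> criterion = RMSELossSketch()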