Automatic Quantization
The enot.quantization module contains functions for automatic quantization of user models. It is best suited for preparing user models for ENOT Lite INT8 engines.
With the enot.quantization package, you can automatically convert your PyTorch model to our intermediate representation, which allows you to perform multiple kinds of quantization, including vector quantization for TensorRT, OpenVINO and STM devices.
This package features automatic distillation for weight fine-tuning, automatic quantization threshold search as described in the Fast Adjustable Threshold paper, different methods of layer selection for distillation, and a number of fake-quantization algorithms.
The module provides fake quantized model classes (special model wrappers that implement quantization schemes) and two context managers, calibrate and distill.
The quantization procedure is as follows:
1. Wrap the float model in one of the fake quantized models listed above.
2. Write a calibration loop using the calibrate context decorator.
3. Write a distillation loop using the distill context decorator.
Example:
import itertools

import torch
from torch.optim import RAdam
from torch.optim.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm

from enot.quantization import TensorRTFakeQuantizedModel
from enot.quantization import calibrate
from enot.quantization import distill
from enot.quantization import RMSELoss

# wrap the float model into a fake quantized model
fq_model = TensorRTFakeQuantizedModel(model).cuda()

# calibration
with torch.no_grad(), calibrate(fq_model):
    for batch in itertools.islice(dataloader, 10):  # 10 batches for calibration
        batch = batch[0].cuda()
        fq_model(batch)

# distillation
n_epochs = 5
with distill(fq_model=fq_model, tune_weight_scale_factors=True) as (qdistill_model, params):
    optimizer = RAdam(params=params, lr=0.005, betas=(0.9, 0.95))
    scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=len(dataloader) * n_epochs)
    criterion = RMSELoss()

    for _ in range(n_epochs):
        for batch in (tqdm_it := tqdm(dataloader)):
            batch = batch[0].cuda()

            optimizer.zero_grad()
            loss: torch.Tensor = torch.tensor(0.0).cuda()
            for student_output, teacher_output in qdistill_model(batch):
                loss += criterion(student_output, teacher_output)

            loss.backward()
            optimizer.step()
            scheduler.step()
            tqdm_it.set_description(f'loss: {loss.item():.5f}')
Fake Quantized Models
- class TensorRTFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True, quantize_add=True, inplace=False, **options)
Bases: FakeQuantizedModel
Quantized TensorRT model class, which uses INT8 convolutions and fully-connected layers.
This class is used for quantization aware training.
- __init__(model, args=(), kwargs=None, leaf_modules=None, quantization_scheme='default', use_weight_scale_factors=False, use_bias_scale_factors=True, quantize_add=True, inplace=False, **options)
- Parameters:
model (nn.Module) – Model from which TensorRTFakeQuantizedModel will be constructed.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
quantization_scheme (str) – Specifies the GPU architecture for which quantization will be optimized. Pass pascal to optimize for the Pascal GPU architecture, or default to optimize for architectures newer than Pascal. You can also use optimal_quantization_scheme() to select the optimal quantization scheme automatically.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
quantize_add (bool) – Quantize add (+) or not. Default value is True.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
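For illustration, a minimal construction sketch; the import path of optimal_quantization_scheme() is assumed here, and model is a placeholder for your float network:
>>> from enot.quantization import TensorRTFakeQuantizedModel
>>> from enot.quantization import optimal_quantization_scheme  # assumed import path
>>>
>>> # pick the scheme matching the target GPU ('default' or 'pascal'), then wrap the model
>>> scheme = optimal_quantization_scheme()
>>> fq_model = TensorRTFakeQuantizedModel(model, quantization_scheme=scheme).cuda()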
- class OpenVINOFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, apply_avx2_fix=False, use_weight_scale_factors=False, use_bias_scale_factors=True, inplace=False, **options)
Bases: FakeQuantizedModel
Quantized OpenVINO model class, which uses INT8 convolutions and fully-connected layers.
This class is used for quantization aware training.
- __init__(model, args=(), kwargs=None, leaf_modules=None, apply_avx2_fix=False, use_weight_scale_factors=False, use_bias_scale_factors=True, inplace=False, **options)
- Parameters:
model (nn.Module) – Model from which OpenVINOFakeQuantizedModel will be constructed.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
apply_avx2_fix (bool) – Whether to fix quantization parameters for AVX2 kernels, or to skip the fix and maximize the metric for AVX-512 kernels. Without the fix we cannot guarantee stable results, because OpenVINO can mix AVX-512 and AVX2 kernels on a host with AVX-512 instructions. False by default. Please do not change this option unless you know what you are doing.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. True by default.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
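A minimal construction sketch (model is a placeholder; calibration and distillation then proceed exactly as in the TensorRT example at the top of this page):
>>> from enot.quantization import OpenVINOFakeQuantizedModel
>>>
>>> # wrap the float model for OpenVINO INT8 inference; apply_avx2_fix keeps its default (False)
>>> fq_model = OpenVINOFakeQuantizedModel(model).cuda()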
- class STMFakeQuantizedModel(model, args=(), kwargs=None, leaf_modules=None, use_weight_scale_factors=False, use_bias_scale_factors=False, inplace=False, **options)
Bases: FakeQuantizedModel
Fake quantization model for STM devices.
ONNXSim and ONNXRuntime (basic-level) optimizations MUST be applied to the exported ONNX model. Scale factors are disabled by default.
Examples
>>> import onnx
>>> import onnxruntime as rt
>>> import onnxsim
>>>
>>> fq_model = STMFakeQuantizedModel(...)
>>> ...
>>> torch.onnx.export(...)
>>>
>>> model, _ = onnxsim.simplify(model=ONNX_NAME)
>>> onnx.save(model, ONNX_NAME)
>>>
>>> sess_options = rt.SessionOptions()
>>> sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
>>> sess_options.optimized_model_filepath = ONNX_NAME
>>> session = rt.InferenceSession(ONNX_NAME, sess_options)
- __init__(model, args=(), kwargs=None, leaf_modules=None, use_weight_scale_factors=False, use_bias_scale_factors=False, inplace=False, **options)
- Parameters:
model (nn.Module) – Model for quantization.
args (Tuple) – Positional arguments for model.
kwargs (Dict[str, Any]) – Keyword arguments for model.
leaf_modules (list with types of modules or instances of torch.nn.Module, optional) – Types of modules or module instances that must be interpreted as leaf modules while tracing.
use_weight_scale_factors (bool) – Whether to use scale factors for weight tuning. Use with care: it roughly doubles memory consumption compared to the disabled mode. False by default.
use_bias_scale_factors (bool) – Whether to use scale factors to train biases. False by default.
inplace (bool) – Enables inplace modification of input model (reduces memory consumption). False by default.
- class FakeQuantizedModel(model, args, kwargs, transform_patterns, activations_quantization_type, leaf_modules=None, inplace=False, **options)
Base FakeQuantized model class.
Inserts fake quantization nodes into the model and provides an interface for calibration and quantization-aware training.
- enable_calibration_mode(mode=True)
Enables or disables calibration mode.
In calibration mode, the quantized model collects input data statistics, which are used to initialize quantization parameters.
- Parameters:
mode (bool, optional) – Whether to enable calibration mode. Default value is True.
- Returns:
self
- Return type:
FakeQuantizedModel
- enable_quantization_mode(mode=True)
Enables or disables fake quantization.
Fake quantization mode is enabled for all quantized layers. In this regime, these layers use fake quantization nodes to produce quantized weights and activations during the forward pass.
- Parameters:
mode (bool, optional) – Whether to use fake quantization. Default value is True.
- Returns:
self
- Return type:
FakeQuantizedModel
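As a hedged illustration of these two switches (the calibrate context manager documented below is the recommended way and toggles the same flags for you):
>>> fq_model.enable_calibration_mode(True)   # collect input statistics
>>> with torch.no_grad():
>>>     for batch in itertools.islice(dataloader, 10):
>>>         fq_model(batch[0].cuda())
>>> fq_model.enable_calibration_mode(False)  # stop collecting statistics
>>> fq_model.enable_quantization_mode(True)  # fake-quantized forward passes from now on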
- quantization_parameters()
Returns an iterator over model quantization parameters (quantization thresholds).
- Returns:
An iterator over model quantization parameters.
- Return type:
iterator over torch.nn.Parameter
Notes
Weights of quantized modules (like convolution weight tensor or linear layer weight matrix) are not quantization parameters.
- regular_parameters()
Returns an iterator over model parameters excluding quantization parameters.
- Returns:
An iterator over regular model parameters.
- Return type:
iterator over torch.nn.Parameter
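For illustration, the two parameter iterators can be used to build separate optimizer parameter groups; the learning rates below are arbitrary assumptions:
>>> from torch.optim import RAdam
>>>
>>> optimizer = RAdam([
...     {'params': fq_model.regular_parameters(), 'lr': 1e-4},
...     {'params': fq_model.quantization_parameters(), 'lr': 1e-3},
... ])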
- float_model_from_quantized_model(quantized_model)
Creates a copy of the quantized model with fake quantization disabled.
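A minimal usage sketch, assuming the function is importable from enot.quantization:
>>> from enot.quantization import float_model_from_quantized_model
>>>
>>> # recover a plain float model, e.g. for accuracy comparison against the quantized one
>>> float_model = float_model_from_quantized_model(fq_model)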
Calibration
- class calibrate(fq_model)
Bases: ContextDecorator
Context manager which enables and disables quantization threshold calibration procedure.
Within this context manager, the calibration procedure is enabled in all FakeQuantizedModel objects. Exiting the context manager resets the calibration and quantization flags in all layers to their initial values.
Examples
>>> from enot.quantization import calibrate
>>>
>>> with torch.no_grad(), calibrate(fq_model):
>>>     for batch in itertools.islice(dataloader, 10):  # 10 batches for calibration
>>>         batch = batch[0].cuda()
>>>         fq_model(batch)
- __init__(fq_model)
- Parameters:
fq_model (FakeQuantizedModel) – Fake-quantized model instance to calibrate.
Distillation
The listed classes and functions provide utilities and procedures for quantized model fine-tuning using knowledge distillation.
- class distill(fq_model, model=None, strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, tune_weight_scale_factors=False)
Bases: ContextDecorator
Context manager for distillation of a FakeQuantizedModel.
Returns a pair consisting of a special module and the parameters for backpropagation. The module accepts data and returns (student_output, teacher_output) pairs, which should be passed to the loss function for distillation.
Examples
>>> from enot.quantization import distill
>>> from enot.quantization import RMSELoss
>>>
>>> with distill(fq_model=fq_model, tune_weight_scale_factors=True) as (qdistill_model, params):
>>>     optimizer = RAdam(params=params, lr=0.005, betas=(0.9, 0.95))
>>>     scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=len(dataloader) * n_epochs)
>>>     criterion = RMSELoss()
>>>
>>>     for _ in range(n_epochs):
>>>         for batch in dataloader:
>>>             batch = batch[0]
>>>
>>>             optimizer.zero_grad()
>>>             loss: torch.Tensor = torch.tensor(0.0)
>>>             for student_output, teacher_output in qdistill_model(batch):
>>>                 loss += criterion(student_output, teacher_output)
>>>
>>>             loss.backward()
>>>             optimizer.step()
>>>             scheduler.step()
- __init__(fq_model, model=None, strategy=DistillationLayerSelectionStrategy.DISTILL_LAST_QUANT_LAYERS, tune_weight_scale_factors=False)
- Parameters:
fq_model (FakeQuantizedModel) – Fake-quantized model.
model (Optional[Union[nn.Module, GraphModule]]) – Optional teacher model for distillation. If None, the float fq_model is used as the teacher model. Default value is None.
strategy (DistillationLayerSelectionStrategy) – Distillation layer selection strategy. Default value is DISTILL_LAST_QUANT_LAYERS.
tune_weight_scale_factors (bool) – Whether to tune weight scale factors. False by default. Note: enabling weight scale factors in a fake quantized model roughly doubles memory consumption compared to the disabled mode.
Helper functions for distillation:
- class QuantDistillationModule
Marker layer for quantization distillation.
To mark a tensor for distillation (i.e. to tell enot quantization procedures that they should use this tensor for distillation), insert this module into your model and wrap every tensor you want to distill in a call to this module.
Notes
You can call this module multiple times for different tensors, and your model can contain multiple QuantDistillationModule instances.
Examples
>>> from torch import nn
>>> from enot.quantization import QuantDistillationModule
>>>
>>> class MyModelLayer(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv1 = nn.Conv2d(4, 8, (3, 3))
...         self.conv2 = nn.Conv2d(8, 4, (3, 3))
...         self.distillation_module = QuantDistillationModule()
...     def forward(self, x):
...         conv1_out = self.conv1(x)
...         # mark conv1 output as distillation target:
...         conv1_out = self.distillation_module(conv1_out)
...         conv2_out = self.conv2(conv1_out)
...         return conv2_out
...
>>> my_module = MyModelLayer()
- class DistillationLayerSelectionStrategy(value)
Bases: Enum
Layer selection strategy for quantization distillation procedure.
This strategy tells which layers will be used for distillation process during quantization.
Layer selection for distillation has proven to be important: some strategies are more robust in general, while others can provide better results in specific cases. For example, it is not a good idea to distill over detection network outputs, as this produces unstable gradients. However, distillation over classification network outputs is widely used and provides good results in many scenarios.
The four available regimes are the following:
DISTILL_LAST_QUANT_LAYERS
Finds the last quantized layers and distills over their outputs. Default option; robust across different scenarios including classification, segmentation, and detection.
DISTILL_OUTPUTS
Finds all PyTorch tensors in the user model's outputs and distills over them. Useful for classification problems with cross-entropy loss (torch.nn.CrossEntropyLoss).
DISTILL_ALL_QUANT_LAYERS
Finds all quantized layers and distills over their outputs. This is generally more robust to overfitting, but in practice converges worse for a small number of distillation epochs.
DISTILL_USER_DEFINED_LAYERS
Distillation over user-defined tensors in the model. The user should wrap such tensors in a QuantDistillationModule call. For more information, see its documentation.
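For illustration, a strategy can be selected when entering the distillation context; the rest of the loop is the same as in the distill example above:
>>> from enot.quantization import DistillationLayerSelectionStrategy
>>> from enot.quantization import distill
>>>
>>> with distill(
>>>     fq_model=fq_model,
>>>     strategy=DistillationLayerSelectionStrategy.DISTILL_OUTPUTS,
>>> ) as (qdistill_model, params):
>>>     ...  # optimizer and distillation loop as shown above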
- class RMSELoss(eps=1e-06)
Root-mean-square error (RMSE) loss.
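RMSE is the square root of the mean squared error. A hypothetical, functionally equivalent sketch (the eps handling is an assumption mirroring the constructor signature above):
>>> import torch
>>> from torch import nn
>>>
>>> class RMSELossSketch(nn.Module):
...     # hypothetical reimplementation: sqrt(MSE + eps), with eps for numerical stability
...     def __init__(self, eps=1e-06):
...         super().__init__()
...         self.mse = nn.MSELoss()
...         self.eps = eps
...     def forward(self, prediction, target):
...         return torch.sqrt(self.mse(prediction, target) + self.eps)
...
>>> criterion = RMSELossSketch()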