.. _ENOT Latency Server:

###################
ENOT Latency Server
###################

**ENOT Latency Server** is a small, simple package based on aiohttp and designed for remote (hardware-aware) latency measurement.
It is published under the Apache 2.0 license, so anyone can view and modify its code.

Installation
============

The ENOT Latency Server package can be installed from PyPI:

::

    pip install enot-latency-server

Overview
========

The package contains a class for the server side and a function for the client side:

.. code-block:: python

    from enot_latency_server.server import LatencyServer
    from enot_latency_server.client import measure_latency_remote

``measure_latency_remote`` has the following signature:

.. code-block:: python

    def measure_latency_remote(
        model: bytes,
        host: str = _DEFAULT_HOST,
        port: int = _DEFAULT_PORT,
        endpoint: str = _DEFAULT_ENDPOINT,
        timeout: Optional[float] = None,
        **kwargs,
    ) -> Dict[str, float]:
        ...

It takes a model as bytes, sends it to a latency server at the specified address, and returns the measured latency as a dict.

``LatencyServer`` is a base class with a single abstract method that should be implemented:

.. code-block:: python

    def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
        ...

This method takes a model as bytes together with the kwargs forwarded from ``measure_latency_remote`` and measures the latency of the model on a particular device/framework/task/etc.
By our convention, the method should return the time in **milliseconds** in the form ``{'latency': latency}``, but returning some other metric is also fine.
You can also include additional values, such as memory consumption: ``{'latency': latency, 'memory': memory}``.

.. note::

    When something goes wrong, raise one of the ``aiohttp.web`` HTTP exceptions so that the client receives a correct response; see https://docs.aiohttp.org/en/latest/web_exceptions.html for more details and the error-handling sketch at the end of this page.

The client-server interaction is shown in the diagram:

.. mermaid::
    :align: center

    sequenceDiagram
        Client ->> LatencyServer: measure_latency_remote(model: bytes)
        LatencyServer -->> measure_latency(): model (bytes)
        measure_latency() -->> LatencyServer: latency (Dict[str, float])
        LatencyServer ->> Client: latency (Dict[str, float])

Example: ONNX Runtime CPU Provider Latency
==========================================

Extending the server:

.. code-block:: python

    import time
    from typing import Dict

    import numpy as np
    import onnxruntime

    from enot_latency_server.server import LatencyServer


    class ONNXRuntimeCPULatencyServer(LatencyServer):
        def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
            # kwargs come from measure_latency_remote.
            sess = onnxruntime.InferenceSession(model, providers=['CPUExecutionProvider'])

            # Build a random input matching the model's first input.
            model_input = sess.get_inputs()[0]
            input_shape = model_input.shape

            start = time.time()
            sess.run(None, {model_input.name: np.random.rand(*input_shape).astype(np.float32)})
            end = time.time()

            # Return latency in milliseconds, following the package convention.
            return {'latency': (end - start) * 1000.0}


    server = ONNXRuntimeCPULatencyServer(host='192.168.0.100', port=5450)
    server.run()

Client side:

.. code-block:: python

    import onnx

    from enot_latency_server.client import measure_latency_remote

    model = onnx.load('model.onnx')

    latency = measure_latency_remote(
        model=model.SerializeToString(),
        host='192.168.0.100',
        port=5450,
    )['latency']
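
Error handling
==============

As mentioned in the note above, measurement failures should be reported to the client by raising an ``aiohttp.web`` HTTP exception.
The snippet below is a minimal sketch, not part of the package: the ``SafeONNXRuntimeCPULatencyServer`` class is a hypothetical name, and it simply wraps the ``ONNXRuntimeCPULatencyServer`` example from above in a try/except.

.. code-block:: python

    from typing import Dict

    from aiohttp import web


    class SafeONNXRuntimeCPULatencyServer(ONNXRuntimeCPULatencyServer):
        def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
            try:
                return super().measure_latency(model, **kwargs)
            except Exception as error:
                # Raising an aiohttp web exception lets the client receive a
                # proper HTTP error response instead of a malformed reply.
                raise web.HTTPInternalServerError(reason=str(error))

On the client side such a failure surfaces as an error raised by ``measure_latency_remote`` rather than a silently wrong latency value.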