ENOT Latency Server

ENOT Latency Server is a small, simple package based on aiohttp and designed for remote (hardware-aware) latency measurement. It is published under the Apache 2.0 license, so anyone can view and modify its code.

Installation

The ENOT Latency Server package can be installed from PyPI:

pip install enot-latency-server

Overview

The package contains a class for the server side and a function for the client side:

from enot_latency_server.server import LatencyServer
from enot_latency_server.client import measure_latency_remote

measure_latency_remote has the following signature:

def measure_latency_remote(
    model: bytes,
    host: str = _DEFAULT_HOST,
    port: int = _DEFAULT_PORT,
    endpoint: str = _DEFAULT_ENDPOINT,
    timeout: Optional[float] = None,
) -> Dict[str, float]:
    ...

It takes a model as bytes, sends it to a latency server at the specified address, and returns the latency as a dict.
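For example, assuming a latency server is already running (the host and port below are placeholders), the raw bytes of a model file can be sent like this:

from enot_latency_server.client import measure_latency_remote

# Read the model file as raw bytes and send it to the server.
with open('model.onnx', 'rb') as model_file:
    result = measure_latency_remote(model_file.read(), host='127.0.0.1', port=5450)

print(result['latency'])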

LatencyServer is a base class with a single abstract method that should be implemented:

@staticmethod
def measure_latency(model: bytes) -> Dict[str, float]:
    ...

This function takes a model as bytes and measures the latency of this model on a particular device/framework/task/etc. By convention, the function should return time in milliseconds in the form {'latency': latency}, but other metrics are also acceptable. You can also include anything else, such as memory consumption: {'latency': latency, 'memory': memory}.

Note

If something goes wrong during measurement, raise web.HTTPBadRequest so that the client receives the correct response.
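For example, a minimal sketch of such error handling (run_and_time_model below is a hypothetical placeholder for your actual measurement code):

from typing import Dict

from aiohttp import web

from enot_latency_server.server import LatencyServer


class SafeLatencyServer(LatencyServer):
    @staticmethod
    def measure_latency(model: bytes) -> Dict[str, float]:
        if not model:
            # Tell the client that the request was malformed (HTTP 400).
            raise web.HTTPBadRequest(reason='received an empty model')
        try:
            latency = run_and_time_model(model)  # hypothetical measurement helper
        except Exception as exc:
            # Measurement failures are also reported back as HTTP 400.
            raise web.HTTPBadRequest(reason=str(exc))
        return {'latency': latency}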

The client-server interaction is shown in the diagram:

sequenceDiagram
    Client ->> LatencyServer: measure_latency_remote(model: bytes)
    LatencyServer -->> measure_latency(): model (bytes)
    measure_latency() -->> LatencyServer: latency (Dict[str, float])
    LatencyServer ->> Client: latency (Dict[str, float])

Example: ONNX Runtime CPU Provider Latency

Extending the server:

import time
from typing import Dict

import numpy as np
import onnxruntime

from enot_latency_server.server import LatencyServer


class ONNXRuntimeCPULatencyServer(LatencyServer):
    @staticmethod
    def measure_latency(model: bytes) -> Dict[str, float]:
        # Load the received ONNX model into an ONNX Runtime CPU session.
        sess = onnxruntime.InferenceSession(model, providers=['CPUExecutionProvider'])
        model_input = sess.get_inputs()[0]

        # Run a single inference with random data and time it.
        input_data = np.random.rand(*model_input.shape).astype(np.float32)
        start = time.time()
        sess.run(None, {model_input.name: input_data})
        end = time.time()

        # Return the elapsed time in milliseconds, as the convention requires.
        return {'latency': (end - start) * 1000.0}


server = ONNXRuntimeCPULatencyServer(host='192.168.0.100', port=5450)
server.run()
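Note that this example times a single inference run to keep the code short; in practice you may want to add warm-up runs and average over several inferences for more stable numbers.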

Client side:

import onnx
from enot_latency_server.client import measure_latency_remote

model = onnx.load('model.onnx')
latency = measure_latency_remote(
    model=model.SerializeToString(),
    host='192.168.0.100',
    port=5450,
)['latency']
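The returned dict contains every key produced by the server's measure_latency implementation, so any extra metrics (for example 'memory', as in the convention above) can be read the same way.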