ENOT Latency Server
ENOT Latency Server is a small, simple package based on aiohttp and designed for remote, hardware-aware latency measurement. It is published under the Apache 2.0 license, so anyone can view and modify its code.
Installation
The ENOT Latency Server package can be installed from PyPI:
pip install enot-latency-server
Overview
The package contains a class for the server side and a function for the client side:
from enot_latency_server.server import LatencyServer
from enot_latency_server.client import measure_latency_remote
measure_latency_remote has the following signature:
def measure_latency_remote(
    model: bytes,
    host: str = _DEFAULT_HOST,
    port: int = _DEFAULT_PORT,
    endpoint: str = _DEFAULT_ENDPOINT,
    timeout: Optional[float] = None,
    **kwargs,
) -> Dict[str, float]:
    ...
It takes a model as bytes, sends it to a latency server at the specified address, and returns the latency as a dict.
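For example, a call could look like the following sketch (the model file name, address, and the num_runs keyword are illustrative assumptions, not part of the package; extra keyword arguments are passed through to the server side):
from enot_latency_server.client import measure_latency_remote

with open('model.onnx', 'rb') as f:  # hypothetical model file
    model_bytes = f.read()

result = measure_latency_remote(
    model=model_bytes,
    host='192.168.0.100',  # assumed server address and port
    port=5450,
    timeout=60.0,
    num_runs=10,  # hypothetical extra kwarg, forwarded to the server
)
print(result['latency'])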
LatencyServer is a base class with a single abstract method that should be implemented:
def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
    ...
This method takes a model as bytes, along with the kwargs passed to the measure_latency_remote function, and measures the latency of this model on a particular device/framework/task/etc.
By convention, it should return the time in milliseconds in the form {'latency': latency}, but returning some other metric is also fine. You can also include additional values, such as memory consumption: {'latency': latency, 'memory': memory}.
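A minimal sketch of this convention (the class name and the returned numbers are placeholders; a real implementation would run the model):
from typing import Dict

from enot_latency_server.server import LatencyServer


class DummyLatencyServer(LatencyServer):
    def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
        # Placeholder values; a real implementation would benchmark the model here.
        return {'latency': 12.3, 'memory': 456.0}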
Note
When something goes wrong, you should raise an aiohttp.web.<Exception> so that the client receives a correct response; see https://docs.aiohttp.org/en/latest/web_exceptions.html for more details.
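For example, the check below is a sketch (the class name and the validation are illustrative; only the use of aiohttp.web.HTTPBadRequest comes from aiohttp itself):
from typing import Dict

from aiohttp import web

from enot_latency_server.server import LatencyServer


class MyLatencyServer(LatencyServer):
    def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
        if not model:
            # Translated by aiohttp into an HTTP 400 response for the client.
            raise web.HTTPBadRequest(reason='Received an empty model')
        ...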
The client-server interaction is shown in the diagram:
Example: ONNX Runtime CPU Provider Latency
Extending the server:
import time
from typing import Dict

import numpy as np
import onnxruntime

from enot_latency_server.server import LatencyServer


class ONNXRuntimeCPULatencyServer(LatencyServer):
    def measure_latency(self, model: bytes, **kwargs) -> Dict[str, float]:
        # kwargs come from measure_latency_remote.
        sess = onnxruntime.InferenceSession(model, providers=['CPUExecutionProvider'])
        model_input = sess.get_inputs()[0]
        # Time a single inference on random data and report it in milliseconds.
        start = time.time()
        sess.run(None, {model_input.name: np.random.rand(*model_input.shape).astype(np.float32)})
        end = time.time()
        return {'latency': (end - start) * 1000.0}


server = ONNXRuntimeCPULatencyServer(host='192.168.0.100', port=5450)
server.run()
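Note that this example times a single inference on random input; in practice you would typically run a few warm-up inferences first and average several timed runs to get a more stable latency estimate.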
Client side:
import onnx

from enot_latency_server.client import measure_latency_remote

model = onnx.load('model.onnx')
latency = measure_latency_remote(
    model=model.SerializeToString(),
    host='192.168.0.100',
    port=5450,
)['latency']
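The dictionary returned by measure_latency_remote carries the values produced by the server's measure_latency implementation, so any additional metrics the server reports (for example 'memory' in the convention above) should be readable from the same dictionary.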