The MLPerf inference benchmark measures how fast a system can perform ML inference.
Each MLPerf Inference benchmark is defined by a model, a dataset, a quality target, and a latency constraint. The following table summarizes the five benchmarks in version 0.5 of the suite. The quality and latency targets are currently being finalized and will be posted soon.
| Area | Task | Model | Dataset | Quality Target | Latency Constraint |
| --- | --- | --- | --- | --- | --- |
| Vision | Image classification | Resnet50-v1.5 | ImageNet (224x224) | TBD | TBD |
| Vision | Image classification | MobileNets-v1 224 | ImageNet (224x224) | TBD | TBD |
| Vision | Object detection | SSD-ResNet34 | COCO (1200x1200) | TBD | TBD |
| Vision | Object detection | SSD-MobileNets-v1 | COCO (300x300) | TBD | TBD |
| Language | Machine translation | GNMT | WMT16 | TBD | TBD |
MLPerf inference benchmarks are executed via a load generator that issues queries to the ML model in patterns that represent real-world use cases. The “LoadGen” is provided in C++ with Python bindings, and is required for all submissions.
The LoadGen is responsible for:
- Generating the queries.
- Tracking the latency of queries.
- Validating the accuracy of the results.
- Computing final metrics.
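The four responsibilities above can be illustrated with a toy stand-in written in plain Python. This is only a sketch of the concept: the real LoadGen is a C++ library with Python bindings and a substantially different API; the class and method names here are invented for illustration.

```python
import statistics
import time

class ToyLoadGen:
    """Toy illustration of the LoadGen's responsibilities (not the real API)."""

    def __init__(self, queries, expected):
        self.queries = queries      # 1. queries to generate and issue
        self.expected = expected    # reference outputs used for accuracy checks
        self.latencies = []

    def run(self, model):
        correct = 0
        for query, reference in zip(self.queries, self.expected):
            start = time.perf_counter()
            result = model(query)                       # issue the query
            self.latencies.append(time.perf_counter() - start)  # 2. track latency
            if result == reference:                     # 3. validate accuracy
                correct += 1
        # 4. compute final metrics
        return {
            "accuracy": correct / len(self.queries),
            "p90_latency_s": statistics.quantiles(self.latencies, n=10)[-1],
        }
```

For example, `ToyLoadGen(inputs, references).run(model_fn)` returns the accuracy and 90th-percentile latency for a run; a real submission instead links against the provided LoadGen, which drives the system under test according to the chosen scenario.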
Scenarios and Metrics
To enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. A given scenario is evaluated by the LoadGen generating inference requests in a particular pattern and measuring a specific metric.
- Single-stream: Evaluates real-world scenarios such as a smartphone user taking a picture. For the test run, LoadGen sends an initial query, then continually sends the next query as soon as the previous query is processed. The metric is the 90th-percentile latency (90% of queries complete within that latency).
- Multi-stream: Evaluates real-world scenarios such as a multi-camera automotive system that detects obstacles. The LoadGen uses multiple test runs to determine the maximum number of streams the system can support while meeting the latency constraint. The metric is the number of streams supported.
- Server: Evaluates real-world scenarios such as a server in a datacenter that is servicing online requests. The LoadGen uses multiple test runs to determine the maximum throughput in queries per second (QPS) the system can support while meeting the latency constraint 90% of the time. The metric is QPS.
- Offline: Evaluates real-world scenarios such as a batch processing system. For the test run, LoadGen sends all queries at once. The metric is throughput.
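The scenario metrics above can be sketched as simple computations over raw timing data. These are simplified illustrations under assumed inputs (a list of per-query latencies, or a query count and total runtime); the real LoadGen implements its own, more involved bookkeeping.

```python
import math

def single_stream_metric(latencies_s):
    """90th-percentile latency: 90% of queries finish within this time."""
    ranked = sorted(latencies_s)
    idx = math.ceil(0.9 * len(ranked)) - 1
    return ranked[idx]

def offline_metric(num_queries, total_time_s):
    """Offline throughput: all queries sent at once; samples per second."""
    return num_queries / total_time_s

def server_meets_constraint(latencies_s, latency_bound_s):
    """Server mode: the latency bound must hold for at least 90% of queries."""
    within = sum(1 for t in latencies_s if t <= latency_bound_s)
    return within / len(latencies_s) >= 0.9
```

The multi-stream metric is found by repeating such a measurement while increasing the number of concurrent streams until the latency constraint fails; the largest passing stream count is reported.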
MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation. The Open division is intended to foster faster models and optimizers and allows using a different model or retraining.
MLPerf Inference encourages but does not require power measurements for wall-powered and battery-powered systems. If you intend to submit with power measurements, you must join the power working group.
The reference implementations for the benchmarks are here.
How to submit
If you intend to submit results, please read the submission rules carefully and join the inference submitters working group before you start work. In particular, you must notify the chair of the inference submitters working group five weeks ahead of the submission deadline as described in the submission rules.
The results are here.
If you use MLPerf in a publication, please cite this website or the MLPerf papers (forthcoming).