ONNX Server: Serving BERT as an API

Another use case is serving a BERT-like model behind a REST API endpoint.

To see if Rust could be more performant than Python, I served the ONNX model through Actix Web and, to benchmark it, built an equivalent Python server with FastAPI.
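For context, here is a minimal sketch of what the Rust side can look like, assuming actix-web and serde as dependencies. The route, the request/response types, and the predict helper are illustrative stand-ins, not the repo's actual code; a real handler would call the tokenizer and ONNX session inside predict:

```rust
use actix_web::{post, web, App, HttpServer, Responder};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct Input {
    text: String,
}

#[derive(Serialize)]
struct Prediction {
    label: String,
    score: f32,
}

// Hypothetical stand-in: the real handler would tokenize the text,
// run the ONNX session, and post-process the logits.
fn predict(_text: &str) -> Prediction {
    Prediction { label: "positive".into(), score: 0.99 }
}

#[post("/predict")]
async fn predict_route(input: web::Json<Input>) -> impl Responder {
    web::Json(predict(&input.text))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(predict_route))
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
}
```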

Performance

For a single-phrase request:

| | Python FastAPI | Rust Actix Web | Speedup |
| --- | --- | --- | --- |
| Encoding | 400 μs | 100 μs | |
| ONNX inference | ~10 ms | ~10 ms | |
| API overhead | ~2 ms | ~1 ms | |
| Mean latency | 12.8 ms | 10.4 ms | −20% ⏰ |
| Requests/sec | 77.5 req/s | 95 req/s | +22% 🔥 |

The performance gain comes from swapping Python libraries that are already considered “fast” for their Rust counterparts:

  • FastAPI ⏩ Actix Web
  • BertTokenizerFast ⏩ Rust Tokenizer (sketched below)
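To illustrate the tokenizer swap, here is a minimal sketch using the Hugging Face tokenizers crate, the Rust library underlying BertTokenizerFast. It assumes a tokenizer.json exported for the model; the file name and the phrase are placeholders:

```rust
use std::time::Instant;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Load the tokenizer definition exported from Hugging Face.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Time a single-phrase encoding, mirroring the Encoding row above.
    let start = Instant::now();
    let encoding = tokenizer.encode("This is a test phrase.", true)?;
    println!("ids: {:?}", encoding.get_ids());
    println!("encoded in {:?}", start.elapsed());
    Ok(())
}
```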

Thus, since Rust libraries tend to be faster than their Python equivalents, an application built as a composition of such libraries will be faster end to end.

That’s why I can see Rust being a good fit for highly performance-critical applications such as real-time deep learning, embedded deep learning, and large-scale AI servers! ❤️‍🦀

Check the code: https://github.com/haixuanTao/bert-onnx-rs-server