ONNX Server: Serving BERT as an API
Another use case is serving a BERT-like model behind a REST endpoint.
To see whether Rust could be more performant than Python here, I served the ONNX model through Actix Web, and to benchmark it, I built an equivalent server in Python with FastAPI.
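To give an idea of what the Rust side looks like, here is a minimal sketch of an Actix Web endpoint, assuming the Hugging Face `tokenizers` crate for encoding and a hypothetical `run_onnx` helper standing in for the ONNX Runtime session; the route, file names, and output struct are illustrative, the real setup lives in the linked repo.

```rust
use actix_web::{web, App, HttpServer, Responder};
use serde::{Deserialize, Serialize};
use tokenizers::Tokenizer;

#[derive(Deserialize)]
struct Input {
    text: String,
}

#[derive(Serialize)]
struct Prediction {
    label: String,
    score: f32,
}

// Hypothetical helper wrapping the ONNX Runtime session; the real project
// loads the exported BERT model once and reuses the session across requests.
fn run_onnx(_input_ids: &[u32]) -> Prediction {
    // ... run the session on the token ids and post-process the logits ...
    Prediction { label: "positive".into(), score: 0.99 }
}

async fn predict(body: web::Json<Input>, tokenizer: web::Data<Tokenizer>) -> impl Responder {
    // Rust tokenizer: the same fast backend that powers BertTokenizerFast.
    let encoding = tokenizer.encode(body.text.clone(), true).unwrap();
    web::Json(run_onnx(encoding.get_ids()))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Load the tokenizer once at startup and share it across workers.
    let tokenizer = web::Data::new(Tokenizer::from_file("tokenizer.json").unwrap());
    HttpServer::new(move || {
        App::new()
            .app_data(tokenizer.clone())
            .route("/predict", web::post().to(predict))
    })
    .bind(("0.0.0.0", 8080))?
    .run()
    .await
}
```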
Performance
For a request containing a single phrase:
| | Python FastAPI | Rust Actix Web | Speedup |
|---|---|---|---|
| Encoding | 400μs | 100μs | |
| ONNX inference | ~10ms | ~10ms | |
| API overhead | ~2ms | ~1ms | |
| Mean latency | 12.8ms | 10.4ms | -20% ⏰ |
| Requests/sec | 77.5 #/s | 95 #/s | +22% 🔥 |
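For context on how the latency and throughput rows could be measured, here is a minimal load-testing sketch in Rust; the endpoint URL, payload shape, and request counts are placeholder assumptions, not the exact setup behind the numbers above.

```rust
use std::time::Instant;

// Fires `total` POST requests with bounded concurrency against the /predict
// endpoint and reports throughput.
#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let total = 1_000;
    let concurrency = 8;
    let start = Instant::now();

    let mut handles = Vec::new();
    for _ in 0..concurrency {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            for _ in 0..(total / concurrency) {
                let _ = client
                    .post("http://127.0.0.1:8080/predict")
                    .json(&serde_json::json!({ "text": "a single phrase" }))
                    .send()
                    .await;
            }
        }));
    }
    for h in handles {
        let _ = h.await;
    }

    let elapsed = start.elapsed().as_secs_f64();
    println!("{:.1} requests/sec", total as f64 / elapsed);
}
```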
The performance gain comes from replacing Python libraries that are already considered “fast” with their Rust counterparts (a rough timing sketch for the tokenizer swap follows the list):
- FastAPI ⏩ Actix Web
- BertTokenizerFast ⏩ the Rust `tokenizers` crate
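Here is a rough, self-contained way to time the encoding step on its own, assuming a local `tokenizer.json` exported from the same BERT checkpoint; the file name and phrase are placeholders, not from the original benchmark.

```rust
use std::time::Instant;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Rough way to reproduce the per-phrase encoding timing on your own machine.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let phrase = "Rust keeps the fast tokenizer, but drops the Python overhead.";

    // Warm up once, then time a batch of single-phrase encodings.
    tokenizer.encode(phrase, true)?;
    let runs: u32 = 1_000;
    let start = Instant::now();
    for _ in 0..runs {
        tokenizer.encode(phrase, true)?;
    }
    println!("mean encoding time: {:?}", start.elapsed() / runs);
    Ok(())
}
```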
Thus, since Rust libraries tend to be faster than their Python equivalents, an application that is essentially a composition of such libraries ends up faster when written in Rust.
That’s why I can see Rust being a good fit for extremely performance-centric applications such as real-time deep learning, embedded deep learning, and large-scale AI servers! ❤️🦀
Check the code: https://github.com/haixuanTao/bert-onnx-rs-server