Preprocessing

Preprocessing for Deep Learning is unavoidable and can be very expensive. In the case of NLP, preprocessing boils down to tokenization.

To compare performance, I used the Hugging Face tokenizer, which is implemented in Rust, in three flavors: the native Python tokenizer, the Rust-backed Python tokenizer (bound via PyO3), and the native Rust tokenizer.

The code for the native Python tokenizer is as follows:

from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "bert-base-cased"

tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

# df is a pandas DataFrame whose "Title" column holds the phrases to tokenize
encoding = tokenizer(
    df["Title"].to_numpy().tolist(),
    add_special_tokens=True,
    max_length=60,
    return_token_type_ids=False,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="np",
)

The Rust-backed Python BertTokenizerFast is a drop-in replacement; only the tokenizer class changes:

from transformers import BertTokenizerFast

PRE_TRAINED_MODEL_NAME = "bert-base-cased"

tokenizer = BertTokenizerFast.from_pretrained(PRE_TRAINED_MODEL_NAME)

# Same call as above; df is the same pandas DataFrame
encoding = tokenizer(
    df["Title"].to_numpy().tolist(),
    add_special_tokens=True,
    max_length=60,
    return_token_type_ids=False,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="np",
)
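
Under the hood, BertTokenizerFast is a thin PyO3 binding over the same Rust tokenizers crate used below. As a side note (my own sketch, not part of the original comparison): recent versions of the crate can load the identical pretrained pipeline straight from the tokenizer.json file that the Python fast tokenizer can save, skipping the manual setup entirely:

use tokenizers::tokenizer::Tokenizer;

fn main() {
    // Assumes tokenizer.json was exported from Python with
    // tokenizer.save_pretrained(...), so both sides share one configuration
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();

    // Encode a single phrase with special tokens, as in the Python calls
    let encoding = tokenizer.encode("Some title", true).unwrap();
    println!("{:?}", encoding.get_ids());
}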

And the native Rust Hugging Face tokenizer, with the BERT pipeline assembled by hand:

use tokenizers::models::wordpiece::WordPieceBuilder;
use tokenizers::normalizers::bert::BertNormalizer;
use tokenizers::pre_tokenizers::bert::BertPreTokenizer;
use tokenizers::processors::bert::BertProcessing;
use tokenizers::tokenizer::AddedToken;
use tokenizers::tokenizer::Tokenizer;
use tokenizers::utils::padding::{PaddingDirection::Right, PaddingParams, PaddingStrategy::Fixed};
use tokenizers::utils::truncation::TruncationParams;
use tokenizers::utils::truncation::TruncationStrategy::LongestFirst;

fn main() {
    let vocab_path = "./src/vocab.txt";
    // Build the WordPiece model from the BERT vocabulary file
    let wp_builder = WordPieceBuilder::new()
        .files(vocab_path.into())
        .continuing_subword_prefix("##".into())
        .max_input_chars_per_word(100)
        .unk_token("[UNK]".into())
        .build()
        .unwrap();

    // Assemble the rest of the BERT pipeline around the WordPiece model
    let mut tokenizer = Tokenizer::new(Box::new(wp_builder));
    tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
    tokenizer.with_truncation(Some(TruncationParams {
        max_length: 60,
        strategy: LongestFirst,
        stride: 0,
    }));
    // 102 and 101 are the ids of [SEP] and [CLS] in the bert-base-cased vocab
    tokenizer.with_post_processor(Box::new(BertProcessing::new(
        ("[SEP]".into(), 102),
        ("[CLS]".into(), 101),
    )));
    // Clean text and handle Chinese chars; don't strip accents or lowercase
    tokenizer.with_normalizer(Box::new(BertNormalizer::new(true, true, false, false)));
    // Register the special tokens so they are never split by the model
    tokenizer.add_special_tokens(&[
        AddedToken {
            content: "[PAD]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[CLS]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[SEP]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[MASK]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
    ]);
    // Pad every sequence on the right to a fixed length of 60,
    // matching padding="max_length", max_length=60 in the Python calls
    tokenizer.with_padding(Some(PaddingParams {
        strategy: Fixed(60),
        direction: Right,
        pad_id: 0,
        pad_type_id: 0,
        pad_token: "[PAD]".into(),
    }));

    // ... (loading `df`, the batch of phrases to tokenize, is elided)

    // encode_batch returns one Encoding per phrase, already truncated and padded
    let encodings = tokenizer.encode_batch(df, true).unwrap();
}
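
The Encoding structs carry everything a model needs. As a minimal sketch (my own, using the tokenizers crate's Encoding accessors) of how to pull the raw buffers out of the batch:

use tokenizers::tokenizer::Encoding;

/// Flatten a batch of encodings into plain input-id and attention-mask
/// vectors, ready to be copied into a model's input tensors.
fn to_model_inputs(encodings: &[Encoding]) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
    let input_ids = encodings.iter().map(|e| e.get_ids().to_vec()).collect();
    let masks = encodings.iter().map(|e| e.get_attention_mask().to_vec()).collect();
    (input_ids, masks)
}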

Performance

                            Time per phrase    Speedup
Python BertTokenizer        1000μs             1x (baseline)
Python BertTokenizerFast    200-600μs          x2.5 🔥
Rust Tokenizer              50-150μs           x4 🔥

You can tokenize about 4 times faster in Rust than in Python, with the same Hugging Face tokenizer library.
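
For reference, here is a sketch of how such a per-phrase timing could be measured on the Rust side; the time_per_phrase helper is my own illustration, not the original benchmark code:

use std::time::Instant;
use tokenizers::tokenizer::Tokenizer;

/// Tokenize one batch and report the average time per phrase in microseconds.
fn time_per_phrase(tokenizer: &Tokenizer, phrases: Vec<String>) -> f64 {
    let n = phrases.len() as f64;
    let start = Instant::now();
    // add_special_tokens = true, matching the encode_batch call above
    let _encodings = tokenizer.encode_batch(phrases, true).unwrap();
    start.elapsed().as_micros() as f64 / n
}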

Preprocessing can be very performant in Rust, which makes a strong case that Rust can outperform Python for Deep Learning.