Hey 🤗 Welcome to my Blog 🔥

Software Engineer who believes in Rust for ML/AI.

Get in touch 📬 tao.xavier@outlook.com

Deep Learning in Rust

Introduction

Building Deep Learning algorithms is paramount for doing Data Science in Rust. In this post, I show:

  • how Rust can support GPUs.
  • whether Rust can outperform Python, and by how much.
  • good and bad use cases for Deep Learning in Rust.

State of the art of Deep Learning in Rust

Deep Learning in the Rust ecosystem is split between native libraries like linfa and bindings to common C++ libraries like TensorFlow, PyTorch and ONNX Runtime.

I have found onnxruntime-rs to be a convenient crate for DL, offering:

  • the ability to load sklearn, TensorFlow and PyTorch models exported to ONNX.
  • better performance than native PyTorch or TensorFlow.
  • a small bundle size of ~30 MB, compared to the 1.2 GB bundle of tch-rs.

➡️ This post is therefore going to be based on onnxruntime-rs.


This blog was originally published here: https://able.bio/haixuanTao/deep-learning-in-rust-with-gpu--26c53a7f

Hardware

Initially, onnxruntime-rs did not support GPU / CUDA, even though the underlying ONNX Runtime C API does.

But by tweaking onnxruntime-rs, I could use the GPU C API and run DL inference on GPU.

I opened a PR: https://github.com/nbigaouette/onnxruntime-rs/pull/87 providing CUDA support for Linux and Windows.

With similar work, most other hardware accelerators could be supported as well.

GPU Support

To enable GPU support, I had to:

  • add 2 header files in bindgen's wrapper.h file as follows:
#include "onnxruntime_c_api.h"
#if !defined(__APPLE__)
  #include "cpu_provider_factory.h"
  #include "cuda_provider_factory.h"
#endif
  • add a feature flag (Cargo features are declared in the [features] table):
[features]
cuda = []
  • add a safe API to the newly added bindings:
    /// Set the session to use cpu
    #[cfg(feature = "cuda")]
    pub fn use_cpu(self, use_arena: i32) -> Result<SessionBuilder<'a>> {
        unsafe {
            sys::OrtSessionOptionsAppendExecutionProvider_CPU(self.session_options_ptr, use_arena);
        }
        Ok(self)
    }

    /// Set the session to use cuda
    #[cfg(feature = "cuda")]
    pub fn use_cuda(self, device_id: i32) -> Result<SessionBuilder<'a>> {
        unsafe {
            sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, device_id);
        }
        Ok(self)
    }
  • Generate bindings for Linux:
>>> cargo build --package onnxruntime-sys --features "generate-bindings cuda" --target x86_64-unknown-linux-gnu
  • Generate bindings for Windows through a Windows VM:
>>> cargo build --features "generate-bindings cuda" --target x86_64-pc-windows-msvc
  • Modify the GitHub CI to test the build automatically:
      - name: Download prebuilt archive (GPU, x86_64-unknown-linux-gnu)
        uses: actions-rs/cargo@v1
        with:
          command: build
          args: --target x86_64-unknown-linux-gnu --features cuda
      - name: Verify prebuilt archive downloaded (GPU, x86_64-unknown-linux-gnu)
        run: ls -lh target/x86_64-unknown-linux-gnu/debug/build/onnxruntime-sys-*/out/onnxruntime-linux-x64-gpu-1.*.tgz
      # ******************************************************************
      - name: Download prebuilt archive (GPU, x86_64-pc-windows-msvc)
        uses: actions-rs/cargo@v1
        with:
          command: build
          args: --target x86_64-pc-windows-msvc --features cuda
      - name: Verify prebuilt archive downloaded (GPU, x86_64-pc-windows-msvc)
        run: ls -lh target/x86_64-pc-windows-msvc/debug/build/onnxruntime-sys-*/out/onnxruntime-win-gpu-x64-1.*.zip
  • As well as documentation.

Performance

|               | Time per phrase | Speedup |
|---------------|-----------------|---------|
| Rust ONNX CPU | ~125 ms         |         |
| Rust ONNX GPU | ~10 ms          | x12 🔥  |

Note: I have a six-core CPU and a GTX 1050 GPU.

As expected, the GPU drastically reduced the inference time.

However, I did not find a significant speedup between ONNX Runtime in Rust and ONNX Runtime in Python.

Preprocessing

Preprocessing for Deep Learning is inevitable and can be very expensive. In the case of NLP, preprocessing translates to tokenizing.

To compare performance, I used the Hugging Face tokenizer, which exists as a pure Python implementation, as a native Rust crate, and as a Rust implementation exposed to Python through PyO3.

The code is as follows for the native Python tokenizer:

from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "bert-base-cased"

tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

encoding = tokenizer(
        df["Title"].to_numpy().tolist(),
        add_special_tokens=True,
        max_length=60,
        return_token_type_ids=False,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="np",
    )

The Rust-backed Python tokenizer (BertTokenizerFast):

from transformers import BertTokenizerFast

PRE_TRAINED_MODEL_NAME = "bert-base-cased"

tokenizer = BertTokenizerFast.from_pretrained(PRE_TRAINED_MODEL_NAME)

encoding = tokenizer(
        df["Title"].to_numpy().tolist(),
        add_special_tokens=True,
        max_length=60,
        return_token_type_ids=False,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="np",
    )

And, the native Rust HuggingFace Tokenizer:


use tokenizers::models::wordpiece::WordPieceBuilder;
use tokenizers::normalizers::bert::BertNormalizer;
use tokenizers::pre_tokenizers::bert::BertPreTokenizer;
use tokenizers::processors::bert::BertProcessing;
use tokenizers::tokenizer::AddedToken;
use tokenizers::tokenizer::{EncodeInput, Encoding, Tokenizer};
use tokenizers::utils::padding::{PaddingDirection::Right, PaddingParams, PaddingStrategy::Fixed};
use tokenizers::utils::truncation::TruncationParams;
use tokenizers::utils::truncation::TruncationStrategy::LongestFirst;

fn main() -> std::result::Result<(), OrtError> {
    let vocab_path = "./src/vocab.txt";
    let wp_builder = WordPieceBuilder::new()
        .files(vocab_path.into())
        .continuing_subword_prefix("##".into())
        .max_input_chars_per_word(100)
        .unk_token("[UNK]".into())
        .build()
        .unwrap();

    let mut tokenizer = Tokenizer::new(Box::new(wp_builder));
    tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
    tokenizer.with_truncation(Some(TruncationParams {
        max_length: 60,
        strategy: LongestFirst,
        stride: 0,
    }));
    tokenizer.with_post_processor(Box::new(BertProcessing::new(
        ("[SEP]".into(), 102),
        ("[CLS]".into(), 101),
    )));
    tokenizer.with_normalizer(Box::new(BertNormalizer::new(true, true, false, false)));
    tokenizer.add_special_tokens(&[
        AddedToken {
            content: "[PAD]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[CLS]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[SEP]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
        AddedToken {
            content: "[MASK]".into(),
            single_word: false,
            lstrip: false,
            rstrip: false,
        },
    ]);
    tokenizer.with_padding(Some(PaddingParams {
        strategy: Fixed(60),
        direction: Right,
        pad_id: 0,
        pad_type_id: 0,
        pad_token: "[PAD]".into(),
    }));

    // ...
    
    let input_ids = tokenizer.encode_batch(df, true).unwrap();
    
    Ok(())
}

Performance

|                          | Time per phrase | Speedup |
|--------------------------|-----------------|---------|
| Python BertTokenizer     | 1000 μs         |         |
| Python BertTokenizerFast | 200-600 μs      | x2.5 🔥 |
| Rust Tokenizer           | 50-150 μs       | x4 🔥   |

You can tokenize 4 times faster in Rust than Python, with the same Hugging Face Tokenizer library.

Preprocessing can be very performant in Rust, making a case that Rust can outperform Python for Deep Learning.
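
A "time per phrase" figure like the ones above can be obtained by timing a whole batch and dividing by its size. The following is only a minimal std-only sketch of that measurement: `count_tokens` is a hypothetical stand-in for a real tokenizer, and `time_per_phrase` is a helper name invented for this example, not part of any library.

```rust
use std::time::{Duration, Instant};

/// Stand-in for a real tokenizer (hypothetical): counts whitespace-separated tokens.
fn count_tokens(phrase: &str) -> usize {
    phrase.split_whitespace().count()
}

/// Runs `f` on every phrase; returns the outputs and the mean wall-clock time per phrase.
fn time_per_phrase(phrases: &[&str], f: impl Fn(&str) -> usize) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    let outputs: Vec<usize> = phrases.iter().map(|phrase| f(*phrase)).collect();
    let elapsed = start.elapsed();
    // Guard against an empty batch to avoid dividing by zero.
    let per_phrase = elapsed / phrases.len().max(1) as u32;
    (outputs, per_phrase)
}

fn main() {
    let phrases = ["Deep Learning in Rust", "Hello world"];
    let (counts, per_phrase) = time_per_phrase(&phrases, count_tokens);
    println!("token counts: {:?}, mean time per phrase: {:?}", counts, per_phrase);
}
```

For stable numbers, the batch should be large (thousands of phrases) and the binary built in release mode; a two-phrase batch like this one only demonstrates the mechanics.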

Batch inference: Running BERT on 10k phrases.

At work, we often develop Deep Learning models to be used on large batches of data.

To see if Rust can improve this use case, I trained a BERT-like model and ran inference on 10k phrases using Python and Rust.

Performance

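| 10k phrases  | Python | Rust    |
|--------------|--------|---------|
| Booting      | 4 s    | 1 s     |
| Encoding     | 0.7 s  | 0.3 s   |
| DL Inference | 75 s   | 75 s    |
| Total        | 80 s   | 76 s    |
| Memory usage | 1 GiB  | 0.7 GiB |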

As DL inference takes the majority of the time, Rust only marginally improves performance.

This is an example of a bad use case for Rust, as most of the time is spent in the C API, which Rust does not speed up.
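
The ceiling on what Rust can gain here follows from simple arithmetic (Amdahl's law): if 75 s out of roughly 80 s are spent in inference that Rust cannot accelerate, speeding up the rest has a bounded effect. A small sketch using the rounded figures from the table (the split into 75 s fixed and about 5 s variable is my rounding of the table, and `total_time` is a helper invented for this example):

```rust
/// Total run time when only the non-inference part is sped up by `factor`.
/// `fixed_s` is the time spent in code Rust cannot accelerate (here, ONNX inference).
fn total_time(fixed_s: f64, variable_s: f64, factor: f64) -> f64 {
    fixed_s + variable_s / factor
}

fn main() {
    // Rounded figures from the table: 75 s of inference, about 5 s of everything else in Python.
    let (fixed, variable) = (75.0, 5.0);
    let python = total_time(fixed, variable, 1.0);
    for factor in [2.0, 4.0, 100.0] {
        let rust = total_time(fixed, variable, factor);
        println!("x{} on the rest -> overall speedup x{:.3}", factor, python / rust);
    }
    // Upper bound, reached only if everything except inference took no time at all.
    println!("ceiling: x{:.3}", python / fixed);
}
```

Even an infinitely fast boot and encoding step would leave the overall speedup below x1.07 for this workload.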

You can check out the code for this specific job at: https://github.com/haixuanTao/bert-onnx-rs-pipeline

ONNX Server: Serving BERT as an API

Another use case is serving a BERT-like model as a server with a REST endpoint.

To see if Rust could be more performant than Python, I served the onnx model through actix-web, and to benchmark it, I made a clone in Python with FastAPI.

Performance

For a request of one phrase:

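|                | Python FastAPI | Rust Actix Web | Speedup |
|----------------|----------------|----------------|---------|
| Encoding       | 400 μs         | 100 μs         |         |
| ONNX Inference | ~10 ms         | ~10 ms         |         |
| API overhead   | ~2 ms          | ~1 ms          |         |
| Mean Latency   | 12.8 ms        | 10.4 ms        | -20% ⏰ |
| Requests/sec   | 77.5 #/s       | 95 #/s         | +22% 🔥 |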

The gain in performance comes from moving from Python libraries that are considered “fast” to their Rust counterparts:

  • FastAPI ⏩ Actix Web
  • BertTokenizerFast ⏩ Rust Tokenizer

Thus, as Rust libraries tend to be faster than Python ones, Rust will be faster when the application is a composition of libraries.
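
As a sanity check on the table, latency and throughput should roughly agree when requests are sent one after another. The sketch below assumes a single client sending requests back to back (an assumption on my part; the benchmark setup is not described in detail), in which case throughput is simply the inverse of the mean latency:

```rust
/// Requests per second for a single client sending requests back to back.
fn requests_per_sec(mean_latency_ms: f64) -> f64 {
    1000.0 / mean_latency_ms
}

fn main() {
    let python = requests_per_sec(12.8);
    let rust = requests_per_sec(10.4);
    println!("Python: {:.1} req/s, Rust: {:.1} req/s", python, rust);
    println!("throughput gain: {:+.0}%", (rust / python - 1.0) * 100.0);
}
```

The values derived this way are close to the measured 77.5 and 95 requests per second, which suggests the two rows of the table are consistent with each other.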

That’s why I can see Rust being a good fit for extremely performance-centric applications such as Real-Time Deep Learning, Embedded Deep Learning, and Large-Scale AI servers! ❤️‍🦀

Check the code: https://github.com/haixuanTao/bert-onnx-rs-server

In conclusion, should you use Rust for Deep Learning?

  • Like the whole Rust ecosystem, use it if you need performance and resilience! 🚀 But be aware that using Rust does not automatically make things fast!
  • If you need quick prototyping in a language that is friendly to Data Scientists, you are better off with Python!

Pandas vs Polars

Introduction

Everyone loves the API of Pandas. It’s fast, easy, and well documented. There are some rough edges, but most times, it’s just a blast.

Now, when it comes to production, Pandas is slightly trickier. Pandas does not scale very well… there is no multithreading… It’s not thread-safe… It’s not memory efficient.

But solving all those problems is the raison d’être of Rust.

What if there were a DataFrame API written in Rust that solved all those issues while keeping a nice API?


This blog was originally published here: https://able.bio/haixuanTao/data-manipulation-polars-vs-rust--3def44c8

Polars

Well, Polars lets you read, write, filter, apply functions, group by and merge, all with an API similar to Pandas but in Rust.

It uses Apache Arrow, a data framework purpose-built for efficient data processing and data sharing across languages.

3 reasons for choosing Polars

Reason #1. Performance.

It’s killing it performance-wise.

Reason #2. The API is straightforward.

Do you want to mutate the data? Use apply. Do you want to filter the data? Use filter. Do you want to merge? Use join. You won’t need Rust syntax like struct, derive, impl…

Reason #3. No troubles with the borrow checker.

It uses Arc-Mutex, which means that you can clone variables as much as you like. Variables are only references to in-memory data. No more fighting with the borrow checker. Mutability is limited to the API calls, which preserves the consistency/thread-safety of the data.
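
To make the Arc-Mutex idea concrete, here is a minimal std-only sketch of the general pattern. `SharedColumn` is a hypothetical type written for this illustration; it is not how Polars is actually implemented internally.

```rust
use std::sync::{Arc, Mutex};

/// A cheap-to-clone handle to a shared column (hypothetical type, for illustration only).
#[derive(Clone)]
struct SharedColumn {
    data: Arc<Mutex<Vec<i64>>>,
}

impl SharedColumn {
    fn new(values: Vec<i64>) -> Self {
        SharedColumn { data: Arc::new(Mutex::new(values)) }
    }

    /// Mutation goes through the API, which takes the lock internally.
    fn push(&self, value: i64) {
        self.data.lock().unwrap().push(value);
    }

    fn sum(&self) -> i64 {
        self.data.lock().unwrap().iter().sum()
    }
}

fn main() {
    let column = SharedColumn::new(vec![1, 2, 3]);
    let alias = column.clone(); // clones the reference, not the underlying data
    alias.push(4);
    println!("sum seen through the first handle: {}", column.sum());
}
```

Cloning only bumps a reference count, so both handles see the same data, and callers never hold a mutable borrow themselves, which is why the borrow checker stays out of the way.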

3 caveats of Polars

Caveat #1. Issues…

Building a DataFrame API is hard. Pandas took 12 years to reach 1.0.0. And, as Polars is rather young, you may face unexpected issues. In my case, there were issues with \n characters, double-quote characters, and long UTF-8 strings.

On the other hand, those are great first issues for getting started with contributing and getting better at Rust 🔨.

Caveat #2. Getting comfortable with two APIs: Polars and Arrow.

As much of the heavy lifting is done by the Apache Arrow backend, you’ll have to get used to reading the documentation of both Polars and Apache Arrow. Both are pretty straightforward, but it might feel tiring for someone who was looking for a drop-in replacement for Pandas.

Caveat #3. Compile time…

Sadly, compiling takes around 3 min uncached, and it uses a lot of resources.

Case Study

Now the question is: is it better than native Rust, as I’ve explained in my previous blog post?

Let’s do a hands-on comparison on a data pipeline and get a feel for it.

In this case study, I’m going to use the Stack Overflow Kaggle dataset. I’m going to read the data, parse the dates, merge the first tag with the Wikipedia comparison of programming languages, group by the status of the question asked, and retrieve the distribution of language features within each ‘status’ of questions.

We’ll compare the Polars API and native Rust generic heap structures for this task.

  • I’ll go slightly quicker on the native Rust, as I already put more details here.
  • Multithreading is done on 12 threads (Intel(R) Core(TM) i7-8750H, 20 GB RAM).
  • The dataset is 4.2 GB for around 3.6 million rows.

Reading

Reading in Polars

Reading in Polars is pretty straightforward:

use polars::prelude::*;

//...

    let mut df = CsvReader::from_path(path)?
        .with_n_threads(Some(1)) // comment for multithreading
        .with_encoding(CsvEncoding::LossyUtf8)
        .has_header(true)
        .finish()?;

Reading in Native Rust

Reading in Rust using csv and serde requires that you already have a struct; in my case, the struct is utils::NativeDataFrame:

    let file = File::open(path)?;

    let mut rdr = csv::ReaderBuilder::new().delimiter(b',').from_reader(file);
    let mut records: Vec<utils::NativeDataFrame> = rdr
        .deserialize()
        .into_iter()
        .filter_map(|result| match result {
            Ok(rec) => rec,
            Err(e) => None,
        })
        .collect();
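
The `filter_map` step above keeps the rows that deserialize and drops the rest. The same idea can be shown without any crate. This is a deliberately naive sketch: `Row`, `parse_row` and `read_rows` are names made up for this example, and splitting on the last comma is not real CSV parsing (no quoting or escaping), which is exactly what the csv crate handles for you.

```rust
/// Hypothetical two-column record, standing in for utils::NativeDataFrame.
#[derive(Debug, PartialEq)]
struct Row {
    title: String,
    score: i64,
}

/// Parses one "title,score" line; returns Err on malformed input.
fn parse_row(line: &str) -> Result<Row, String> {
    let (title, score) = line
        .rsplit_once(',')
        .ok_or_else(|| format!("missing delimiter: {}", line))?;
    let score = score.trim().parse::<i64>().map_err(|e| e.to_string())?;
    Ok(Row { title: title.to_string(), score })
}

/// Keeps the rows that parse and silently drops the others.
fn read_rows(input: &str) -> Vec<Row> {
    input.lines().filter_map(|line| parse_row(line).ok()).collect()
}

fn main() {
    let rows = read_rows("rust,10\nbroken line\npython,7");
    println!("{:?}", rows);
}
```

Dropping bad rows silently is convenient for a benchmark, but in a real pipeline it is usually worth at least counting or logging the rejected lines.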

Performance

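|                             | Time   | Speedup vs Pandas |
|-----------------------------|--------|-------------------|
| Native Rust (Single thread) | 12 s   | 2.4x              |
| Polars (Single thread)      | 19 s   | 1.5x              |
| Polars (Multithread)        | 6.6 s  | 4.5x              |
| Pandas                      | 29.6 s |                   |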

For reading, multithreaded Polars is faster than both Pandas and native Rust. On a single thread, Polars still beats Pandas, but native Rust is faster.

Apply

Applying Function in Polars

To apply a function in Polars, you can use the default apply or may_apply. I prefer the latter.

fn str_to_date(dates: &Series) -> std::result::Result<Series, PolarsError> {
    let fmt = Some("%m/%d/%Y %H:%M:%S");

    Ok(dates.utf8()?.as_date64(fmt)?.into_series())
}

fn count_words(texts: &Series) -> std::result::Result<Series, PolarsError> {
    Ok(texts
        .utf8()?
        .into_iter()
        .map(|opt_text: Option<&str>| {
            opt_text.map(|text: &str| text.split(" ").count() as u64)
        })
        .collect::<UInt64Chunked>()
        .into_series())
}

// ...

    // Apply Format Date
    df.may_apply("PostCreationDate", str_to_date)?;

    let t_formatting = Instant::now();

    // Apply Custom counting words in string
    df.may_apply("BodyMarkdown", count_words)?;

Note that parallel apply is not yet implemented for utf8 series.

Applying Function in Native Rust

What I like about native Rust mutation is that the syntax is standard across iterators, so once you get comfortable with it, you can apply it everywhere 😀

use chrono::{DateTime, NaiveDate, NaiveDateTime, NaiveTime};
// use rayon::prelude::*;  for multithreads

    // Apply Format Date
    let fmt = "%m/%d/%Y %H:%M:%S";

    records
        .iter_mut() // .par_iter_mut() for multithreads
        .for_each(|record: &mut utils::NativeDataFrame| {
            record.PostCreationDatetime = match DateTime::parse_from_str(
                record.PostCreationDate.as_ref().unwrap(),
                fmt,
            ) {
                Ok(dates) => Some(dates),
                Err(_) => None,
            }
        });

    // Apply Custom Formatting counting words in string
    records
        .iter_mut() // .par_iter_mut() for multithreads
        .for_each(|record: &mut utils::NativeDataFrame| {
            record.CountWords =
                Some(record.BodyMarkdown.as_ref().unwrap().split(' ').count() as f64)
        });

Performance for formatting dates

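|                             | Time    | Speedup vs Pandas |
|-----------------------------|---------|-------------------|
| Native Rust (Single thread) | 0.98 s  | 8x                |
| Native Rust (Multithread)   | 0.148 s | 52x               |
| Polars (Single thread)      | 0.88 s  | 8.8x              |
| Pandas                      | 7.8 s   |                   |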

Performance for counting words

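|                             | Time   | Speedup vs Pandas |
|-----------------------------|--------|-------------------|
| Native Rust (Single thread) | 9 s    | 2.7x              |
| Native Rust (Multithread)   | 1.3 s  | 19x               |
| Polars (Single thread)      | 9 s    | 2.7x              |
| Pandas                      | 24.8 s |                   |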

Polars does not seem to offer increased performance over the standard library on a single thread, and I couldn’t find a way to do a multi-threaded apply… In this scenario, I’ll prefer native Rust.
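
The multithreaded native Rust numbers above come from rayon’s `par_iter_mut`. To show what such a parallel apply does under the hood, here is a std-only sketch that splits the rows into chunks and counts words on one thread per chunk using `std::thread::scope`. It is an illustration of the idea, not the code used for the benchmark, and `par_count_words` is a name invented for this example.

```rust
use std::thread;

/// Counts space-separated words, like the word-count closure above.
fn count_words(text: &str) -> u64 {
    text.split(' ').count() as u64
}

/// Splits `texts` into chunks and counts words in each chunk on its own thread.
fn par_count_words(texts: &[String], n_threads: usize) -> Vec<u64> {
    // Ceiling division so that every row lands in exactly one chunk.
    let n_threads = n_threads.max(1);
    let chunk_size = ((texts.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = texts
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|t| count_words(t)).collect::<Vec<u64>>())
            })
            .collect();
        // Joining in spawn order keeps the counts aligned with the input rows.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().unwrap())
            .collect()
    })
}

fn main() {
    let texts: Vec<String> = vec!["a b c".into(), "hello".into(), "x y".into()];
    println!("{:?}", par_count_words(&texts, 2));
}
```

Because each row is processed independently, the work divides cleanly across threads, which is consistent with the near-linear speedups in the tables above.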

Merging

Merging in Polars

Merging in Polars is dead easy, although the number of strategies for filling None values is limited for now.

    df = df
        .join(&df_wikipedia, "Tag1", "Language", JoinType::Left)?
        .fill_none(FillNoneStrategy::Min)?;

Merging in Native Rust

Merging in native Rust can be done with nested structure and pairing with a Hashmap:

let mut hash_wikipedia: &HashMap<&String, &utils::WikiDataFrame> = &records_wikipedia
    .iter()
    .map(|record| (record.Language.as_ref().unwrap(), record))
    .collect();

records.iter_mut().for_each(|record| {
    record.Wikipedia = match hash_wikipedia.get(&record.Tag1.as_ref().unwrap()) {
        Some(wikipedia) => Some(wikipedia.clone().clone()),
        None => None,
    }
});
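
The same hash-join idea, reduced to a self-contained sketch with simplified types. `Language`, `Question` and `left_join` are hypothetical names for this example and do not match the structs in the benchmark repository.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Language {
    name: String,
    functional: bool,
}

#[derive(Debug)]
struct Question {
    tag: String,
    language: Option<Language>,
}

/// Left join: every question is kept; `language` stays None when no match exists.
fn left_join(questions: &mut [Question], languages: &[Language]) {
    // Build the lookup table once, then probe it once per question.
    let by_name: HashMap<&str, &Language> =
        languages.iter().map(|l| (l.name.as_str(), l)).collect();
    for question in questions.iter_mut() {
        question.language = by_name.get(question.tag.as_str()).map(|l| (*l).clone());
    }
}

fn main() {
    let languages = vec![Language { name: "rust".into(), functional: true }];
    let mut questions = vec![
        Question { tag: "rust".into(), language: None },
        Question { tag: "cobol".into(), language: None },
    ];
    left_join(&mut questions, &languages);
    println!("{:?}", questions);
}
```

Using `Option::map` on the lookup result avoids both the `match` and the `unwrap`, so questions without a tag match simply keep `None`.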

Performance

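|                             | Time    | Speedup vs Pandas |
|-----------------------------|---------|-------------------|
| Native Rust (Single thread) | 0.680 s | 6.3x              |
| Native Rust (Multithread)   | 0.215 s | 20x               |
| Polars                      | 0.543 s | 8x                |
| Pandas                      | 4.347 s |                   |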

For merging, having a nested structure with None values can be very verbose. So, I’ll recommend using Polars for merging.

I’m not sure whether Polars merges on multiple threads, but it seems to be multithreaded by default.

Groupby

Group By in Polars

Grouping by in Polars is pretty easy.

    // Groupby series as a clone of reference
    let groupby_series = vec![
        df.column("OpenStatus")?.clone(),
    ];

    let target_column = vec![
        "ReputationAtPostCreation",
        "OwnerUndeletedAnswerCountAtPostTime",
        "Imperative",
        "Object-oriented",
        "Functional",
        "Procedural",
        "Generic",
        "Reflective",
        "Event-driven",
    ];

    let groups = df
        .groupby_with_series(groupby_series, false)?
        .select(target_column)
        .mean()?;

Group By in Native Rust

However, it is quite tricky in native Rust. To do a group by in a thread-safe manner, you’ll need to use a HashMap with the fold method. Note that parallel folds are slightly more complicated, as folding requires passing data between threads.

    let groups_hash: HashMap<String, (utils::GroupBy, i16)> = records
        .iter() // .par_iter()
        .fold(
            HashMap::new(), // || HashMap::new()
            |mut hash_group: HashMap<String, (utils::GroupBy, i16)>, record| {
                let group: utils::GroupBy = if let Some(wiki) = &record.Wikipedia {
                    utils::GroupBy {
                        status: record.OpenStatus.as_ref().unwrap().to_string(),
                        ReputationAtPostCreation: record.ReputationAtPostCreation.unwrap(),
                        OwnerUndeletedAnswerCountAtPostTime: record
                            .OwnerUndeletedAnswerCountAtPostTime
                            .unwrap(),
                        Imperative: wiki.Imperative.unwrap(),
                        ObjectOriented: wiki.ObjectOriented.unwrap(),
                        Functional: wiki.Functional.unwrap(),
                        Procedural: wiki.Procedural.unwrap(),
                        Generic: wiki.Generic.unwrap(),
                        Reflective: wiki.Reflective.unwrap(),
                        EventDriven: wiki.EventDriven.unwrap(),
                    }
                } else {
                    utils::GroupBy {
                        status: record.OpenStatus.as_ref().unwrap().to_string(),
                        ReputationAtPostCreation: record.ReputationAtPostCreation.unwrap(),
                        OwnerUndeletedAnswerCountAtPostTime: record
                            .OwnerUndeletedAnswerCountAtPostTime
                            .unwrap(),
                        ..Default::default()
                    }
                };
                if let Some((previous, count)) = hash_group.get_mut(&group.status.to_string()) {
                    *previous = previous.clone() + group;
                    *count += 1;
                } else {
                    hash_group.insert(group.status.to_string(), (group, 1));
                };
                hash_group
            },
        ); // }
           // .reduce(
           //     || HashMap::new(),
           //     |prev, other| {
           //         let set1: HashSet<String> = prev.keys().cloned().collect();
           //         let set2: HashSet<String> = other.keys().cloned().collect();
           //         let unions: HashSet<String> = set1.union(&set2).cloned().collect();
           //         let mut map = HashMap::new();
           //         for key in unions.iter() {
           //             map.insert(
           //                 key.to_string(),
           //                 match (prev.get(key), other.get(key)) {
           //                     (Some((previous, count_prev)), Some((group, count_other))) => {
           //                         (previous.clone() + group.clone(), count_prev + count_other)
           //                     }
           //                     (Some(previous), None) => previous.clone(),
           //                     (None, Some(other)) => other.clone(),
           //                     (None, None) => (utils::GroupBy::new(), 0),
           //                 },
           //             );
           //         }
           //         map
           //     },
           // );

    let groups: Vec<utils::GroupBy> = groups_hash
        .iter()
        .map(|(_, (group, count))| utils::GroupBy {
            status: group.status.to_string(),
            ReputationAtPostCreation: group.ReputationAtPostCreation / count.clone() as f64,
            OwnerUndeletedAnswerCountAtPostTime: group.OwnerUndeletedAnswerCountAtPostTime
                / count.clone() as f64,
            Imperative: group.Imperative / count.clone() as f64,
            ObjectOriented: group.ObjectOriented / count.clone() as f64,
            Functional: group.Functional / count.clone() as f64,
            Procedural: group.Procedural / count.clone() as f64,
            Generic: group.Generic / count.clone() as f64,
            Reflective: group.Reflective / count.clone() as f64,
            EventDriven: group.EventDriven / count.clone() as f64,
        })
        .collect();

Uncomment the commented-out lines for multithreading.
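
The fold-then-reduce structure is easier to see on a reduced example. The sketch below is std-only and uses a single numeric column: each chunk is folded into a partial map of (sum, count) per key, partial maps are merged, and the mean is computed at the end. `Partial`, `fold_chunk`, `merge` and `means` are names invented for this illustration.

```rust
use std::collections::HashMap;

/// Running (sum, count) per group key.
type Partial = HashMap<String, (f64, u32)>;

/// Fold step: accumulate one chunk of (status, value) rows.
fn fold_chunk(rows: &[(&str, f64)]) -> Partial {
    rows.iter().fold(Partial::new(), |mut acc, (status, value)| {
        let entry = acc.entry(status.to_string()).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
        acc
    })
}

/// Reduce step: merge two partial results, as needed after a parallel fold.
fn merge(mut left: Partial, right: Partial) -> Partial {
    for (key, (sum, count)) in right {
        let entry = left.entry(key).or_insert((0.0, 0));
        entry.0 += sum;
        entry.1 += count;
    }
    left
}

/// Final step: turn (sum, count) into a mean per group.
fn means(partial: &Partial) -> HashMap<String, f64> {
    partial
        .iter()
        .map(|(key, (sum, count))| (key.clone(), sum / f64::from(*count)))
        .collect()
}

fn main() {
    let first = fold_chunk(&[("open", 10.0), ("closed", 4.0)]);
    let second = fold_chunk(&[("open", 20.0)]);
    println!("{:?}", means(&merge(first, second)));
}
```

Merging into the left map with the entry API avoids building the union of key sets by hand, which is the part that makes the commented-out reduce above so long.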

Performance

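|                             | Time    | Speedup vs Pandas |
|-----------------------------|---------|-------------------|
| Native Rust (Single thread) | 0.536 s | 2x                |
| Native Rust (Multithread)   | 0.115 s | 9.5x              |
| Polars (Single thread)      | 0.131 s | 8.3x              |
| Polars (Multithread)        | 0.125 s | 8.8x              |
| Pandas                      | 1.1 s   |                   |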

Group By and Merging are the ideal case for Polars. You’ll get 8x more performance than Pandas on a single thread, and Polars handles multithreading, although in my case, it didn’t matter much.

Native Rust can do it as well, but judging by the size of the code, it is not an ideal use case.

Conclusion

Performance overall

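|                              | Time   | Speedup vs Pandas |
|------------------------------|--------|-------------------|
| Native Rust (Single thread)  | 24 s   | 3.3x              |
| Native Rust (Multithread)    | 13.7 s | 5.8x              |
| Polars (Single thread)       | 30 s   | 2.6x              |
| Polars (Multithread)         | 17 s   | 4.7x              |
| Polars (lazy, Multithreaded) | 16.5 s | 4.8x              |
| Pandas                       | 80 s   |                   |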

As reading is IO bound, I also wanted a benchmark of pure compute performance, excluding reading.

Performance without Reading

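|                             | Time  | Speedup vs Pandas |
|-----------------------------|-------|-------------------|
| Native Rust (Single thread) | 12 s  | 3.3x              |
| Native Rust (Multithread)   | 1.7 s | 23x               |
| Polars (Single thread)      | 10 s  | 4x                |
| Polars (Multithread)        | 11 s  | 3.6x              |
| Polars (Lazy, Multithread)  | 11 s  | 3.6x              |
| Pandas                      | 40 s  |                   |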

Overall takeaway

  • Use Polars if you want a great API.
  • Use Polars for merging and group by.
  • Use Polars for single instruction, multiple data (SIMD) operations.
  • Use native Rust if you’re already familiar with Rust generic heap structures like vectors and hashmaps.
  • Use native Rust for linear mutation of the data with map and fold. You’ll get O(n) scalability that can be parallelized almost instantly with rayon.
  • Use Pandas when performance, scalability, and memory usage do not matter.

For me, both Polars and native Rust make a lot of sense for data between 1 GB and 1 TB.

I invite you to form your own opinion. The code is available here: https://github.com/haixuanTao/dataframe-python-rust

Pandas vs Rust (#1 Google Result)

Introduction

Pandas is the main data analysis package of Python. For many reasons, native Python has poor performance on data analysis without vectorization with NumPy and the like. Historically, Pandas was created by Wes McKinney to package those optimisations in a nice API to facilitate data analysis in Python.

This, however, is not necessary for Rust. Rust has great data performance natively. This is why Rust doesn’t really need a package like Pandas.

I believe the rustiest way to do data manipulation in Rust is to build a heap-allocated collection of data structs, such as a vector of structs.

This is my experience and reasoning comparing Pandas vs Rust.

Data

Performance benchmarks are done on this fairly arbitrary dataset: https://www.kaggle.com/START-UMD/gtd, which offers around 160,000 rows / 130 columns for a total size of 150 MB. The size of this dataset corresponds to the type of dataset I regularly encounter; that’s why I chose it. It isn’t the biggest dataset in the world, and more studies should probably be done on a larger dataset.

The merge will be done with another random dataset: https://datacatalog.worldbank.org/dataset/world-development-indicators, the WDICountry.csv


This blog was originally published on https://able.bio/haixuanTao/data-manipulation-pandas-vs-rust--1d70e7fc

Reading

Pandas

Reading and instantiating Data in Pandas is pretty straightforward, and handles by default many data quality problems:

import pandas as pd

path = "/home/peter/Documents/TEST/RUST/terrorism/src/globalterrorismdb_0718dist.csv"
df = pd.read_csv(path)

Rust Reading CSV

In Rust, managing bad-quality data is very tedious. In this dataset, some fields are empty, some lines are badly formatted, and some are not UTF-8 encoded.

To open the CSV, I used the csv crate but it does not solve all the issues listed above. With well-formatted data, reading can be done like so:

let path = "/home/peter/Documents/TEST/RUST/terrorism/src/foo.csv";
let mut rdr = csv::Reader::from_path(path).unwrap();

But with bad quality formatting, I had to add additional parameters like:

use std::fs::File;    
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

// ...

    let file = File::open(path)?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b',')
        .from_reader(transcoded); 

ref: https://stackoverflow.com/questions/53826986/how-to-read-a-non-utf8-encoded-csv-file

Rust Instantiating the data

To instantiate the data, I used Serde https://serde.rs/ for serializing and deserializing my data.

To use Serde, I needed to make a struct of my data. Having a struct of my data is great as it makes my code follow a model-based coding paradigm with a well-defined type for each field. It also enables me to implement traits and methods on top of them.

However, the data I wanted to use has 130 columns… And it seemed that there was no way to generate the definition of the struct automatically.

To avoid doing the definition manually, I had to build my own struct generator:


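use std::collections::HashMap;

// Assumed alias: one CSV row as a map from column name to raw string value.
type Record = HashMap<String, String>;

fn inspect(path: &str) {
    let mut record: Record = HashMap::new();

    let mut rdr = csv::Reader::from_path(path).unwrap();

    // Use the first row that deserializes cleanly as the sample row.
    for result in rdr.deserialize() {
        if let Ok(rec) = result {
            record = rec;
            break;
        }
    }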
    // Print Struct
    println!("#[skip_serializing_none]");
    println!("#[derive(Debug, Deserialize, Serialize)]");
    println!("struct DataFrame {{");
    for (key, value) in &record {
        println!("    #[serialize_always]");

        match value.parse::<i64>() {
            Ok(n) => {
                println!("    {}: Option<i64>,", key);
                continue;
            }
            Err(e) => (),
        }
        match value.parse::<f64>() {
            Ok(n) => {
                println!("    {}: Option<f64>,", key);
                continue;
            }
            Err(e) => (),
        }
        println!("    {}: Option<String>,", key);
    }
    println!("}}");
}

This generated the struct as follows:

use serde::{Deserialize, Serialize};
use serde_with::skip_serializing_none;

#[skip_serializing_none]
#[derive(Debug, Clone, Deserialize, Serialize)]
struct DataFrame {
    #[serialize_always]
    individual: Option<f64>,
    #[serialize_always]
    natlty3_txt: Option<String>,
    #[serialize_always]
    ransom: Option<f64>,
    #[serialize_always]
    related: Option<String>,
    #[serialize_always]
    gsubname: Option<String>,
    #[serialize_always]
    claim2: Option<String>,
    #[serialize_always]

    // ...

skip_serializing_none: avoids errors on empty fields in the CSV.

serialize_always: keeps the number of fields fixed when writing the CSV.
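
The type-guessing step of the generator can be isolated into a small, testable function. This is a sketch of the same logic (try i64, then f64, else String); `infer_type` and `field_line` are names chosen for this example.

```rust
/// Guesses the narrowest Rust type that can hold a CSV cell: i64, then f64, else String.
fn infer_type(value: &str) -> &'static str {
    if value.parse::<i64>().is_ok() {
        "i64"
    } else if value.parse::<f64>().is_ok() {
        "f64"
    } else {
        "String"
    }
}

/// Renders one struct field, wrapped in Option because any cell may be empty.
fn field_line(name: &str, sample: &str) -> String {
    format!("    {}: Option<{}>,", name, infer_type(sample))
}

fn main() {
    for (name, sample) in [("eventid", "197000000001"), ("nkill", "1.0"), ("country_txt", "Mexico")] {
        println!("{}", field_line(name, sample));
    }
}
```

One limitation to keep in mind: inferring from a single sample row can pick a type that is too narrow, for example i64 for a column whose first value is 1 but which contains 1.5 further down, so the generated struct may need manual adjustments.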

Now that I had my struct, I used serde deserialization to populate a vector of structs:

    let mut records: Vec<DataFrame> = Vec::new();

    for result in rdr.deserialize() {
        match result {
            Ok(rec) => {
                records.push(rec);
            }
            Err(e) => println!("{}", e),
        };
    }

This generated my vector of structs, hooray 🎉

On a general note with Rust, you shouldn’t expect things to work as smoothly as they would with Python.

On reading / instantiating data, Pandas wins hands down for CSV.

Filtering

Pandas

There are many ways to do filtering in pandas, the most common way for me is as follows:

df = df[df.country_txt == "United States"]
df.to_csv("python_output.csv")

Rust

To do filtering in Rust, we can refer to the docs for vector in Rust https://doc.rust-lang.org/std/vec/struct.Vec.html

There is a large umbrella of methods for Vector filtering, with many nightly features that are going to be great for data manipulation when they ship. For this use case, I used the retain method as it fitted my need perfectly:

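    // as_deref() compares without moving the String out of the borrowed record;
    // rows with a missing country are dropped instead of panicking.
    records.retain(|x| x.country_txt.as_deref() == Some("United States"));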
    let mut wtr =
        csv::Writer::from_path("output_rust_filter.csv")?;

    for record in &records {
        wtr.serialize(record)?;
    }

One big difference between Pandas and Rust is that Rust filtering uses closures (the equivalent of lambda functions in Python), whereas Pandas filtering uses the column-based Pandas API. Rust can therefore express more complex filters than Pandas. It also improves readability.
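
As an illustration of a more complex closure-based filter, here is a sketch that combines three conditions on optional fields. `Event` is a cut-down stand-in for the generated struct: `country_txt` and `nkill` appear in this post, while `iyear` is a column name I am assuming for the example.

```rust
#[derive(Debug, Clone)]
struct Event {
    country_txt: Option<String>,
    nkill: Option<f64>,
    iyear: Option<i64>, // assumed column name, for illustration
}

/// Keeps events in `country` after `year` with at least one casualty.
/// Missing values never match, so no unwrap is needed.
fn filter_events(events: &mut Vec<Event>, country: &str, year: i64) {
    events.retain(|e| {
        e.country_txt.as_deref() == Some(country)
            && e.iyear.map_or(false, |y| y > year)
            && e.nkill.map_or(false, |n| n >= 1.0)
    });
}

fn main() {
    let mut events = vec![
        Event { country_txt: Some("United States".into()), nkill: Some(2.0), iyear: Some(2001) },
        Event { country_txt: Some("United States".into()), nkill: None, iyear: Some(2005) },
        Event { country_txt: None, nkill: Some(1.0), iyear: Some(2010) },
    ];
    filter_events(&mut events, "United States", 2000);
    println!("{:?}", events);
}
```

Everything lives in one closure, so any condition that can be written as ordinary Rust can be part of the filter.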

Performance

|        | Time          | Mem usage      |
|--------|---------------|----------------|
| Pandas | 3.0 s         | 2.5 GB         |
| Rust   | 1.6 s 🔥 -50% | 1.7 GB 🔥 -32% |

Even though the Pandas version uses the optimized Pandas API for filtering, we get significantly better performance using Rust.

On filtering, Rust seems to be more capable and faster. 🚅

Groupby

Pandas

Group-bys are a big part of the data reduction pipeline in Python; they usually go as follows:

df = df.groupby(by="country_txt", as_index=False).agg(
    {"nkill": "sum", "individual": "mean", "eventid": "count"}
)
df.to_csv("python_output_groupby.csv")

Rust

For group by and data reduction, thanks to David Sanders, group by can be done as follows:

use itertools::Itertools;


// ...

#[derive(Debug, Deserialize, Serialize)]
struct GroupBy {
    country: String,
    total_nkill: f64,
    average_individual: f64,
    count: f64,
}

// ... 

    let groups = records
        .into_iter()
        .sorted_unstable_by(|a, b| Ord::cmp(&a.country_txt, &b.country_txt))
        .group_by(|record| record.country_txt.clone())
        .into_iter()
        .map(|(country, group)| {
            let (total_nkill, count, average_individual) = group.into_iter().fold(
                (0., 0., 0.),
                |(total_nkill, count, average_individual), record| {
                    (
                        total_nkill + record.nkill.unwrap_or(0.),
                        count + 1.,
                        average_individual + record.individual.unwrap_or(0.),
                    )
                },
            );
            GroupBy {
                country: country.unwrap(),
                total_nkill,
                average_individual: average_individual / count,
                count,
            }
        })
        .collect::<Vec<_>>();
    let mut wtr =
        csv::Writer::from_path("output_rust_groupby.csv")
            .unwrap();

    for group in &groups {
        wtr.serialize(group)?;
    }


Although this solution is not as elegant as Pandas groupby, it gives a lot of flexibility on the computation of the reduced fields. Again, thanks to Closures.

I think more reduction methods beyond sum and fold would greatly improve the development experience of map-reduce style operations in Rust. We would then probably have an equivalent experience between Rust and Pandas.
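As an alternative sketch, the same reduction can be written with only the standard library by accumulating into a map keyed by country. This swaps the itertools sort-then-group approach for a single pass over a `BTreeMap`. The `Record` and `Summary` types are hypothetical, trimmed-down stand-ins. As in the code above, missing values count as 0, but records without a country are skipped here rather than unwrapped.

```rust
use std::collections::BTreeMap;

// Hypothetical, trimmed-down record with only the columns used here.
struct Record {
    country_txt: Option<String>,
    nkill: Option<f64>,
    individual: Option<f64>,
}

// Running totals for one country.
#[derive(Debug, Default, PartialEq)]
struct Summary {
    total_nkill: f64,
    sum_individual: f64,
    count: u64,
}

impl Summary {
    fn average_individual(&self) -> f64 {
        self.sum_individual / self.count as f64
    }
}

/// Single pass over the records; no sort is needed because the map
/// finds the right accumulator for each key.
fn group_by_country(records: &[Record]) -> BTreeMap<String, Summary> {
    let mut groups: BTreeMap<String, Summary> = BTreeMap::new();
    for r in records {
        // Records without a country are skipped in this sketch.
        if let Some(country) = &r.country_txt {
            let s = groups.entry(country.clone()).or_default();
            s.total_nkill += r.nkill.unwrap_or(0.0);
            s.sum_individual += r.individual.unwrap_or(0.0);
            s.count += 1;
        }
    }
    groups
}
```

A `BTreeMap` keeps the countries sorted, which gives a stable row order when the result is written out.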

Performance

| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 2.78 | 2.5 |
| Rust | 2.0 (🔥 -35%) | 1.7 (🔥 -32%) |

Although Rust performs better, I would advise using Pandas for map-reduce-heavy applications, as its API seems better suited to them.

Mutation

Pandas

There are many ways to do mutation in Pandas; I usually do the following for performance and functional style:

df["computed"] = df["nkill"].map(lambda x: (x - 10) / 2 + x ** 2 / 3)
df.to_csv("python_output_map.csv")

Rust

For mutation, the functional iter of Rust really makes this part a walk in the park:

    records.iter_mut().for_each(|x: &mut DataFrame| {
        let nkill = match &x.nkill {
            Some(nkill) => nkill,
            None => &0.,
        };

        x.computed = Some((nkill - 10.) / 2. + nkill * nkill / 3.);
    });

    let mut wtr = csv::Writer::from_path(
        "output_rust_map.csv",
    )?;
    for record in &records {
        wtr.serialize(record)?;
    }

Performance

| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 12.82 | 4.7 |
| Rust | 1.58 (🔥 -87%) | 1.7 (🔥 -64%) |

This is where the difference really became apparent to me. Pandas does not scale for row-by-row lambda functions. It would have been even worse if I had done an operation involving several columns.

Rust is way better for line-by-line mutation natively.
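For reference, a row-wise update that reads several columns is no harder to write in Rust than the single-column case. The sketch below uses a hypothetical, trimmed-down `Record`, and the extra `individual` term is only there to illustrate a second column; it is not part of the benchmark.

```rust
// Hypothetical, trimmed-down record; `computed` mirrors the field used above.
#[derive(Debug)]
struct Record {
    nkill: Option<f64>,
    individual: Option<f64>,
    computed: Option<f64>,
}

/// Row-wise update that reads two columns at once.
/// Missing values are treated as 0, as in the snippet above.
fn add_computed(records: &mut [Record]) {
    for r in records.iter_mut() {
        let nkill = r.nkill.unwrap_or(0.0);
        let individual = r.individual.unwrap_or(0.0);
        // Same formula as above, plus an illustrative term using a second column.
        r.computed = Some((nkill - 10.0) / 2.0 + nkill * nkill / 3.0 + individual);
    }
}
```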

Merging

Python

Merging in Python is generally pretty efficient. It usually goes like this:

df_country = pd.read_csv(
    "/home/peter/Documents/TEST/RUST/terrorism/src/WDICountry.csv"
)

df_merge = pd.merge(
    df, df_country, left_on="country_txt", right_on="Short_Name"
)
df_merge.to_csv("python_output_merge.csv")

Rust

For Rust, however, this is a tricky part because, with structs, merging isn’t really a thing. For me, the rustiest way of doing a merge is to add a nested field containing the other struct we want to join data with.

I first created a new struct and a new vector for the new data:

#[skip_serializing_none]
#[derive(Clone, Debug, Deserialize, Serialize)]
struct DataFrameCountry {
    #[serialize_always]
    SNA_price_valuation: Option<String>,
    #[serialize_always]
    IMF_data_dissemination_standard: Option<String>,
    #[serialize_always]
    Latest_industrial_data: Option<String>,
    #[serialize_always]
    System_of_National_Accounts: Option<String>,
    //...

// ...

    let mut records_country: Vec<DataFrameCountry> = Vec::new();
    let file = File::open(path_country)?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b',')
        .from_reader(transcoded); 

    for result in rdr.deserialize() {
        match result {
            Ok(rec) => {
                records_country.push(rec);
            }
            Err(e) => println!("{}", e),
        };
    }

I then attached a clone of this new struct to every record of the previous struct, matching on a specific field that is unique.


impl DataFrame {
    fn add_country_ext(&mut self, country: Option<DataFrameCountry>) {
        self.country_merge = country;
    }
}

//...

    for country in records_country {
        records
            .iter_mut()
            .filter(|record| record.country_txt == country.Short_Name)
            .for_each(|x| {
                x.add_country_ext(Some(country.clone()));
            });
    }
    let mut wtr =
        csv::Writer::from_path("output_rust_join.csv")
            .unwrap();
    for record in &records {
        wtr.serialize(record)?;
    }

I cloned the data for convenience and also for better comparability, but a reference can be passed instead if you can manage the lifetimes.

And there we go! 🚀

Except that a nested struct is not yet serializable to CSV in Rust: https://github.com/BurntSushi/rust-csv/pull/197
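As a sketch of the reference-based variant, the join below borrows country rows instead of cloning them, and builds a lookup index once instead of rescanning the records for every country. The struct definitions are hypothetical, trimmed-down stand-ins (`short_name` plays the role of `Short_Name`, and `region` is an illustrative column, not one taken from the dataset).

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down versions of the two structs.
struct Record {
    country_txt: Option<String>,
}

struct Country {
    short_name: Option<String>,
    region: Option<String>, // illustrative column, not taken from the dataset
}

/// Left join by reference: each record is paired with a borrowed country row.
/// Building the index once avoids rescanning `records` for every country.
fn join<'a>(
    records: &'a [Record],
    countries: &'a [Country],
) -> Vec<(&'a Record, Option<&'a Country>)> {
    // Index countries by name; rows without a name cannot be matched.
    let index: HashMap<&str, &Country> = countries
        .iter()
        .filter_map(|c| c.short_name.as_deref().map(|name| (name, c)))
        .collect();

    records
        .iter()
        .map(|r| {
            let country = r
                .country_txt
                .as_deref()
                .and_then(|name| index.get(name).copied());
            (r, country)
        })
        .collect()
}
```

Records with no matching country keep `None` on the right-hand side, which is roughly what a left join would give.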

So I had to adapt it to:

impl DataFrame {
    fn add_country_ext(&mut self, country: Option<DataFrameCountry>) {
        self.country_ext = Some(format!("{:?}", country))
    }
}

But then, we got a sort of merge! 🚀

Performance

| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 22.47 | 11.8 |
| Rust | 5.48 (🔥 -75%) | 2.6 (🔥 -78%) |

Rust can build nested structs that are as capable as Pandas merges, if not more. However, it isn’t really a one-to-one comparison, and which one fits best is going to depend on your use case.

Conclusion

After this experience, here are my takeaways.

  • Use Pandas when you can: small CSVs (<1M lines), simple operations, data cleaning…
  • Use Rust when you have: complex operations, memory-heavy or time-consuming pipelines, custom functions, scalable software…

That being said, Rust offers impressive flexibility compared to Pandas. Add the fact that Rust is far more capable of multi-threading than Pandas, and I believe that Rust can solve problems Pandas simply cannot.

Additionally, the possibility of running Rust on any platform (Web, Android, or embedded) also creates new opportunities for data manipulation in places inconceivable for Pandas, and can provide solutions for yet-to-be-resolved challenges.

Performance

The performance tables give us an insight into what to expect from Rust. I believe the speedup can range from 2x at minimum up to 50x for large data pipelines. Memory use should decrease even more, as memory usage accumulates over time with Python.

Scraping Python vs Rust

Introduction

Web scraping is about as error-prone as you can imagine. Pages might not exist, HTML elements might not always be there… And so, a language that handles errors and edge cases well at runtime without crashing is a huge plus.
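To make that concrete, here is a minimal sketch of how missing or malformed fields can be handled with `Option` instead of `unwrap`. The function names are hypothetical, and the `£51.77`-style price format is an assumption about the scraped page.

```rust
/// Parse a price such as "£51.77" into a number.
/// Returns None instead of panicking when the text is empty or malformed.
fn parse_price(text: &str) -> Option<f64> {
    text.trim()
        .trim_start_matches('£')
        .parse::<f64>()
        .ok()
}

/// Build one CSV row, substituting an empty string for any missing field.
fn to_row(title: Option<&str>, price_text: Option<&str>) -> [String; 2] {
    let title = title.unwrap_or("").to_string();
    let price = price_text
        .and_then(parse_price)
        .map(|p| p.to_string())
        .unwrap_or_default();
    [title, price]
}
```

With this shape, a page that lacks a title or a price produces a row with empty cells rather than a crash in the middle of the run.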

Performance

Performance test of scraping the 50 pages of http://books.toscrape.com/catalogue/page-1.html

| Name | CPU usage | Time (s) |
|---|---|---|
| Synchronous Python | 5% | 44.3 |
| Synchronous Rust | 7% | 55 |
| Async Python | 63% | 2.5 |
| Async Rust | 107% | 2.25 |

Performance is pretty similar at such a low number of requests, since most of the time is spent downloading. With significantly more requests, a bigger difference might appear.


This blog was originally published on: https://able.bio/haixuanTao/web-scraper-python-vs-rust--d6176429

Synchronous Python code

import requests
import bs4 as bs
import csv
URL = "http://books.toscrape.com/catalogue/page-%d.html"

with open('./test_python.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    # range's end is exclusive: 51 covers pages 1 to 50
    for i in range(1, 51):
        response = requests.get(URL % i)
        if response.status_code == 200:
            content = response.content
            soup = bs.BeautifulSoup(content, 'lxml')
            articles = soup.find_all('article')

            for article in articles:
                information = []
                information.append(article.find(
                    'p', class_='price_color').text)
                information.append(article.find('h3').find('a').get('title'))
                spamwriter.writerow(information)


Synchronous Rust code

use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);

    let document = Document::from(response.as_str());

    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();

        // println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
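    // Awaiting each call before starting the next keeps the downloads sequential.
    // `1..=50` is inclusive, so all 50 pages are fetched.
    for i in 1..=50 {
        test(&i).await.unwrap();
    }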
    Ok(())
}

Asynchronous

During scraping, most of the time is spent downloading files rather than computing.

However, with synchronous runtimes, pages are scraped one by one and so downloaded one by one. Each download can take time and idle the whole process. Therefore, if we can manage to not wait for the completion of each download, we will gain efficiency.
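The gain from not waiting on each download can be sketched without any network access. The example below is an illustration only: it uses OS threads from the standard library rather than async tasks, and `fake_download` is a hypothetical stand-in that sleeps instead of fetching a page.

```rust
use std::thread;
use std::time::Duration;

/// Stand-in for a download: sleeps instead of hitting the network.
fn fake_download(page: u32, latency: Duration) -> String {
    thread::sleep(latency);
    format!("page-{}", page)
}

/// One page after another: total time is roughly pages * latency.
fn download_sequential(pages: &[u32], latency: Duration) -> Vec<String> {
    pages.iter().map(|&p| fake_download(p, latency)).collect()
}

/// All pages at once: total time is roughly one latency.
fn download_concurrent(pages: &[u32], latency: Duration) -> Vec<String> {
    // Collecting first makes sure every thread is started before any join.
    let handles: Vec<_> = pages
        .iter()
        .map(|&p| thread::spawn(move || fake_download(p, latency)))
        .collect();
    // Joining in spawn order keeps the results in page order.
    handles
        .into_iter()
        .map(|h| h.join().expect("download thread panicked"))
        .collect()
}
```

Both functions return the same pages in the same order; only the waiting overlaps. An async runtime gets the same effect with far cheaper tasks than one OS thread per page.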

Python

It is possible using the “asyncio” library, and it might look like this:

import asyncio
import requests
import bs4 as bs
import csv

URL = "http://books.toscrape.com/catalogue/page-%d.html"


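async def get_book(url, spamwriter):
    # requests.get is blocking, so run it in a worker thread (Python 3.9+)
    # to let the downloads overlap instead of stalling the event loop.
    response = await asyncio.to_thread(requests.get, url)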
    if response.status_code == 200:
        content = response.content
        soup = bs.BeautifulSoup(content, 'lxml')
        articles = soup.find_all('article')

        for article in articles:
            information = [url]
            information.append(article.find(
                'p', class_='price_color').text)
            information.append(article.find('h3').find('a').get('title'))
            spamwriter.writerow(information)


async def main():
    with open('./test_async_python.csv', 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        tasks = []
        for i in range(1, 51):  # range's end is exclusive: pages 1 to 50
            tasks.append(asyncio.create_task(
                get_book(URL % i, spamwriter)))

        for task in tasks:
            await task

asyncio.run(main())

Python provides the async/await syntax, which makes asynchronous code easier to read and write.

Rust

Rust, contrary to Python, has been built with concurrency in mind. It enforces thread safety at compile time and is extremely efficient. The fact that the language is, by nature, very fast makes it great for coroutines. The code might look like this:

use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);

    let document = Document::from(response.as_str());

    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();

        println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let mut handles: std::vec::Vec<_> = Vec::new();
    // `1..=50` is inclusive, so all 50 pages are fetched.
    for i in 1..=50 {
        let job = tokio::spawn(async move { test(&i).await });
        handles.push(job);
    }

    let mut results = Vec::new();
    for job in handles {
        results.push(job.await);
    }

    Ok(())
}


Productivity

This humble personal productivity cheatsheet is here to help others identify things that can increase their productivity.

Prezto

Prezto's main idea is that the shell can be interactive.

Installation

sudo apt-get update
sudo apt-get install zsh
git clone --recursive https://github.com/sorin-ionescu/prezto.git "${ZDOTDIR:-$HOME}/.zprezto"

Features I use on a daily basis with Prezto:

  • auto-completion
  • auto-suggestion
  • docker completion
  • git completion

More info: https://github.com/sorin-ionescu/prezto

FZF

fzf's main idea is that you should never have to know by heart strings that you can approximate.

Installation

git clone --depth 1 https://github.com/junegunn/fzf.git ~/.fzf
~/.fzf/install

Features I use on a daily basis with fzf:

  • kill **
  • ctrl+r
  • ctrl+t
  • cd **
  • vim **

More info at: https://github.com/junegunn/fzf

VSCode

VSCode makes it really easy to have flexibility and automation put in place.

Productivity shortcuts I use on a daily basis:

  • ctrl+p : To open a file
  • ctrl+` : To open the terminal
  • ctrl+shift+p : To access extension functionalities.
  • ctrl+shift+p+Open User Settings(JSON) : For scripted settings.
  • ctrl+shift+p+Snippets : For automating the generation of code.

Vim (VSCode)

Vim's idea is to allow on-the-fly automation by assigning a set of functionality to each key.

Vim has some great automation features that I just can't live without:

  • / : Search
  • v : Visual mode to select block of text
  • :+s : Replace within a selection
  • ctrl+z ... f+g : Jump in and out of vim mode
  • . : Repeat previous command.

Touchtyping

Touch typing's main idea is that you can type faster by moving your hands less.

And, you can reduce hand movement by learning how to maximize the utilization of your fingers.

Features I use on a daily basis:

  • Not having sore wrists.

Getting Started

https://www.typingclub.com/