Hey, welcome to my blog!
Software engineer who believes in Rust for ML/AI.
Get in touch: tao.xavier@outlook.com
Deep Learning in Rust
Introduction
Building Deep Learning algorithms is paramount for doing Data Science in Rust. In this post, I show:
- how Rust can support the GPU,
- how much faster Rust can be than Python,
- good and bad use cases for Deep Learning in Rust.
State of the art of Deep Learning in Rust
Deep Learning in the Rust ecosystem is spread between native libraries like linfa and C++ bindings of common libraries like TensorFlow, PyTorch, and ONNX Runtime.
I have found onnxruntime-rs to be a convenient crate for DL, offering:
- the ability to load sklearn, TensorFlow, and PyTorch models,
- better performance than native PyTorch or TensorFlow,
- a small bundle size of ~30 MB, compared to tch-rs's 1.2 GB bundle.
This post is therefore based on onnxruntime-rs.
This blog was originally published here: https://able.bio/haixuanTao/deep-learning-in-rust-with-gpu--26c53a7f
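To give a feel for the crate, here is a minimal inference sketch modeled on the crate's README; the model path, input shape, and names are placeholders, not the exact code from this post:

use onnxruntime::{environment::Environment, ndarray::Array2, tensor::OrtOwnedTensor, GraphOptimizationLevel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One ONNX Runtime environment per process
    let environment = Environment::builder().with_name("demo").build()?;
    // "model.onnx" is a placeholder for whatever model you export
    let mut session = environment
        .new_session_builder()?
        .with_optimization_level(GraphOptimizationLevel::Basic)?
        .with_model_from_file("model.onnx")?;
    // Dummy batch of one input vector; the shape must match the model's input
    let input = Array2::<f32>::zeros((1, 128));
    let outputs: Vec<OrtOwnedTensor<f32, _>> = session.run(vec![input])?;
    println!("output shape: {:?}", outputs[0].shape());
    Ok(())
}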
Hardware
Initially, onnxruntime-rs did not support GPU/CUDA, even though the underlying C API does.
By tweaking onnxruntime-rs, I could reach the GPU C API and run DL inference on the GPU.
I opened a PR providing CUDA support for Linux and Windows: https://github.com/nbigaouette/onnxruntime-rs/pull/87
With similar work, most other acceleration hardware could be added as well.
GPU Support
To enable GPU support, I had to:
- Add two header files in bindgen's `wrapper.h` file as follows:
#include "onnxruntime_c_api.h"
#if !defined(__APPLE__)
#include "cpu_provider_factory.h"
#include "cuda_provider_factory.h"
#endif
- Add a feature flag in Cargo.toml:
[features]
cuda = []
- Add a safe API on top of the newly generated bindings (a usage sketch follows after this list):
/// Set the session to use the CPU
#[cfg(feature = "cuda")]
pub fn use_cpu(self, use_arena: i32) -> Result<SessionBuilder<'a>> {
    unsafe {
        sys::OrtSessionOptionsAppendExecutionProvider_CPU(self.session_options_ptr, use_arena);
    }
    Ok(self)
}

/// Set the session to use CUDA
#[cfg(feature = "cuda")]
pub fn use_cuda(self, device_id: i32) -> Result<SessionBuilder<'a>> {
    unsafe {
        sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, device_id);
    }
    Ok(self)
}
- Generate bindings for Linux:
>>> cargo build --package onnxruntime-sys --features "generate-bindings cuda" --target x86_64-unknown-linux-gnu
- Generate bindings for Windows through a Windows VM:
>>> cargo build --features "generate-bindings cuda" --target x86_64-pc-windows-msvc
- Modify the GitHub CI for automated build tests:
- name: Download prebuilt archive (GPU, x86_64-unknown-linux-gnu)
  uses: actions-rs/cargo@v1
  with:
    command: build
    args: --target x86_64-unknown-linux-gnu --features cuda
- name: Verify prebuilt archive downloaded (GPU, x86_64-unknown-linux-gnu)
  run: ls -lh target/x86_64-unknown-linux-gnu/debug/build/onnxruntime-sys-*/out/onnxruntime-linux-x64-gpu-1.*.tgz
# ******************************************************************
- name: Download prebuilt archive (GPU, x86_64-pc-windows-msvc)
  uses: actions-rs/cargo@v1
  with:
    command: build
    args: --target x86_64-pc-windows-msvc --features cuda
- name: Verify prebuilt archive downloaded (GPU, x86_64-pc-windows-msvc)
  run: ls -lh target/x86_64-pc-windows-msvc/debug/build/onnxruntime-sys-*/out/onnxruntime-win-gpu-x64-1.*.zip
- And update the documentation.
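With those pieces in place, pointing a session at the GPU becomes a one-liner. A hypothetical usage sketch, assuming an `environment` built as in the inference snippet earlier:

// Route the session to GPU 0 when the "cuda" feature is enabled
#[cfg(feature = "cuda")]
let session = environment
    .new_session_builder()?
    .use_cuda(0)?
    .with_model_from_file("model.onnx")?;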
Performance
| | Time per phrase | Speedup |
|---|---|---|
| Rust ONNX CPU | ~125 ms | |
| Rust ONNX GPU | ~10 ms | x12 |
Note: I have a six-core CPU and a GTX 1050 GPU.
As expected, the GPU drastically reduced inference time.
However, I did not find a significant speedup between ONNX Runtime in Rust and ONNX Runtime in Python.
Preprocessing
Preprocessing for Deep Learning is inevitable and can be very expensive. In the case of NLP, preprocessing translates to tokenizing.
To compare performance, I used the Hugging Face tokenizer, which exists as a pure-Python implementation, a native Rust implementation, and a Rust implementation bound to Python via PyO3.
The code for the native Python tokenizer is as follows:
from transformers import BertTokenizer
PRE_TRAINED_MODEL_NAME = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
encoding = tokenizer(
df["Title"].to_numpy().tolist(),
add_special_tokens=True,
max_length=60,
return_token_type_ids=False,
padding="max_length",
truncation=True,
return_attention_mask=True,
return_tensors="np",
)
The Rust-backed Python tokenizer, BertTokenizerFast:
from transformers import BertTokenizerFast
PRE_TRAINED_MODEL_NAME = "bert-base-cased"
tokenizer = BertTokenizerFast.from_pretrained(PRE_TRAINED_MODEL_NAME)
encoding = tokenizer(
df["Title"].to_numpy().tolist(),
add_special_tokens=True,
max_length=60,
return_token_type_ids=False,
padding="max_length",
truncation=True,
return_attention_mask=True,
return_tensors="np",
)
And the native Rust Hugging Face tokenizer:
use tokenizers::models::wordpiece::WordPieceBuilder;
use tokenizers::normalizers::bert::BertNormalizer;
use tokenizers::pre_tokenizers::bert::BertPreTokenizer;
use tokenizers::processors::bert::BertProcessing;
use tokenizers::tokenizer::AddedToken;
use tokenizers::tokenizer::{EncodeInput, Encoding, Tokenizer};
use tokenizers::utils::padding::{PaddingDirection::Right, PaddingParams, PaddingStrategy::Fixed};
use tokenizers::utils::truncation::TruncationParams;
use tokenizers::utils::truncation::TruncationStrategy::LongestFirst;
fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
let vocab_path = "./src/vocab.txt";
let wp_builder = WordPieceBuilder::new()
.files(vocab_path.into())
.continuing_subword_prefix("##".into())
.max_input_chars_per_word(100)
.unk_token("[UNK]".into())
.build()
.unwrap();
let mut tokenizer = Tokenizer::new(Box::new(wp_builder));
tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
tokenizer.with_truncation(Some(TruncationParams {
max_length: 60,
strategy: LongestFirst,
stride: 0,
}));
tokenizer.with_post_processor(Box::new(BertProcessing::new(
("[SEP]".into(), 102),
("[CLS]".into(), 101),
)));
tokenizer.with_normalizer(Box::new(BertNormalizer::new(true, true, false, false)));
tokenizer.add_special_tokens(&[
AddedToken {
content: "[PAD]".into(),
single_word: false,
lstrip: false,
rstrip: false,
},
AddedToken {
content: "[CLS]".into(),
single_word: false,
lstrip: false,
rstrip: false,
},
AddedToken {
content: "[SEP]".into(),
single_word: false,
lstrip: false,
rstrip: false,
},
AddedToken {
content: "[MASK]".into(),
single_word: false,
lstrip: false,
rstrip: false,
},
]);
tokenizer.with_padding(Some(PaddingParams {
strategy: Fixed(60),
direction: Right,
pad_id: 0,
pad_type_id: 0,
pad_token: "[PAD]".into(),
}));
// ... `df` here is the Vec of phrases to encode
let input_ids = tokenizer.encode_batch(df, true).unwrap();
Ok(())
}
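The resulting encodings then feed the model. As a small follow-up sketch (the exact tensor conversion depends on your model's expected input), the padded token ids can be pulled out of each `Encoding` like this:

// Collect the padded token ids of each phrase into a batch
let ids: Vec<Vec<u32>> = input_ids
    .iter()
    .map(|encoding| encoding.get_ids().to_vec())
    .collect();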
Performance
| | Time per phrase | Speedup |
|---|---|---|
| Python BertTokenizer | 1000 μs | |
| Python BertTokenizerFast | 200-600 μs | x2.5 |
| Rust Tokenizer | 50-150 μs | x4 |
You can tokenize 4 times faster in Rust than in Python, with the same Hugging Face tokenizer library.
Preprocessing can be very performant in Rust, making the case that Rust can outperform Python for Deep Learning.
Batch inference: running BERT on 10k phrases
At work, we often develop Deep Learning models to be used on large batches of data.
To see if Rust can improve this use case, I trained a BERT-like model and ran inference on 10k phrases in both Python and Rust.
Performance
| 10k phrases | Python | Rust |
|---|---|---|
| Booting | 4 s | 1 s |
| Encoding | 0.7 s | 0.3 s |
| DL Inference | 75 s | 75 s |
| Total | 80 s | 76 s |
| Memory usage | 1 GiB | 0.7 GiB |
As DL inference takes the majority of the time, Rust only marginally improves overall performance.
This is an example of a bad use case for Rust: the time is spent inside the C API, which Rust does not affect.
You can check out the code for this specific job at: https://github.com/haixuanTao/bert-onnx-rs-pipeline
ONNX Server: Serving BERT as an API
Another use case is serving a BERT-like model as a server with a REST endpoint.
To see if Rust could be more performant than Python, I served the ONNX model through Actix Web and, to benchmark it, made a clone in Python with FastAPI.
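The server side stays small. Here is a hypothetical skeleton of the Actix Web side, not the exact code of the benchmark; a real endpoint would keep the tokenizer and ONNX session in shared application state:

use actix_web::{post, App, HttpServer, Responder};

// Hypothetical handler: tokenizing and running the ONNX session would go here
#[post("/predict")]
async fn predict(phrase: String) -> impl Responder {
    format!("prediction for: {}", phrase)
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(predict))
        .bind("127.0.0.1:8080")?
        .run()
        .await
}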
Performance
For a request of one phrase:
| | Python FastAPI | Rust Actix Web | Speedup |
|---|---|---|---|
| Encoding | 400 μs | 100 μs | |
| ONNX Inference | ~10 ms | ~10 ms | |
| API overhead | ~2 ms | ~1 ms | |
| Mean latency | 12.8 ms | 10.4 ms | -20% |
| Requests/sec | 77.5 #/s | 95 #/s | +22% |
The gain in performance comes from moving from Python libraries considered "fast" to Rust ones:
- FastAPI → Actix Web
- BertTokenizerFast → Rust Tokenizer
Thus, as Rust libraries tend to be faster than their Python counterparts, Rust will be faster whenever the application is a composition of libraries.
That's why I can see Rust being a good fit for highly performance-centric applications such as real-time Deep Learning, embedded Deep Learning, and large-scale AI servers.
Check the code: https://github.com/haixuanTao/bert-onnx-rs-server
In conclusion, should you use Rust for Deep Learning?
- As with the rest of the Rust ecosystem, use it if you need performance and resilience! But be aware that using Rust does not automatically make things fast!
- If you need quick prototyping in a language friendly to Data Scientists, you are better off with Python!
Pandas vs Polars
Introduction
Everyone loves the API of Pandas. It's fast, easy, and well documented. There are some rough edges, but most of the time, it's just a blast.
When it comes to production, though, Pandas is trickier. It does not scale very well, there is no multithreading, it's not thread-safe, and it's not memory efficient.
All those problems are the raison d'être of Rust.
What if there was a DataFrame API written in Rust that solved all those issues while keeping a nice API?
This blog was originally published here: https://able.bio/haixuanTao/data-manipulation-polars-vs-rust--3def44c8
Polars
Well, Polars lets you read, write, filter, apply functions, group by, and merge, all in an API similar to Pandas, but in Rust.
It uses Apache Arrow, a data framework purpose-built for efficient data processing and data sharing across languages.
3 reasons for choosing Polars
Reason #1. Performance.
It's killing it performance-wise.
Reason #2. The API is straightforward.
Do you want to mutate the data? Use `apply`. Do you want to filter the data? Use `filter`. Do you want to merge? Use `join`. There is not going to be Rust syntax like `struct`, `derive`, `impl`…
Reason #3. No troubles with the borrow checker.
It uses Arc-Mutex, which means that you can clone variables as much as you like. Variables are only references to in-memory data. No more fighting with the borrow checker. Mutability is limited to the API calls, which preserve the consistency/thread-safety of the data.
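As a hypothetical sketch of what that buys you, assuming a DataFrame with an `OpenStatus` column and the Polars version used in this post:

use polars::prelude::*;

fn keep_open(df: &DataFrame) -> std::result::Result<DataFrame, PolarsError> {
    // Cloning is cheap: it clones an Arc pointing at the data, not the data itself
    let _cheap_copy = df.clone();
    // Pandas-style: build a boolean mask, then filter
    let mask = df.column("OpenStatus")?.equal("open");
    df.filter(&mask)
}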
3 caveats of Polars
Caveat #1. Issues…
Building a DataFrame API is hard. Pandas took 12 years to reach 1.0.0. And, as Polars is rather young, you may face unexpected issues. In my case, there were issues with \n characters, double-quote characters, and long utf8.
On the other hand, those are great first issues to get started with contributing and getting better at Rust.
Caveat #2. Getting comfortable with two APIs: Polars and Arrow.
As much of the heavy lifting is done by the Apache Arrow backend, you'll have to get used to reading the documentation of Polars but also of Apache Arrow. Both are pretty straightforward, but it might feel tiring for someone looking for a drop-in replacement for Pandas.
Caveat #3. Compile time…
Sadly, an uncached compile takes around 3 minutes, and it uses a lot of resources.
Case Study
Now the question is: is it better than native Rust, as explored in my previous blog post?
Let's do a hands-on comparison on a data pipeline and get a feel for it.
In this case study, I use the Stack Overflow Kaggle dataset. I read the database, parse the dates, merge the first tag with the Wikipedia comparison of programming languages, group by the status of the question asked, and retrieve the distribution of language features within each "status" of questions.
We'll compare the Polars API and native Rust generic heap structures on this task.
- I'll go slightly quicker over the native Rust, as I already gave more details in the previous post.
- Multithreading is done on 12 threads, on an Intel(R) Core(TM) i7-8750H with 20 GB of RAM.
- The database is 4.2 GB, around 3.6 million rows.
Reading
Reading in Polars
Reading in Polars is pretty straightforward:
use polars::prelude::*;
//...
let mut df = CsvReader::from_path(path)?
    .with_n_threads(Some(1)) // comment this line out for multithreading
    .with_encoding(CsvEncoding::LossyUtf8)
    .has_header(true)
    .finish()?;
Reading in Native Rust
Reading in Rust using `csv` and `serde` requires that you already have a struct; in my case, my struct is `utils::NativeDataFrame`:
let file = File::open(path)?;
let mut rdr = csv::ReaderBuilder::new().delimiter(b',').from_reader(file);
let mut records: Vec<utils::NativeDataFrame> = rdr
    .deserialize()
    .filter_map(|result| match result {
        Ok(rec) => Some(rec), // keep rows that parsed, drop the rest
        Err(_) => None,
    })
    .collect();
Performance
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 12 s | 2.4x |
| Polars (single thread) | 19 s | 1.5x |
| Polars (multithreaded) | 6.6 s | 4.5x |
| Pandas | 29.6 s | |
For reading, Polars is faster than Pandas and native Rust, as it can read in multiple threads.
Apply
Applying Function in Polars
To apply a function in Polars, you can use the default `apply` or `may_apply`. I prefer the latter.
fn str_to_date(dates: &Series) -> std::result::Result<Series, PolarsError> {
    let fmt = Some("%m/%d/%Y %H:%M:%S");
    Ok(dates.utf8()?.as_date64(fmt)?.into_series())
}

fn count_words(dates: &Series) -> std::result::Result<Series, PolarsError> {
    Ok(dates
        .utf8()?
        .into_iter()
        .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.split(' ').count() as u64))
        .collect::<UInt64Chunked>()
        .into_series())
}
// ...
// Apply Format Date
df.may_apply("PostCreationDate", str_to_date)?;
let t_formatting = Instant::now();
// Apply Custom counting words in string
df.may_apply("BodyMarkdown", count_words)?;
Note that parallel apply is not yet implemented for utf8 series.
Applying Function in Native Rust
What I like about native Rust mutation is that the syntax is standard across iterators, so once you get comfortable with it, you can apply it everywhere.
use chrono::{DateTime, NaiveDate, NaiveDateTime, NaiveTime};
// use rayon::prelude::*; for multithreads
// Apply Format Date
let fmt = "%m/%d/%Y %H:%M:%S";
records
.iter_mut() // .par_iter_mut() for multithreads
.for_each(|record: &mut utils::NativeDataFrame| {
record.PostCreationDatetime =
match DateTime::parse_from_str(
record.PostCreationDate.as_ref().unwrap(), fmt) {
Ok(dates) => Some(dates),
Err(_) => None,
}
});
// Apply Custom Formatting counting words in string
records
.iter_mut() // .par_iter_mut() for multithreads
.for_each(|record: &mut utils::NativeDataFrame| {
record.CountWords =
Some(
record.BodyMarkdown.as_ref().unwrap().split(' ').count() as f64
)
});
Performance for formatting dates
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 0.98 s | 8x |
| Native Rust (multithreaded) | 0.148 s | 52x |
| Polars (single thread) | 0.88 s | 8.8x |
| Pandas | 7.8 s | |
Performance for counting words
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 9 s | 2.7x |
| Native Rust (multithreaded) | 1.3 s | 19x |
| Polars (single thread) | 9 s | 2.7x |
| Pandas | 24.8 s | |
Polars does not seem to offer a speedup over the standard library on a single thread, and I couldn't find a way to do a multithreaded apply… In this scenario, I'd prefer native Rust.
Merging
Merging in Polars
Merging in Polars is dead easy, although the number of strategies for filling `none` values is limited for now:
df = df
    .join(&df_wikipedia, "Tag1", "Language", JoinType::Left)?
    .fill_none(FillNoneStrategy::Min)?;
Merging in Native Rust
Merging in native Rust can be done with a nested structure, pairing records through a HashMap:
let hash_wikipedia: HashMap<&String, &utils::WikiDataFrame> = records_wikipedia
    .iter()
    .map(|record| (record.Language.as_ref().unwrap(), record))
    .collect();
records.iter_mut().for_each(|record| {
record.Wikipedia = match hash_wikipedia.get(&record.Tag1.as_ref().unwrap()) {
Some(wikipedia) => Some((*wikipedia).clone()),
None => None,
}
});
Performance
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 0.680 s | 6.3x |
| Native Rust (multithreaded) | 0.215 s | 20x |
| Polars | 0.543 s | 8x |
| Pandas | 4.347 s | |
For merging, handling a nested structure with `None` values can be very verbose in native Rust, so I recommend using Polars for merging.
Polars merging also seems to be multithreaded by default.
Groupby
Group By in Polars
Group by in Polars is pretty easy:
// Groupby series as a clone of reference
let groupby_series = vec![
df.column("OpenStatus")?.clone(),
];
let target_column = vec![
"ReputationAtPostCreation",
"OwnerUndeletedAnswerCountAtPostTime",
"Imperative",
"Object-oriented",
"Functional",
"Procedural",
"Generic",
"Reflective",
"Event-driven",
];
let groups = df
.groupby_with_series(groupby_series, false)?
.select(target_column)
.mean()?;
Group By in Native Rust
However, it is quite tricky in native Rust. To do a group by in a thread-safe manner, you'll need a HashMap combined with the `fold` method. Note that parallel folds are slightly more complicated, as folding requires passing data between threads.
let groups_hash: HashMap<String, (utils::GroupBy, i16)> = records
.iter() // .par_iter()
.fold(
HashMap::new(), // || HashMap::new()
|mut hash_group: HashMap<String, (utils::GroupBy, i16)>, record| {
let group: utils::GroupBy = if let Some(wiki) = &record.Wikipedia {
utils::GroupBy {
status: record.OpenStatus.as_ref().unwrap().to_string(),
ReputationAtPostCreation: record.ReputationAtPostCreation.unwrap(),
OwnerUndeletedAnswerCountAtPostTime: record
.OwnerUndeletedAnswerCountAtPostTime
.unwrap(),
Imperative: wiki.Imperative.unwrap(),
ObjectOriented: wiki.ObjectOriented.unwrap(),
Functional: wiki.Functional.unwrap(),
Procedural: wiki.Procedural.unwrap(),
Generic: wiki.Generic.unwrap(),
Reflective: wiki.Reflective.unwrap(),
EventDriven: wiki.EventDriven.unwrap(),
}
} else {
utils::GroupBy {
status: record.OpenStatus.as_ref().unwrap().to_string(),
ReputationAtPostCreation: record.ReputationAtPostCreation.unwrap(),
OwnerUndeletedAnswerCountAtPostTime: record
.OwnerUndeletedAnswerCountAtPostTime
.unwrap(),
..Default::default()
}
};
if let Some((previous, count)) = hash_group.get_mut(&group.status.to_string()) {
*previous = previous.clone() + group;
*count += 1;
} else {
hash_group.insert(group.status.to_string(), (group, 1));
};
hash_group
},
); // }
// .reduce(
// || HashMap::new(),
// |prev, other| {
// let set1: HashSet<String> = prev.keys().cloned().collect();
// let set2: HashSet<String> = other.keys().cloned().collect();
// let unions: HashSet<String> = set1.union(&set2).cloned().collect();
// let mut map = HashMap::new();
// for key in unions.iter() {
// map.insert(
// key.to_string(),
// match (prev.get(key), other.get(key)) {
// (Some((previous, count_prev)), Some((group, count_other))) => {
// (previous.clone() + group.clone(), count_prev + count_other)
// }
// (Some(previous), None) => previous.clone(),
// (None, Some(other)) => other.clone(),
// (None, None) => (utils::GroupBy::new(), 0),
// },
// );
// }
// map
// },
// );
let groups: Vec<utils::GroupBy> = groups_hash
.iter()
.map(|(_, (group, count))| utils::GroupBy {
status: group.status.to_string(),
ReputationAtPostCreation: group.ReputationAtPostCreation / count.clone() as f64,
OwnerUndeletedAnswerCountAtPostTime: group.OwnerUndeletedAnswerCountAtPostTime
/ count.clone() as f64,
Imperative: group.Imperative / count.clone() as f64,
ObjectOriented: group.ObjectOriented / count.clone() as f64,
Functional: group.Functional / count.clone() as f64,
Procedural: group.Procedural / count.clone() as f64,
Generic: group.Generic / count.clone() as f64,
Reflective: group.Reflective / count.clone() as f64,
EventDriven: group.EventDriven / count.clone() as f64,
})
.collect();
Uncomment the commented-out lines above for multithreading.
Performance
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 0.536 s | 2x |
| Native Rust (multithreaded) | 0.115 s | 9.5x |
| Polars (single thread) | 0.131 s | 8.3x |
| Polars (multithreaded) | 0.125 s | 8.8x |
| Pandas | 1.1 s | |
Group by and merging are the ideal cases for Polars. You get 8x the performance of Pandas on a single thread, and Polars handles multithreading, although in my case it didn't matter much.
Native Rust can do it as well but, judging by the size of the code, it is not an ideal use case.
Conclusion
Performance overall
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 24 s | 3.3x |
| Native Rust (multithreaded) | 13.7 s | 5.8x |
| Polars (single thread) | 30 s | 2.6x |
| Polars (multithreaded) | 17 s | 4.7x |
| Polars (lazy, multithreaded) | 16.5 s | 4.8x |
| Pandas | 80 s | |
As reading is IO-bound, I also wanted a benchmark of pure compute performance.
Performance without Reading
| | Time (s) | Speedup vs Pandas |
|---|---|---|
| Native Rust (single thread) | 12 s | 3.3x |
| Native Rust (multithreaded) | 1.7 s | 23x |
| Polars (single thread) | 10 s | 4x |
| Polars (multithreaded) | 11 s | 3.6x |
| Polars (lazy, multithreaded) | 11 s | 3.6x |
| Pandas | 40 s | |
Overall takeaway
- Use Polars if you want a great API.
- Use Polars for merging and group by.
- Use Polars for single-instruction-multiple-data (SIMD) operations.
- Use native Rust if you're already familiar with Rust's generic heap structures like vectors and hash maps.
- Use native Rust for linear mutation of the data with `map` and `fold`. You'll get O(n) scalability that can be parallelized almost instantly with `rayon` (see the sketch below).
- Use Pandas when performance, scalability, and memory usage do not matter.
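To illustrate that last `rayon` point, a minimal sketch, assuming a vector of text bodies like the BodyMarkdown column above; switching `iter` to `par_iter` is usually the whole change:

use rayon::prelude::*;

fn count_words_all(bodies: &[String]) -> Vec<usize> {
    // par_iter() spreads the map across all available threads
    bodies.par_iter().map(|body| body.split(' ').count()).collect()
}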
For me, both Polars and native Rust make a lot of sense for data between 1 GB and 1 TB.
I invite you to form your own opinion. The code is available here: https://github.com/haixuanTao/dataframe-python-rust
Pandas vs Rust
Introduction
Pandas is the main data analysis package of Python. For many reasons, native Python performs poorly on data analysis without the vectorization of NumPy and the likes. Historically, Pandas was created by Wes McKinney to package those optimisations in a nice API and facilitate data analysis in Python.
This, however, is not necessary for Rust. Rust has great data performance natively, which is why Rust doesn't really need a package like Pandas.
I believe the rustiest way to do data manipulation in Rust is to build a heap of data structs.
This is my experience and reasoning comparing Pandas vs Rust.
Data
Performance benchmarks are done on this fairly random dataset: https://www.kaggle.com/START-UMD/gtd, which offers around 160,000 lines and 130 columns for a total size of 150 MB. The size corresponds to the kind of dataset I regularly encounter, which is why I chose it. It isn't the biggest dataset in the world, and more studies should probably be done on a larger one.
The merge will be done with another dataset: https://datacatalog.worldbank.org/dataset/world-development-indicators, the WDICountry.csv file.
This blog was originally published on https://able.bio/haixuanTao/data-manipulation-pandas-vs-rust--1d70e7fc
Reading
Pandas
Reading and instantiating data in Pandas is pretty straightforward and handles many data-quality problems by default:
import pandas as pd
path = "/home/peter/Documents/TEST/RUST/terrorism/src/globalterrorismdb_0718dist.csv"
df = pd.read_csv(path)
Rust Reading CSV
For Rust, managing bad-quality data is very tedious. In this dataset, some fields are empty, some lines are badly formatted, and some are not UTF-8 encoded.
To open the CSV, I used the `csv` crate, but it does not solve all the issues listed above. With well-formatted data, reading can be done like so:
let path = "/home/peter/Documents/TEST/RUST/terrorism/src/foo.csv";
let mut rdr = csv::Reader::from_path(path).unwrap();
But with badly formatted data, I had to add additional parameters:
use std::fs::File;
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
// ...
let file = File::open(path)?;
let transcoded = DecodeReaderBytesBuilder::new()
.encoding(Some(WINDOWS_1252))
.build(file);
let mut rdr = csv::ReaderBuilder::new()
.delimiter(b',')
.from_reader(transcoded);
ref: https://stackoverflow.com/questions/53826986/how-to-read-a-non-utf8-encoded-csv-file
Rust Instantiating the data
To instantiate the data, I used Serde https://serde.rs/ for serializing and deserializing my data.
To use Serde, I needed to make a struct of my data. Having a struct of my data is great as it makes my code follow a model-based coding paradigm with a well-defined type for each field. It also enables me to implement traits and methods on top of them.
However, the data I wanted to use has 130 columns… and there seemed to be no way to generate the struct definition automatically.
To avoid writing the definition by hand, I built my own struct generator:
// `Record` here is a generic row: a map from column name to raw string value
type Record = HashMap<String, String>;

fn inspect(path: &str) {
    // Grab the first successfully parsed row as a sample
    let mut record: Record = HashMap::new();
    let mut rdr = csv::Reader::from_path(path).unwrap();
    for result in rdr.deserialize() {
        if let Ok(rec) = result {
            record = rec;
            break;
        }
    }
    // Print the struct definition, inferring each field's type from
    // whether the sample value parses as i64, f64, or neither
    println!("#[skip_serializing_none]");
    println!("#[derive(Debug, Deserialize, Serialize)]");
    println!("struct DataFrame {{");
    for (key, value) in &record {
        println!("    #[serialize_always]");
        if value.parse::<i64>().is_ok() {
            println!("    {}: Option<i64>,", key);
            continue;
        }
        if value.parse::<f64>().is_ok() {
            println!("    {}: Option<f64>,", key);
            continue;
        }
        println!("    {}: Option<String>,", key);
    }
    println!("}}");
}
This generated the struct as follows:
use serde::{Deserialize, Serialize};
use serde_with::skip_serializing_none;
#[skip_serializing_none]
#[derive(Debug, Clone, Deserialize, Serialize)]
struct DataFrame {
#[serialize_always]
individual: Option<f64>,
#[serialize_always]
natlty3_txt: Option<String>,
#[serialize_always]
ransom: Option<f64>,
#[serialize_always]
related: Option<String>,
#[serialize_always]
gsubname: Option<String>,
#[serialize_always]
claim2: Option<String>,
#[serialize_always]
// ...
- `skip_serializing_none`: avoids errors on empty fields in the CSV.
- `serialize_always`: keeps the number of fields fixed when writing CSV.
Now that I had my struct, I used Serde deserialization to populate a vector of structs:
let mut records: Vec<DataFrame> = Vec::new();
for result in rdr.deserialize() {
    match result {
        Ok(rec) => records.push(rec),
        Err(e) => println!("{}", e),
    };
}
This generated my vector of structs, hooray!
On a general note: with Rust, you shouldn't expect things to work as smoothly as they do with Python.
On reading/instantiating data, Pandas wins hands down for CSV.
Filtering
Pandas
There are many ways to do filtering in Pandas; the most common one for me is as follows:
df = df[df.country_txt == "United States"]
df.to_csv("python_output.csv")
Rust
To do filtering in Rust, we can refer to the docs for Rust vectors: https://doc.rust-lang.org/std/vec/struct.Vec.html
There is a large umbrella of methods for vector filtering, with many nightly features that will be great for data manipulation once they ship. For this use case, I used the `retain` method, as it fit my need perfectly:
records.retain(|x| x.country_txt.as_deref() == Some("United States"));
let mut wtr = csv::Writer::from_path("output_rust_filter.csv")?;
for record in &records {
    wtr.serialize(record)?;
}
One big difference between Pandas and Rust is that Rust filtering uses closures (the equivalent of lambda functions in Python), whereas Pandas filtering uses a column-based API. Rust can therefore express more complex filters than Pandas. It also helps readability.
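For instance, a predicate over several fields stays a single closure in Rust, where Pandas would combine several column operations. A hypothetical sketch on the same struct:

// Keep US rows that also have at least one casualty
records.retain(|x| {
    x.country_txt.as_deref() == Some("United States") && x.nkill.unwrap_or(0.) > 0.
});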
Performance
| | Time (s) | Mem usage (GB) |
|---|---|---|
| Pandas | 3.0 s | 2.5 GB |
| Rust | 1.6 s (-50%) | 1.7 GB (-32%) |
Even though we're using the idiomatic Pandas API for filtering, we get significantly better performance with Rust.
On filtering, Rust is more capable and faster.
Groupby
Pandas
Group bys are a big part of the data-reduction pipeline in Python; they usually go as follows:
df = df.groupby(by="country_txt", as_index=False).agg(
{"nkill": "sum", "individual": "mean", "eventid": "count"}
)
df.to_csv("python_output_groupby.csv")
Rust
For group by and data reduction, thanks to David Sanders, it can be done as follows:
use itertools::Itertools;
// ...
#[derive(Debug, Deserialize, Serialize)]
struct GroupBy {
country: String,
total_nkill: f64,
average_individual: f64,
count: f64,
}
// ...
let groups = records
.into_iter()
.sorted_unstable_by(|a, b| Ord::cmp(&a.country_txt, &b.country_txt))
.group_by(|record| record.country_txt.clone())
.into_iter()
.map(|(country, group)| {
let (total_nkill, count, average_individual) = group.into_iter().fold(
(0., 0., 0.),
|(total_nkill, count, average_individual), record| {
(
total_nkill + record.nkill.unwrap_or(0.),
count + 1.,
average_individual + record.individual.unwrap_or(0.),
)
},
);
GroupBy {
country: country.unwrap(),
total_nkill,
average_individual: average_individual / count,
count,
}
})
.collect::<Vec<_>>();
let mut wtr =
csv::Writer::from_path("output_rust_groupby.csv")
.unwrap();
for group in &groups {
wtr.serialize(group)?;
}
Although this solution is not as elegant as the Pandas group by, it gives a lot of flexibility in computing the reduced fields, again thanks to closures.
I think reduction methods beyond `sum` and `fold` would greatly improve the development experience of map-reduce-style operations in Rust. We would then probably have an equivalent experience between Rust and Pandas.
Performance
| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 2.78 s | 2.5 GB |
| Rust | 2.0 s (-35%) | 1.7 GB (-32%) |
Although the performance is better in Rust, I would advise using Pandas for map-reduce-heavy applications, as it seems more appropriate there.
Mutation
Pandas
There are many ways to do mutation in Pandas; I usually do the following for performance and functional style:
df["computed"] = df["nkill"].map(lambda x: (x - 10) / 2 + x ** 2 / 3)
df.to_csv("python_output_map.csv")
Rust
For mutation, Rust's functional `iter` really makes this part a walk in the park:
records.iter_mut().for_each(|x: &mut DataFrame| {
let nkill = match &x.nkill {
Some(nkill) => nkill,
None => &0.,
};
x.computed = Some((nkill - 10.) / 2. + nkill * nkill / 3.);
});
let mut wtr = csv::Writer::from_path(
"output_rust_map.csv",
)?;
for record in &records {
wtr.serialize(record)?;
}
Performance
| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 12.82 s | 4.7 GB |
| Rust | 1.58 s (-87%) | 1.7 GB (-64%) |
This is where the difference really appeared to me. Pandas does not scale for line-by-line lambda functions, and it would have been even worse had the operation involved several columns.
Rust is way better at line-by-line mutation, natively.
Merging
Python
Merging in Python is generally pretty efficient; it usually goes like this:
df_country = pd.read_csv(
"/home/peter/Documents/TEST/RUST/terrorism/src/WDICountry.csv"
)
df_merge = pd.merge(
df, df_country, left_on="country_txt", right_on="Short_Name"
)
df_merge.to_csv("python_output_merge.csv")
Rust
For Rust, however, this is a tricky part: with structs, merging isn't really a thing. To me, the rustiest way of doing a merge is to add a nested field containing the other struct we want to join data with.
I first created a new struct and a new heap for the new data:
#[skip_serializing_none]
#[derive(Clone, Debug, Deserialize, Serialize)]
struct DataFrameCountry {
#[serialize_always]
SNA_price_valuation: Option<String>,
#[serialize_always]
IMF_data_dissemination_standard: Option<String>,
#[serialize_always]
Latest_industrial_data: Option<String>,
#[serialize_always]
System_of_National_Accounts: Option<String>,
//...
// ...
let mut records_country: Vec<DataFrameCountry> = Vec::new();
let file = File::open(path_country)?;
let transcoded = DecodeReaderBytesBuilder::new()
.encoding(Some(WINDOWS_1252))
.build(file);
let mut rdr = csv::ReaderBuilder::new()
.delimiter(b',')
.from_reader(transcoded);
for result in rdr.deserialize() {
match result {
Ok(rec) => {
records_country.push(rec);
}
Err(e) => println!("{}", e),
};
}
I then joined the new records to the existing ones on a specific field that is unique, cloning the matching record into the nested field:
impl DataFrame {
fn add_country_ext(&mut self, country: Option<DataFrameCountry>) {
self.country_merge = Some(country)
}
}
//...
for country in records_country {
records
.iter_mut()
.filter(|record| record.country_txt == country.Short_Name)
.for_each(|x| {
x.add_country_ext(Some(country.clone()));
});
}
let mut wtr =
csv::Writer::from_path("output_rust_join.csv")
.unwrap();
for record in &records {
wtr.serialize(record)?;
}
I cloned the data for convenience and for better comparability, but a reference can be passed if you can manage the lifetimes.
And there we go!
Except that a nested struct is not yet serializable to CSV in Rust: https://github.com/BurntSushi/rust-csv/pull/197
So I had to adapt it to:
impl DataFrame {
fn add_country_ext(&mut self, country: Option<DataFrameCountry>) {
self.country_ext = Some(format!("{:?}", country))
}
}
But then we get a sort of merge!
Performance
| | Time (s) | Mem (GB) |
|---|---|---|
| Pandas | 22.47 s | 11.8 GB |
| Rust | 5.48 s (-75%) | 2.6 GB (-78%) |
Rust is capable of doing nested structs that are as capable, if not more so, than Pandas merges. However, it isn't really a one-to-one comparison, and it is going to depend on your use case.
Conclusion
After this experience, these are my takeaways:
- Use Pandas when you can: small CSVs (<1M lines), simple operations, data cleaning…
- Use Rust when you have: complex operations, memory-heavy or time-consuming pipelines, custom functions, scalable software…
That being said, Rust offers impressive flexibility compared to Pandas. Add the fact that Rust handles multithreading far better than Pandas, and I believe Rust can solve problems Pandas simply cannot.
Additionally, the possibility of running Rust on any platform (web, Android, or embedded) creates new opportunities for data manipulation in places inconceivable for Pandas, and can provide solutions for challenges yet to be solved.
Performance
The performance tables give an insight into what to expect from Rust. I believe the speedup goes from x2 at minimum up to x50 for large data pipelines. The decrease in memory use will be even greater, as memory usage accumulates over time with Python.
Scraping Python vs Rust
Introduction
Web scraping is about as error-prone as it gets: pages might not exist, HTML elements might not always be there… So a language that handles errors and edge cases well at runtime, without crashing, is a huge plus.
Performance
Performance test of scraping the 50 pages of http://books.toscrape.com/catalogue/page-1.html
| Name | CPU usage | Time (s) |
|---|---|---|
| Synchronous Python | 5% | 44.3 s |
| Synchronous Rust | 7% | 55 s |
| Async Python | 63% | 2.5 s |
| Async Rust | 107% | 2.25 s |
Performance is pretty similar at such a low number of requests: the time is spent downloading. With significantly more requests, a bigger difference would likely appear.
This blog was originally published on: https://able.bio/haixuanTao/web-scraper-python-vs-rust--d6176429
Synchronous Python code
import requests
import bs4 as bs
import csv
URL = "http://books.toscrape.com/catalogue/page-%d.html"
with open('./test_python.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for i in range(1, 50):
        response = requests.get(URL % i)
        if response.status_code == 200:
            content = response.content
            soup = bs.BeautifulSoup(content, 'lxml')
            articles = soup.find_all('article')
            for article in articles:
                information = []
                information.append(article.find('p', class_='price_color').text)
                information.append(article.find('h3').find('a').get('title'))
                spamwriter.writerow(information)
Synchronous Rust code:
use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);
    let document = Document::from(response.as_str());
    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();
        // println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Each page is awaited before the next starts, hence "synchronous"
    for i in 1..50 {
        test(&i).await.unwrap();
    }
    Ok(())
}
Asynchronous
During scraping, most of the time is spent downloading files rather than computing.
With a synchronous runtime, however, pages are scraped, and thus downloaded, one by one. Each download takes time and idles the whole process. If we manage not to wait for the completion of each download, we gain efficiency.
Python
It is possible using the asyncio library, and it might look like this:
import asyncio
import requests
import bs4 as bs
import csv
URL = "http://books.toscrape.com/catalogue/page-%d.html"
async def get_book(url, spamwriter):
    response = requests.get(url)
    if response.status_code == 200:
        content = response.content
        soup = bs.BeautifulSoup(content, 'lxml')
        articles = soup.find_all('article')
        for article in articles:
            information = [url]
            information.append(article.find('p', class_='price_color').text)
            information.append(article.find('h3').find('a').get('title'))
            spamwriter.writerow(information)

async def main():
    with open('./test_async_python.csv', 'w') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        tasks = []
        for i in range(1, 50):
            tasks.append(asyncio.create_task(get_book(URL % i, spamwriter)))
        for task in tasks:
            await task

asyncio.run(main())
Python provides the async/await keywords, which make the code easy to read and write. One caveat of the snippet above: `requests.get` is a blocking call, so for fully concurrent downloads an async HTTP client such as aiohttp is the usual choice.
Rust
Rust, unlike Python, was built with asynchronous computation in mind. It is thread-safe and extremely efficient, and the language being so fast by nature makes it great for coroutines. The code might look like this:
use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);
    let document = Document::from(response.as_str());
    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();
        println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Spawn one task per page so the downloads overlap
    let mut handles: Vec<_> = Vec::new();
    for i in 1..50 {
        let job = tokio::spawn(async move { test(&i).await });
        handles.push(job);
    }
    let mut results = Vec::new();
    for job in handles {
        results.push(job.await);
    }
    Ok(())
}
Productivity
This humble personal productivity cheatsheet is here to help others identify things that can increase their productivity.
Prezto
prezto's main idea is that the shell can be interactive.
Installation
sudo apt-get update
sudo apt-get install zsh
git clone --recursive https://github.com/sorin-ionescu/prezto.git "${ZDOTDIR:-$HOME}/.zprezto"
Features I use on a daily basis with prezto:
- auto-completion
- auto-suggestion
- docker completion
- git completion
More info: https://github.com/sorin-ionescu/prezto
FZF
fzf's main idea is that you should never have to know by heart strings that you can approximate.
Installation
git clone --depth 1 https://github.com/junegunn/fzf.git ~/.fzf
~/.fzf/install
Features I use on a daily basis with fzf:
- kill **
- ctrl+r
- ctrl+t
- cd **
- vim **
More info at: https://github.com/junegunn/fzf
VSCode
VSCode makes it really easy to put flexibility and automation in place.
Productivity shortcuts I use on a daily basis:
- ctrl+p: to open a file
- ctrl+`: to open the terminal
- ctrl+shift+p: to access extension functionalities
- ctrl+shift+p, then Open User Settings (JSON): for scripted settings
- ctrl+shift+p, then Snippets: for automating the generation of code
Vim (VSCode)
Vim's idea is to allow automation on-the-fly by giving a set of functionality to each key.
Vim has some great automation features that I just can't live without:
- /: search
- v: visual mode, to select a block of text
- :s: replace within a selection
- ctrl+z … fg: jump out of and back into vim
- .: repeat the previous command
Touchtyping
Touch typing's main idea is that you can type faster by moving your hands less.
And you can reduce hand movement by learning to make full use of all your fingers.
Features I use on a daily basis:
- Not having sore wrists.