Reading
Pandas
Reading and instantiating data in Pandas is pretty straightforward, and it handles many data quality problems by default:
import pandas as pd
path = "/home/peter/Documents/TEST/RUST/terrorism/src/globalterrorismdb_0718dist.csv"
df = pd.read_csv(path)
Rust Reading CSV
In Rust, managing poor-quality data is very tedious. In this dataset, some fields are empty, some lines are badly formatted, and some are not UTF-8 encoded.
To open the CSV, I used the csv crate, but it does not solve all the issues listed above. With well-formatted data, reading can be done like so:
let path = "/home/peter/Documents/TEST/RUST/terrorism/src/foo.csv";
let mut rdr = csv::Reader::from_path(path).unwrap();
But with badly formatted data, I had to transcode the file first and pass additional parameters to the reader:
use std::fs::File;
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
// ...
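let file = File::open(path)?;
// Transcode the Windows-1252 bytes to UTF-8 on the fly while reading.
let transcoded = DecodeReaderBytesBuilder::new()
    .encoding(Some(WINDOWS_1252))
    .build(file);
// Build the CSV reader on top of the transcoded stream.
let mut rdr = csv::ReaderBuilder::new()
    .delimiter(b',')
    .from_reader(transcoded);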
ref: https://stackoverflow.com/questions/53826986/how-to-read-a-non-utf8-encoded-csv-file
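To make the transcoding step less magical, here is a minimal std-only sketch of what decoding a non-UTF-8 field involves. It is an illustration under stated assumptions, not the code used above: it decodes ISO-8859-1 (Latin-1) rather than Windows-1252, because in Latin-1 every byte maps directly to the Unicode code point of the same value (the two encodings differ only in the 0x80 to 0x9F range). The names decode_latin1 and decode_field are hypothetical.

```rust
/// Decode ISO-8859-1 (Latin-1): each byte is the Unicode code point of the same value.
/// Hypothetical helper for illustration; Windows-1252 would need a lookup table
/// for the bytes 0x80..=0x9F.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Try strict UTF-8 first and fall back to Latin-1 when the bytes are not valid UTF-8.
fn decode_field(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => decode_latin1(bytes),
    }
}
```

For example, the bytes 42 6F 67 6F 74 E1 are rejected by std::str::from_utf8, while decode_field turns them into "Bogotá". A real file is better served by encoding_rs, which covers the full Windows-1252 table.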
Rust Instantiating the data
To instantiate the data, I used Serde (https://serde.rs/) to serialize and deserialize it.
To use Serde, I needed to define a struct for my data. Having a struct is great, as it makes my code follow a model-based coding paradigm with a well-defined type for each field. It also lets me implement traits and methods on top of it.
However, the data I wanted to use has 130 columns… and there seemed to be no way to generate the struct definition automatically.
To avoid writing the definition by hand, I had to build my own struct generator:
use std::collections::HashMap;

// Record maps each column name to its raw string value.
type Record = HashMap<String, String>;

fn inspect(path: &str) {
    let mut record: Record = HashMap::new();
    let mut rdr = csv::Reader::from_path(path).unwrap();
    // Keep the first record that deserializes cleanly as the sample row.
    for result in rdr.deserialize() {
        if let Ok(rec) = result {
            record = rec;
            break;
        }
    }
    // Print the struct, guessing each field type from the sample value.
    println!("#[skip_serializing_none]");
    println!("#[derive(Debug, Deserialize, Serialize)]");
    println!("struct DataFrame {{");
    for (key, value) in &record {
        println!("    #[serialize_always]");
        // Try the narrowest type first: integer, then float, then string.
        let field_type = if value.parse::<i64>().is_ok() {
            "i64"
        } else if value.parse::<f64>().is_ok() {
            "f64"
        } else {
            "String"
        };
        println!("    {}: Option<{}>,", key, field_type);
    }
    println!("}}");
}
This generated the struct as follows:
use serde::{Deserialize, Serialize};
use serde_with::skip_serializing_none;
#[skip_serializing_none]
#[derive(Debug, Clone, Deserialize, Serialize)]
struct DataFrame {
    #[serialize_always]
    individual: Option<f64>,
    #[serialize_always]
    natlty3_txt: Option<String>,
    #[serialize_always]
    ransom: Option<f64>,
    #[serialize_always]
    related: Option<String>,
    #[serialize_always]
    gsubname: Option<String>,
    #[serialize_always]
    claim2: Option<String>,
    #[serialize_always]
    // ...
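skip_serializing_none: Skips Option fields that are None when serializing. Empty fields in the CSV are read as None because every field is an Option.
serialize_always: Opts a field back out of that skipping, so the number of fields stays fixed when writing CSV.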
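The type-guessing part of the generator can be isolated and tested on its own. Below is a minimal std-only sketch under a few assumptions: the names infer_type and render_struct are hypothetical, and headers and sample values are passed as slices so the fields keep the column order of the file (iterating over a HashMap, as inspect does, yields an arbitrary order).

```rust
/// Guess a Rust type name for one raw CSV value, trying the narrowest type first.
/// An empty value parses as neither number type, so it falls back to "String".
fn infer_type(value: &str) -> &'static str {
    if value.parse::<i64>().is_ok() {
        "i64"
    } else if value.parse::<f64>().is_ok() {
        "f64"
    } else {
        "String"
    }
}

/// Render a struct definition from parallel slices of column names and sample values.
/// Fields are emitted in the order of `headers`.
fn render_struct(name: &str, headers: &[&str], sample: &[&str]) -> String {
    let mut out = String::new();
    out.push_str("#[derive(Debug, Deserialize, Serialize)]\n");
    out.push_str(&format!("struct {} {{\n", name));
    for (header, value) in headers.iter().zip(sample.iter()) {
        out.push_str(&format!("    {}: Option<{}>,\n", header, infer_type(value)));
    }
    out.push_str("}\n");
    out
}
```

With made-up columns, render_struct("DataFrame", &["year", "latitude", "city"], &["1970", "18.45", ""]) produces the fields Option<i64>, Option<f64> and Option<String> in that order. The empty city value shows the limit of sampling a single row: a column that happens to be empty in the sample is typed as a String even if it holds numbers elsewhere.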
Now that I had my struct, I used Serde deserialization to populate a vector of structs:
let mut records: Vec<DataFrame> = Vec::new();
for result in rdr.deserialize() {
    match result {
        Ok(rec) => {
            records.push(rec);
        }
        Err(e) => println!("{}", e),
    };
}
This generated my vector of structs, hooray 🎉
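With data this messy, it can help to keep the rejected rows around instead of only printing them. Here is a minimal std-only sketch of that variation on the loop above: collect_valid is a hypothetical helper that accepts any iterator of Result values, which is the shape of what rdr.deserialize() yields.

```rust
use std::fmt::Display;

/// Split an iterator of results into the parsed records and a list of error messages.
/// Each message carries the zero-based position of the failing item.
fn collect_valid<T, E, I>(results: I) -> (Vec<T>, Vec<String>)
where
    E: Display,
    I: IntoIterator<Item = Result<T, E>>,
{
    let mut records = Vec::new();
    let mut errors = Vec::new();
    for (index, result) in results.into_iter().enumerate() {
        match result {
            Ok(rec) => records.push(rec),
            // Keep the position so the bad row can be inspected later.
            Err(e) => errors.push(format!("record {}: {}", index, e)),
        }
    }
    (records, errors)
}
```

Assuming the reader and struct from above, a call could look like let (records, errors): (Vec<DataFrame>, Vec<String>) = collect_valid(rdr.deserialize()); and comparing errors.len() against records.len() gives a quick measure of how much of the file was dropped.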
On a general note with Rust, you shouldn’t expect things to work as smoothly as they would with Python.
On reading / instantiating data, Pandas wins hands down for CSV.