Merging

Python

Merging in python is pretty efficient generally speaking, it goes like this in general:

df_country = pd.read_csv( "/home/peter/Documents/TEST/RUST/terrorism/src/WDICountry.csv" ) df_merge = pd.merge( df, df_country, left_on="country_txt", right_on="Short_Name" ) df_merge.to_csv("python_output_merge.csv")

Rust

For Rust, however, this is a tricky part as, with Struct, merging isn’t really a thing. For me, the rustiest way of doing a merge is by adding a nested field containing the other struct we want to join data with.

I first created a new struct and a new heap for the new data:

#[skip_serializing_none] #[derive(Clone, Debug, Deserialize, Serialize)] struct DataFrameCountry { #[serialize_always] SNA_price_valuation: Option<String>, #[serialize_always] IMF_data_dissemination_standard: Option<String>, #[serialize_always] Latest_industrial_data: Option<String>, #[serialize_always] System_of_National_Accounts: Option<String>, //... // ... let mut records_country: Vec<DataFrameCountry> = Vec::new(); let file = File::open(path_country)?; let transcoded = DecodeReaderBytesBuilder::new() .encoding(Some(WINDOWS_1252)) .build(file); let mut rdr = csv::ReaderBuilder::new() .delimiter(b',') .from_reader(transcoded); for result in rdr.deserialize() { match result { Ok(rec) => { records_country.push(rec); } Err(e) => println!("{}", e), }; }

I then cloned this new struct with the previous struct on a specific field that is unique.

impl DataFrame { fn add_country_ext(&mut self, country: Option<DataFrameCountry>) { self.country_merge = Some(country) } } //... for country in records_country { records .iter_mut() .filter(|record| record.country_txt == country.Short_Name) .for_each(|x| { x.add_country_ext(Some(country.clone())); }); } let mut wtr = csv::Writer::from_path("output_rust_join.csv") .unwrap(); for record in &records { wtr.serialize(record)?; }

I cloned the data for convenience and also for better comparability, but a reference can be passed if you can manage it.

And there we go! 🚀

Except, a nested struct is not yet serializable in CSV for Rust -> https://github.com/BurntSushi/rust-csv/pull/197

So I had to adapt it to:

impl DataFrame { fn add_country_ext(&mut self, country: Option<DataFrameCountry>) { self.country_ext = Some(format!("{:?}", country)) } }

But, then, we got a sort of merge! 🚀

Performance

Time(s)Mem(Gb)
Pandas22.47s11.8Gb
Rust5.48s🔥 -75%2.6Gb🔥 -78%

Rust is capable of doing nested structs that are going to be as capable if not more capable than Pandas merges. However, it isn’t really a one to one comparison and in this case, it is going to depend on your use case.