klib offers fast and simple Python functions for efficient data preparation.
For those of you who want to follow along, let’s make sure you have access to the Kaggle API to download the data. For that, you need to create an API token in your Kaggle account settings and save it under ~/.kaggle/kaggle.json.
We download three datasets using the Kaggle API, unzip them and read the resulting .csv files into pd.DataFrames. We then hand them to klib.data_cleaning() using the default settings and obtain the cleaned DataFrames.
Alternatively, I encourage you to follow along using your own data! In this case, just read the data in and pass it to the data_cleaning() function.
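Putting these steps together, here is a minimal sketch for the fraud dataset. The Kaggle slug mlg-ulb/creditcardfraud and the file name creditcard.csv are my assumptions based on the public "Credit Card Fraud Detection" dataset; swap in whichever datasets you want to reproduce:

```python
import klib
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate using the token stored at ~/.kaggle/kaggle.json
api = KaggleApi()
api.authenticate()

# Download and unzip one dataset (slug and file name assumed, adjust as needed)
api.dataset_download_files("mlg-ulb/creditcardfraud", path="data", unzip=True)

# Read the .csv into a DataFrame and clean it with the default settings
df = pd.read_csv("data/creditcard.csv")
df_cleaned = klib.data_cleaning(df)
```

By default, data_cleaning() drops empty and duplicate rows and columns and converts the remaining columns to more memory-efficient datatypes, which is where the shape changes and size reductions below come from.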
The following table highlights the key differences between the original and the cleaned datasets. Even for datasets containing mostly numerical data, as is the case for the "fraud" dataset, the size already drops by about 50%. For mixed datatypes, which we find in the US pollution dataset, the savings are significantly higher: from more than 1.3 GB down to 200 MB!
| Dataset   | Shape (before) | Shape (after)  |
|-----------|----------------|----------------|
| Fraud     | (284807, 31)   | (283726, 31)   |
| Pollution | (1746661, 29)  | (1746661, 25)  |
| Hotel     | (515738, 17)   | (515212, 17)   |
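To check the size reduction on your own data, you can compare the deep memory footprint before and after cleaning. A minimal sketch, continuing from the code above (memory_mb is a helper of my own, not part of klib):

```python
def memory_mb(df):
    """Deep memory footprint of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024**2

print(f"Before cleaning: {memory_mb(df):.1f} MB")
print(f"After cleaning:  {memory_mb(df_cleaned):.1f} MB")
```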
Aside from the significant memory savings, these optimizations also reduce the computation time of anything that follows, such as transformations or queries on your DataFrame. These speedups are typically in a similar range to the memory savings.
A value that stands out is the slightly slower (17ms) call to the nlargest() method for the hotel dataset. While this may be explained by other tasks running on my laptop or CPU throttling, it could also be due to some numerical optimization working better on the original datatype.
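If you want to reproduce such timings yourself, here is a small sketch using the standard timeit module. The column name "adr" is a placeholder; substitute any numeric column from your data:

```python
import timeit

# Time the same query on the original and the cleaned DataFrame.
# "adr" is a placeholder column name; use any numeric column in your data.
t_before = timeit.timeit(lambda: df.nlargest(5, "adr"), number=100)
t_after = timeit.timeit(lambda: df_cleaned.nlargest(5, "adr"), number=100)

print(f"Original: {t_before / 100 * 1000:.2f} ms per call")
print(f"Cleaned:  {t_after / 100 * 1000:.2f} ms per call")
```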