How to make your Pandas operation 100x faster | by Yifei Huang | Dec, 2020
A practical guide to speeding up your Pandas code
Pandas is a great tool for exploring and working with data. As such, it is deliberately optimized for versatility and ease of use rather than performance. There are often many different ways to do the exact same operation, some of which are far more performant than others. This is not a problem when your data is relatively small or your transformation is relatively simple, but it can quickly become a bottleneck as scale and complexity increase. In this article, I will walk through a few different approaches to optimizing performance, ranging from vectorization to parallelization.
Let’s start by making this issue more concrete with an example.
A common operation when using Pandas is to create a new derived value based on one or more existing columns. The examples below (inspired by ) illustrate various ways of doing this with Pandas, measuring the resulting performance using the %timeit magic command in Jupyter notebooks.
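The benchmark setup itself is not reproduced in the text; a minimal sketch of what it presumably looked like, assuming sum_square(a, b) = (a + b)² and a two-column dataframe of random integers (the actual function and data sizes are my assumptions), is:

```python
import numpy as np
import pandas as pd

# a deliberately simple derived-value function used throughout the benchmarks
def sum_square(a, b):
    return (a + b) ** 2

# a two-column dataframe of random integers; columns are named 0 and 1,
# matching the df[[0, 1]] indexing used in the examples below
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)))
```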
The first approach
[sum_square(row, row) for _, row in df.iterrows()]
uses list comprehension along with the iterrows method, and is the slowest by a long shot. This is because it is effectively a simple for loop that incurs the heavy overhead of constructing a Pandas series object in each iteration. It is rarely necessary to use the series object in your transformation function, so you should almost never use the iterrows method.
The second approach
[sum_square(a, b) for a, b in df[[0, 1]].itertuples(index=False)]
replaces the iterrows method of the first approach with the itertuples method. The improvement in performance is a whopping 60x (575 ms -> 9.53 ms). This is because the itertuples method bypasses the overhead associated with the Pandas series object and uses simple tuples instead. If you need to loop through a Pandas dataframe, you should almost always use itertuples instead of iterrows.
The third approach
df.apply(lambda row: sum_square(row, row), axis=1)
uses the apply method instead of list comprehension to apply the function to each row of the dataframe. It still incurs the overhead of the Pandas series object when working with each row, but is 6x faster than the first approach because it is able to leverage more efficient pre-compiled looping methods under the hood. It is, however, still effectively a for loop, just a more efficient one.
The fourth approach
df.apply(lambda row: sum_square(row, row), raw=True, axis=1)
is able to achieve a 4x speed up relative to the third approach with a very simple parameter tweak: adding raw=True. This tells the apply method to bypass the overhead associated with the Pandas series object and pass simple Numpy ndarrays instead. It is interesting to note that this is still slower than the second approach with itertuples.
The fifth approach
uses np.vectorize to attempt to make the provided function more "vectorized". The goal of vectorization is to leverage, as much as possible, pre-compiled native methods that are SIMD (Single Instruction, Multiple Data) and significantly faster. Depending on the function, np.vectorize may or may not actually yield a speed up (under the hood it is essentially a loop rather than true SIMD compilation), but it rarely hurts to try given the relative simplicity. In this particular example, it worked quite well and yielded a 171x speed up relative to the first approach. It is even 3x faster than the itertuples method in the second approach.
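The code for this fifth approach is not reproduced in the text above; it was presumably along these lines (sum_square and df redefined here for self-containment, as assumed in the setup):

```python
import numpy as np
import pandas as pd

def sum_square(a, b):
    return (a + b) ** 2

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)))

# np.vectorize wraps the scalar function so it can be called on whole columns;
# note it is a convenience loop under the hood, not true SIMD compilation
result = np.vectorize(sum_square)(df[0], df[1])
```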
The last approach
np.power(df + df, 2)
fully takes advantage of vectorized native Numpy methods and is 1888x faster than the first approach we started with. Just like the np.vectorize method, it may not always be possible (or clear how) to vectorize a given function, but when you can, you definitely should. A good rule of thumb for assessing whether you can vectorize a function is whether you can represent it as linear algebra operations on matrices and scalar values. If you can, you can likely find native Numpy methods to vectorize those operations.
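To make that rule of thumb concrete, here is a small illustration of my own (not from the original benchmarks): a row-wise conditional discount written once with apply and once as pure array operations via np.where:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 40.0], "qty": [3, 1, 2]})

# row-wise version: effectively a Python-level for loop over rows
slow = df.apply(
    lambda r: r["price"] * r["qty"] * (0.9 if r["qty"] >= 2 else 1.0), axis=1
)

# vectorized version: the same logic expressed as whole-column operations
fast = df["price"] * df["qty"] * np.where(df["qty"] >= 2, 0.9, 1.0)
```

The branch on qty becomes an elementwise np.where, and the multiplications become column arithmetic, which is exactly the "linear algebra on matrices and scalars" shape the rule of thumb describes.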
What if your function cannot be vectorized or you don’t know how to?
There are transformations of the data that simply cannot be vectorized, and therefore cannot benefit from the dramatic improvements vectorization provided above (but you should still never use iterrows). One example of this type of transformation is regex operations on strings, like extracting the top-level domain from URLs. In these cases, there are a few different approaches: compiling your function using Cython or a run-time compiler like Numba, or using parallel processing. I prefer parallel processing because it requires the least rewriting of your existing code; you simply have to add a few lines of wrapper code to make it work. The example below illustrates how you can do this.
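The original example code is not included above; a minimal sketch of the wrapper pattern, using the standard-library multiprocessing module and a stand-in row-wise function (both slow_transform and parallel_apply are my assumed names), might look like this:

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def slow_transform(chunk: pd.DataFrame) -> pd.Series:
    # stand-in for a transformation that cannot be vectorized
    return chunk.apply(lambda row: (row[0] + row[1]) ** 2, raw=True, axis=1)

def parallel_apply(df: pd.DataFrame, func, n_cores: int = 4) -> pd.Series:
    # split the dataframe into contiguous chunks, one per worker process
    bounds = np.linspace(0, len(df), n_cores + 1, dtype=int)
    chunks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n_cores)]
    with Pool(n_cores) as pool:
        # each chunk is sent to its own process, then stitched back together
        results = pool.map(func, chunks)
    return pd.concat(results)
```

The only real requirement is that func is picklable (i.e. defined at module level); the serial transformation itself is unchanged.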
As you can see, with relatively little effort and no modification to the existing code, we were able to achieve a 7x speed up by leveraging parallel processing. It is worth noting that this improvement grows with data scale and function run time, as can be seen in the graph below, where I varied the simulated function run time from 0 to 0.5 ms: the serial implementation's processing time scales at a much higher rate than the parallel implementation's.
A few important considerations to keep in mind when using parallel processing:
- Overhead: parallel processing incurs quite a bit of overhead, so it is not guaranteed to be faster than serial processing if your function is relatively fast and your data is not especially large. For the example above (10k rows), parallel processing overtakes serial processing at around 0.01 ms of function run time.
- CPU-bound vs I/O-bound: if your function is CPU-bound, meaning the transformation logic itself takes a long time and is not waiting for data, then you should use multiprocessing (or a process pool). If your function is I/O-bound, meaning it spends a lot of time waiting for data (e.g. making API requests over the internet), then multithreading (or a thread pool) will be the better and faster option. This article provides a really good discussion on the differences between the two.
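For the I/O-bound case, a thread-pool sketch could look like the following, with a hypothetical fetch_status function standing in for a real network call:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fetch_status(url: str) -> str:
    # hypothetical I/O-bound call; a real version might use
    # requests.get(url).status_code instead of this placeholder
    return "ok"

urls = pd.Series(["https://a.example", "https://b.example", "https://c.example"])

# threads overlap the waiting time of I/O-bound calls, with no pickling overhead
with ThreadPoolExecutor(max_workers=8) as pool:
    statuses = pd.Series(list(pool.map(fetch_status, urls)), index=urls.index)
```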
- Whenever possible, write your transformation functions using native pre-compiled SIMD methods (most Numpy and many Pandas native methods are)
- If you are not sure how to vectorize yourself, try np.vectorize, but your mileage may vary
- Use df.itertuples for looping; never use df.iterrows unless you know exactly why you must use it
- If your data is large and your transformation is slow and cannot be vectorized easily, parallel processing is an easy way to boost performance