## How to make your Pandas operation 100x faster | by Yifei Huang | Dec, 2020


## A practical guide to speeding up your Pandas code

Pandas is a great tool for exploring and working with data. As such, it is deliberately optimized for versatility and ease of use rather than performance. There are often many different ways to perform the exact same operation, some far more performant than others. This is not a problem if your data is relatively small or your transformation is relatively simple, but it can quickly become a bottleneck as scale and complexity increase. In this article, I will walk through a few different approaches to optimizing performance, ranging from vectorization to parallelization.

Let’s start by making this issue more concrete with an example.

A common operation when using Pandas is to create a new derived value based on one or more existing columns. The examples below (inspired by [1]) illustrate the various ways of doing this with Pandas, measuring the resulting performance with the `%timeit` magic command in Jupyter notebooks.
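For reference, here is a minimal setup consistent with the snippets that follow. The exact body of `sum_square` and the shape of `df` are assumptions on my part: the final NumPy expression `np.power(df[0] + df[1], 2)` implies that `sum_square(a, b)` computes the square of the sum, and the snippets index columns as `df[0]` and `df[1]`.

```python
import numpy as np
import pandas as pd

# Hypothetical transformation function implied by the later snippets:
# np.power(df[0] + df[1], 2) suggests it squares the sum of its arguments.
def sum_square(a, b):
    return (a + b) ** 2

# A toy dataframe with two numeric columns named 0 and 1,
# matching the df[0] / df[1] references in the examples.
df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 2)))
```

Outside Jupyter, `timeit.timeit` from the standard library plays the role of the `%timeit` magic.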

The first approach

`[sum_square(row[0], row[1]) for _, row in df.iterrows()]`

uses a list comprehension along with the `iterrows` method, and is the slowest by a long shot. This is because it is effectively a simple for loop *and* incurs the heavy overhead of constructing a Pandas Series object on every iteration. It is rarely necessary to use a Series object inside your transformation function, so you should almost never use the `iterrows` method.

The second approach

`[sum_square(a, b) for a, b in df[[0, 1]].itertuples(index=False)]`

replaces the `iterrows` method of the first approach with the `itertuples` method. The improvement in performance is a whopping 60x (575 ms -> 9.53 ms). This is because `itertuples` bypasses the overhead associated with the Pandas Series object and uses simple named tuples instead. If you need to loop through a Pandas dataframe, you should almost always use `itertuples` instead of `iterrows`.

The third approach

`df.apply(lambda row: sum_square(row[0], row[1]), axis=1)`

uses the `apply` method instead of a list comprehension to apply the function to each row of the dataframe. It still incurs the overhead of the Pandas Series object for each row, but it is 6x faster than the first approach because it leverages more efficient pre-compiled looping machinery under the hood. It is, however, still effectively a for loop, just a more efficient one.

The fourth approach

`df.apply(lambda row: sum_square(row[0], row[1]), raw=True, axis=1)`

achieves a 4x speed-up relative to the third approach with a very simple parameter tweak: adding `raw=True`. This tells the `apply` method to bypass the overhead associated with the Pandas Series object and pass each row as a plain NumPy array instead. It is interesting to note that this is still slower than the second approach with `itertuples`.

The fifth approach

`np.vectorize(sum_square)(df[0], df[1])`

uses `np.vectorize` to make the provided function behave like a vectorized NumPy function, leveraging pre-compiled native methods that can be SIMD (Single Instruction Multiple Data) accelerated and significantly faster [2]. Note, however, that `np.vectorize` is primarily a convenience wrapper: under the hood it still runs a Python-level loop, so the speed-up comes mainly from operating directly on the underlying arrays and avoiding the Pandas per-row overhead. Depending on the function, it may or may not actually yield a speed-up, but given its simplicity it never hurts to try. In this particular example, it worked quite well and yielded a 171x speed-up relative to the first approach. It is even 3x faster than the `itertuples` method in the second approach.
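`np.vectorize` is especially handy for functions with Python-level branching that do not map directly onto a single array expression. The function below is a hypothetical illustration of that pattern, not code from the article:

```python
import numpy as np
import pandas as pd

def clipped_sum_square(a, b):
    # Python-level branching: awkward as one array expression,
    # but handled transparently by np.vectorize.
    s = (a + b) ** 2
    return s if s < 50 else 50

df = pd.DataFrame({0: [1, 2, 5], 1: [3, 4, 5]})
result = np.vectorize(clipped_sum_square)(df[0], df[1])
print(result)  # element-wise results, with values capped at 50
```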

The last approach

`np.power(df[0] + df[1], 2)`

takes full advantage of vectorized native NumPy methods and is 1888x faster than the first approach we started with. As with `np.vectorize`, it may not always be possible (or clear how) to vectorize a given function, but when you can, you definitely should try. A good rule of thumb for assessing whether a function can be vectorized is whether you can represent it as linear algebra operations on matrices and scalar values. If you can, you can likely find native NumPy methods to implement those operations.

**What if your function cannot be vectorized or you don’t know how to?**

There are transformations of the data that simply cannot be vectorized, and therefore cannot benefit from the dramatic improvements that vectorization provided above (but you should still never use `iterrows`). One example of this type of transformation is regex operations on strings, such as extracting the top-level domain from URLs. In these cases, there are a few options: compiling your function with Cython, using a run-time compiler like Numba, or using parallel processing. I prefer parallel processing because it requires the least rewriting of your existing code; you simply add a few lines of wrapper code to make it work. The example below illustrates how you can do this.
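The article's original example appears to have been cut off here. As a hedged sketch of the general pattern (the regex, function names, and worker count are my own illustrative assumptions, not the author's code), the standard library's `multiprocessing.Pool` can split the dataframe into chunks and process them in parallel:

```python
import re
from multiprocessing import Pool

import numpy as np
import pandas as pd

# A regex-based transformation that cannot be vectorized:
# pull the registered domain out of a URL (simplified pattern).
DOMAIN_RE = re.compile(r"https?://(?:www\.)?([^/]+)/?")

def extract_domain(url):
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else ""

def process_chunk(chunk):
    # Runs inside a worker process on one slice of the dataframe.
    return chunk["url"].map(extract_domain)

def parallel_map(df, n_workers=4):
    # Split row positions into n_workers roughly equal chunks,
    # map each chunk in its own process, then stitch results back.
    chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n_workers)]
    with Pool(n_workers) as pool:
        return pd.concat(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    df = pd.DataFrame({"url": ["https://example.com/a", "http://www.test.org/b"]})
    print(parallel_map(df, n_workers=2))
```

Note that the per-chunk function must be defined at module level so it can be pickled and shipped to the worker processes.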
