How to avoid a Pandas pandemonium, Part II | by Pamela Wu | Nov, 2020

A deep dive into common Pandas mistakes and what you can do to avoid them. Part II: Speed-ups and memory optimizations


When you first start out using Pandas, it’s often best to just get your feet wet and deal with problems as they come up. Then, the years pass, the amazing things you’ve been able to build with it start to accumulate, but you have a vague inkling that you keep making the same kinds of mistakes and that your code is running really slowly for what seems like pretty simple operations. This is when it’s time to dig into the inner workings of Pandas and take your code to the next level. Like with any library, the best way to optimize your code is to understand what’s going on underneath the syntax.

In Part I, we went over writing clean code and spotting common silent failures. In Part II, we’re working on speeding up your runtime and lowering your memory footprint.

I also made a Jupyter notebook with the whole lesson, both parts included.

This is some code setup from the previous part, but I’m including it here so that this part can stand on its own.

# for neatness, it helps to keep all of your imports up top
import sys
import traceback
import numba
import numpy as np
import pandas as pd
import numpy.random as nr
import matplotlib.pyplot as plt

%matplotlib inline

# generate some fake data to play with
data = {
"day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] * 1000,
"booleans": [True, False] * 3500,
"positive_ints": nr.randint(0, 100, size=7000),
"mixed_ints": nr.randint(-100, 100, size=7000),
"lat1": nr.randn(7000) * 30,
"lon1": nr.randn(7000) * 30,
"lat2": nr.randn(7000) * 30,
"lon2": nr.randn(7000) * 30,
}

df_large = pd.DataFrame(data)

# let's copy this proactively; copy vs. view is covered in Part I
large_copy = df_large.copy()
df_large.head()

Now that you have great coding habits, it’s time to try to up the performance. There’s a range of things you can use, from vectorization to just-in-time compilation to get your code running faster. To measure bottlenecks and quantify performance gains, let’s introduce timeit, a nifty Jupyter notebook tool for performance measurement. All you need to know is that putting %timeit before a single line of code will measure the runtime of that line, while putting %%timeit in a code block will measure the runtime for the whole block.

For this example, I chose the haversine function because it looks kind of complicated but is actually quite amenable to optimization. I found this version on StackOverflow.

def haversine(lat1, lon1, lat2, lon2):
    """Haversine calculates the distance between two points on a sphere."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return c

%timeit haversine(100, -100, 50, -50)
> 14.3 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Three different ways of iterating through rows

You probably know that Pandas is built on top of Numpy to take advantage of optimizations available only in its underlying C code, while keeping an accessible Python interface. However, like anything with a foot in two worlds, you need to know where the border lies to take proper advantage of its dual nature. Depending on how you iterate through a DataFrame, you could either be taking full advantage of those optimizations or disregarding them entirely.

If you are completely new to Pandas and are just thinking of DataFrames as a wrapper for nested Python lists, then your first instinct may be to iterate through rows one at a time, such as by using df.iterrows().

%%timeit

# `iterrows` is a generator that yields indices and rows
dists = []
for i, r in large_copy.iterrows():
    dists.append(haversine(r["lat1"], r["lon1"], r["lat2"], r["lon2"]))
large_copy["spherical_dist"] = dists

> 736 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

736 ms per loop is about three-quarters of a second, and that really adds up: 100,000 runs of this loop would take you roughly 20 hours.

There's a somewhat more optimized way: the df.apply() method, which works on both DataFrame and Series objects. You define a custom function and pass it in, and in some cases Pandas can infer a faster way to process the columns.

large_copy["spherical_dist"] = large_copy.apply(
    lambda r: haversine(r["lat1"], r["lon1"], r["lat2"], r["lon2"]), axis=1
)

> 387 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is an improvement, but it's not amazing. What you really want is vectorization. In simple terms, that means feeding a whole Series of values into a Numpy-compatible function that would normally take one value at a time, so that the underlying C code is free to process the vector in fast compiled loops. Numpy functions can usually take a single value, a vector, or a matrix. If every step of the internal processing can act on either a single value or a vector of values, you get vectorization out of the box.

%timeit large_copy["spherical_dist"] = haversine(
large_copy["lat1"],
large_copy["lon1"],
large_copy["lat2"],
large_copy["lon2"]
)
> 2.17 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's a speedup of more than 300X over the iterrows version: 20 hours shrinks down to about 4 minutes. Note that I haven't changed the original function in any way. If you look back at the code, you can trace for yourself how the values can be either single values or a whole Series.
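You can verify this duality directly. Every NumPy ufunc accepts a scalar, an array, or a Series, which is why haversine needed no changes; a minimal check:

```python
import numpy as np
import pandas as pd

# np.sin, like every NumPy ufunc, accepts a scalar, an array, or a Series
scalar_out = np.sin(0.5)                    # scalar in, scalar out
array_out = np.sin(np.array([0.0, 0.5]))    # array in, array out
series_out = np.sin(pd.Series([0.0, 0.5]))  # Series in, Series out
print(type(scalar_out), type(array_out), type(series_out))
```

Because every operation in haversine (subtraction, np.sin, np.cos, np.arcsin, np.sqrt) has this property, the whole function inherits it.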

Some rules of thumb for iterating

Like with indexing, Pandas is flexible about how you want to go through the values of each row. The following are some rules of thumb:

  • If you want to apply the same transformation to each value of a column, you should use vectorization.
  • If you need conditional vectorization, use boolean indexing.
  • Vectorization also works on strings! e.g. Series.str.replace("remove_word", "")
  • Reserve apply for specific functions that can’t be broadcast, e.g. pd.to_datetime()
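The conditional-vectorization rule above can be sketched with boolean indexing; here is a minimal example (the column name mirrors the setup DataFrame, but the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"mixed_ints": [-5, 3, -2, 8]})

# conditional vectorization: act only on rows matching a boolean mask,
# instead of looping over rows with an if/else
mask = df["mixed_ints"] < 0
df.loc[mask, "mixed_ints"] = 0
print(df["mixed_ints"].tolist())  # [0, 3, 0, 8]
```

The mask is computed for the whole column at once, and the assignment touches only the matching rows, so no Python-level loop ever runs.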

Just-in-time compilation with Numba

What if you can’t vectorize? Does this mean you’re stuck with df.apply()? Not necessarily: if your code can be expressed as a combination of pure Python and Numpy arrays, you should give Numba a try and see if it can speed things up for you. Writing Numba is nothing like writing Cython, which feels like learning a whole new programming language if you only know Python. As long as your code can be expressed in pure Python and Numpy, it’s literally just putting a couple of decorators on top of your existing functions.

This example is based on some functions that calculate whether a given complex number is part of the Mandelbrot set and visualize the resulting fractal by trying every pixel coordinate. It was taken from the Numba docs.

def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    i = 0
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255

def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    return image

# `create_fractal` indexes `image[y, x]`, so it needs a small 2-D array
image = np.zeros((10, 15), dtype=np.uint8)
%timeit create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)

> 30.5 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here are the exact same functions but with the Numba decorators on top. I had to make the input data very, very small because timeit runs many iterations to measure the speed.

@numba.jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    i = 0
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255

@numba.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    return image

image = np.zeros((10, 15), dtype=np.uint8)
%timeit create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)

> 1.63 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s 18X the speed for about 5 seconds of typing. If your function requires some intricate for loops and conditions mixed together, it’s still worth taking a look to see if you can apply some Numba magic.

In case you were curious, applying create_fractal to a real, full-sized input of all zeroes draws the familiar Mandelbrot fractal.

Lowering your memory footprint

If you’ve used Pandas for a while, you’ve probably had a medium-large DataFrame crash your Python process at some point. What’s puzzling is when the DataFrame doesn’t really seem that large, or maybe you remember processing something even bigger last week on the same laptop, but that one didn’t crash. What gives?

The biggest offenders of memory problems in Pandas are probably:

  • You still have variables referencing objects you no longer need, so they never get garbage collected.
  • You have too many copies of DataFrames lying around.
  • You could stand to do more in-place operations, which don’t produce copies of your DataFrame.
  • object dtypes take up a lot more memory than fixed dtypes.

Garbage collection

Garbage collection is the process by which Python frees memory that is no longer useful to the program. You can release the memory held by an object by removing all references to it; once nothing refers to it, the object is flagged for release.

The best way to let garbage collection help you manage memory is to wrap whatever you can into functions. Variables declared in functions are scoped only to the function, so when the function finishes running, they get discarded. On the other hand, global variables (like large_copy) are kept around until the Python process ends (i.e. the notebook kernel is shut down). Even if you del a variable, that just decreases its reference count by 1, and if the count isn’t 0, the referenced object isn’t actually deleted. That’s why global variables can screw up what you think your memory is holding onto.
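A small sketch of why function scope helps: in CPython, a function-local object is freed as soon as the function returns, which a weak reference lets us observe (Big here is just a hypothetical stand-in for a large object such as a DataFrame):

```python
import weakref

class Big:
    """Hypothetical stand-in for a large object such as a DataFrame."""
    pass

def process():
    obj = Big()              # local variable: scoped to this function only
    return weakref.ref(obj)  # a weak reference doesn't keep `obj` alive

ref = process()              # `obj` is discarded when the function returns
print(ref() is None)         # True: the object has already been freed
```

Had obj been a global instead, the weak reference would still resolve, because the global name keeps the reference count above zero.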

Just for fun, you can peek into what a variable’s reference count is by using sys.getrefcount(var_name).

# `foo` is a reference
foo = []
sys.getrefcount(foo)  # the function argument itself temporarily bumps this up to 2

> 2

# a self-reference bumps it up again
foo.append(foo)
sys.getrefcount(foo)

> 3

At this point, del foo alone won’t free the list: it holds a reference to itself, so its count can never drop to zero, and Python’s cyclic garbage collector has to clean it up.
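You can watch this happen. A list that contains itself survives del and only goes away when the cyclic collector runs (the list subclass below exists only because plain lists don't support weak references):

```python
import gc
import weakref

class TrackedList(list):
    """Subclass only so we can attach a weak reference; plain lists can't."""

lst = TrackedList()
lst.append(lst)          # self-reference: the count can never reach zero via del
wr = weakref.ref(lst)

del lst                  # removes the name, but the cycle keeps the object alive
print(wr() is not None)  # True: still in memory

gc.collect()             # the cyclic collector breaks the reference cycle
print(wr() is None)      # True: finally freed
```

In a long-running notebook full of globals, calling gc.collect() after deleting big objects is a cheap way to make sure cycles like this actually get released.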

object dtypes take up a lot of memory

It’s those pesky object dtypes again! Not surprisingly, telling Pandas that you need to be able to store literally anything at any time means each entry becomes a pointer to a full Python object, with all of that object’s overhead. This is fine if you’re storing something genuinely complex, but if your data could easily be represented more simply, you might want to see if you can change the dtype to something leaner.
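The per-element cost is easy to see; here the same three integers are stored once as full Python objects and once as a flat 8-byte buffer:

```python
import numpy as np
import pandas as pd

obj_col = pd.Series([1, 2, 3], dtype=object)    # pointers to full Python ints
int_col = pd.Series([1, 2, 3], dtype=np.int64)  # flat buffer, 8 bytes each
print(obj_col.memory_usage(deep=True))
print(int_col.memory_usage(deep=True))
```

With only three elements the gap is small, but it scales linearly with row count, since every object element carries its own Python header on top of the pointer.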

It’s actually easier than you may think to inspect how much memory your DataFrames are occupying. It’s already built into the DataFrame object.

large_copy.info(memory_usage="deep")

> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int64
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int64(2), object(1)
memory usage: 773.5 KB

A common practice when conserving memory is downcasting. For example, if you know your integers don’t need 64-bits, cast them down to 32-bits. But, as we’ll see, not all downcasting is equally impactful.
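Rather than picking the target width by hand, pd.to_numeric can downcast to the smallest type that fits the data; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, dtype=np.int64))
small = pd.to_numeric(s, downcast="integer")  # smallest signed int that fits

print(s.dtype, s.memory_usage(deep=True))
print(small.dtype, small.memory_usage(deep=True))
```

Values 0 through 99 fit in a single byte, so the column drops from 8 bytes per element to 1.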

large_copy["positive_ints"] = large_copy["positive_ints"].astype(np.int32)
large_copy.info(memory_usage="deep")

> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int32(1), int64(1), object(1)
memory usage: 746.2 KB

A 3% reduction isn’t really remarkable. If you have string columns stored as the object type, they’ll always overshadow floats and integers in memory gluttony.

All str values are stored as object in Pandas because they can be any length. You can downcast string columns to a fixed-length bytes type. For example, this one limits each entry to 10 characters:

large_copy["day_of_week"] = large_copy["day_of_week"].astype("|S10")
large_copy.info(memory_usage="deep")

> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int32(1), int64(1), object(1)
memory usage: 636.8 KB
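One caveat with fixed-length types: anything longer than the declared width is silently truncated, as a quick NumPy check shows (the words here are just examples):

```python
import numpy as np

# a value longer than the fixed 10-byte width is silently cut off
arr = np.array(["Wednesday", "International"], dtype="|S10")
print(arr)  # [b'Wednesday' b'Internatio']
```

The day names in this DataFrame all fit in 10 characters, so the cast above is safe, but it's worth checking max string length before downcasting a real column.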

An 18% reduction isn’t bad, but is there more we can do? You happen to know that day_of_week can only ever hold 7 distinct values. Pandas has a category dtype that takes advantage of columns composed of a small set of repeating elements.

large_copy["day_of_week"] = large_copy["day_of_week"].astype("category")
large_copy.info(memory_usage="deep")

> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null category
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), category(1), float64(4), int32(1), int64(1)
memory usage: 315.2 KB

That’s a 59% reduction, just from one column alone!
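The savings come from how category is stored: the unique values are kept once, and each row holds only a small integer code pointing into them. A minimal sketch with made-up data:

```python
import pandas as pd

days = pd.Series(["Monday", "Tuesday", "Wednesday"] * 1000)
as_cat = days.astype("category")

print(days.memory_usage(deep=True))    # every row holds a full Python string
print(as_cat.memory_usage(deep=True))  # rows hold tiny codes into 3 categories
print(as_cat.cat.codes.dtype)          # int8
```

With only 3 categories, each row costs 1 byte instead of a pointer plus a string object, which is where the big reduction comes from.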

I’m almost ashamed to admit it took me years of using Pandas before I started to look up optimizations, and even today, it’s not like I go around optimizing every single notebook I make. However, with a few simple tricks, you can see how the difference of a couple of lines of code or even just how you handle a single column could result in speedups of 10–100X and memory decreases of greater than 50%. It’s definitely a good set of tricks to have up your sleeves.

If you missed it, check out Part I: Writing good code and spotting silent failures.
