## Using regression to rank the most important features for determining the price of a house

So you may have heard the old adage “Location, location, location” when it comes to real estate. I know I heard it more than a few times growing up with a father who was a realtor for a few years. The phrase is often attributed to Harold Samuel, who is said to have coined it in 1944, and it simply means that the location of a home greatly influences its price. If you’ve ever seen an episode of House Hunters then you might have noticed that a home closer to a downtown hub might cost more than a home in the suburbs outside of the city. Certainly, we could make the argument that the habitual repetition of this phrase has led us to conclude that location is the primary factor for determining the price of a house.

But what if location isn’t the most important thing? What if we should be saying “square footage, square footage, square footage” or “bathrooms, bathrooms, bathrooms”? As it turns out, we can use regression to find out which features are most influential in determining the price of a house. I’ll be using the King County, Washington housing dataset to investigate and answer this question.

First, let’s go over some basics about regression. The simplest form of regression is something you have likely already seen before: y = mx + b. Let’s break down this formula: y is the target variable, x is the input variable, m is the slope and b is the y-intercept. When applied to our housing problem, y is the price of the house and x could be some feature of the house such as square footage. If we continue with this example then we would need to find some number for the slope m and y-intercept b such that multiplying m by square footage and adding b would give us the price of the house. How do we go about choosing a suitable m and b? We simply start by taking a guess and plugging the guesses for those values into the formula to get a price. The difference between our calculated price and the actual price is the error. We can then try the equation with another house to calculate its price and measure the error. We can then update the values for m and b to try to minimize the error in our calculated housing prices. Oftentimes it is the sum of squared errors that we want to minimize in order to find the line of best fit. Check out this Wikipedia article to learn more about the sum of squared errors!
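To make the idea of an error-minimizing line concrete, here is a minimal sketch in plain Python. The five houses and their prices are made up for illustration, and instead of the guess-and-update loop described above it uses the closed-form least squares solution for a single input, which lands on the same line that minimizes the sum of squared errors.

```python
# Hypothetical data: square footage and sale price for five made-up houses.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [250_000, 340_000, 410_000, 520_000, 580_000]


def sse(m, b, xs, ys):
    """Sum of squared errors for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b


# A first guess: $200 per square foot and no intercept.
guess_error = sse(200, 0, sqft, price)

# The least squares fit.
m, b = fit_line(sqft, price)
fit_error = sse(m, b, sqft, price)

print(f"guess: SSE = {guess_error:,.0f}")
print(f"fit:   m = {m:.1f}, b = {b:,.0f}, SSE = {fit_error:,.0f}")
```

The fitted line has a much smaller sum of squared errors than the initial guess, which is exactly what "line of best fit" means here.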

Now we obviously don’t want to use just two examples to define our line of best fit. We also don’t want to rely on square footage alone because we know that there are a number of factors that influence price. Since we want to incorporate multiple features, we need to use multiple linear regression. The formula for multiple linear regression is simply an extension of simple linear regression.
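Keeping the same m and b notation, and using subscripts (my own labeling) to number the n input features, that extension is commonly written as:

$$
y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n
$$

Here y is still the price, each x_i is one feature of the house (square footage, number of bathrooms, and so on), and each m_i is the slope attached to that feature.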

We can see that instead of one slope m and one input x, we have multiple slopes and multiple inputs. Sometimes these slopes are referred to as weights, and by examining the magnitude of these weights we can start to understand which features are the most important. The greater the magnitude of the weight for a given variable, the greater the influence that variable has on the overall price of the house, provided the variables are measured on comparable scales (for example, after standardizing them), since a raw weight also depends on the units of its variable.
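As a rough illustration of ranking features by weight, the sketch below fits a multiple linear regression with NumPy on synthetic data. The feature names, the numbers used to generate the data, and the helper `fit_weights` are all assumptions made for this example, not the King County dataset or the model from this project. It also shows why putting features on a common scale matters before comparing weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features (illustrative only): square footage, bathrooms,
# and distance to Seattle Center in miles.
names = ["sqft", "bathrooms", "dist_seattle_center"]
X = np.column_stack([
    rng.normal(2000, 500, n),
    rng.normal(2, 0.5, n),
    rng.normal(10, 4, n),
])
price = (150 * X[:, 0] + 20_000 * X[:, 1] - 8_000 * X[:, 2]
         + rng.normal(0, 10_000, n))


def fit_weights(features, target):
    """Ordinary least squares; returns the weights without the intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[1:]


# Weights on the raw features depend on each feature's units.
raw_weights = fit_weights(X, price)
raw_ranking = [names[i] for i in np.argsort(-np.abs(raw_weights))]

# Standardize each feature (mean 0, standard deviation 1) before fitting,
# so that the weights are comparable with each other.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_weights = fit_weights(X_std, price)
std_ranking = [names[i] for i in np.argsort(-np.abs(std_weights))]

print("ranking by raw weights:         ", raw_ranking)
print("ranking by standardized weights:", std_ranking)
```

On the raw features, bathrooms has the largest weight simply because a one-unit change in bathrooms is a much bigger step than a one-unit change in square footage. After standardizing, the ranking reflects how much price moves for a typical change in each feature.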

For the purposes of not making this blog too long, I’ll just explain how we can determine the most important feature. Let’s first take a look at all of the variables we started with:

The variables shown above are displayed using a heatmap. The heatmap shows us the Pearson correlation coefficient between each pair of variables. The dataset provided longitudes and latitudes for each property, and using the geopy Python library I calculated the geodesic distance to different locales within King County. The distance to different locales is how I defined location for this project. You can see that there is a cluster of highly correlated variables in a beige color towards the bottom right of the heatmap. Each of those location-based variables is highly correlated with the others, meaning that some had to be dropped in order to try to satisfy the assumption of no multicollinearity. I also looked at the variance inflation factor to evaluate multicollinearity between variables and decided to use a house’s distance to the Seattle Center as the definitive location metric. It’s also worth noting that many of the other variables in the heatmap were later dropped after running some vanilla models and evaluating p-values.
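The sketch below shows, on a toy scale, why distance-based location variables end up so correlated. It is not the code used for this project: it uses a haversine (spherical) distance in plain Python as a stand-in for geopy's geodesic distance, the landmark coordinates are approximate, and the house coordinates are made up. For two predictors, the variance inflation factor reduces to 1 / (1 - r²), where r is their Pearson correlation.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (spherical approximation)."""
    radius = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Approximate landmark coordinates (assumed for illustration).
SEATTLE_CENTER = (47.6205, -122.3493)
PIKE_PLACE = (47.6097, -122.3422)

# Hypothetical house locations (latitude, longitude) within King County.
houses = [
    (47.5112, -122.2570),
    (47.7210, -122.3190),
    (47.7379, -122.2330),
    (47.5208, -122.3930),
    (47.6168, -122.0450),
]

dist_center = [haversine_miles(lat, lon, *SEATTLE_CENTER) for lat, lon in houses]
dist_pike = [haversine_miles(lat, lon, *PIKE_PLACE) for lat, lon in houses]

r = pearson_r(dist_center, dist_pike)
vif = 1 / (1 - r ** 2)  # VIF when there are exactly two predictors

print(f"Pearson r = {r:.3f}, VIF = {vif:.1f}")
```

Because the two landmarks are less than a mile apart, the two distance columns carry almost the same information, which is the situation that motivates keeping only one of them.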

Let’s take a look at how distance to Seattle Center correlates with the price of a house.

We can see that the bar plot matches our intuition when we think about the price of houses with respect to their proximity to a popular downtown civic center and gathering place. Houses very close to Seattle Center cost more than houses that are farther away. There is a clear trend that the price of a house decreases as the distance from Seattle Center increases. Although we are able to observe a trend, it’s not enough to decide that this location metric is the most influential in determining the price of a house.

Before we answer the question of whether or not location is the most important factor for housing price, we need to understand how we arrived at the final model.

It’s important to understand that model building and model evaluation are very iterative processes. Choosing a different set of variables or different thresholds to eliminate outliers can result in a noticeably different model. We would need to do repeated testing on unseen data in order to conclude with some confidence that one model was better than another.
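One common way to do that kind of repeated testing is k-fold cross-validation. The sketch below is only an illustration under assumed conditions: the data is synthetic (price in thousands of dollars against square footage in thousands, with a deliberately curved relationship), and `cv_mse` is a hypothetical helper written for this example. It compares a straight line with a second degree polynomial by their average error on held-out folds.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative only): a curved price vs. square footage relationship.
sqft_k = rng.uniform(0.5, 4.0, 120)
price_k = 100 + 50 * sqft_k + 30 * sqft_k ** 2 + rng.normal(0, 20, 120)


def cv_mse(x, y, degree, k=5, seed=0):
    """Average mean squared error on held-out folds for a polynomial fit."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        predictions = np.polyval(coefs, x[test_idx])
        errors.append(np.mean((y[test_idx] - predictions) ** 2))
    return float(np.mean(errors))


scores = {degree: cv_mse(sqft_k, price_k, degree) for degree in (1, 2)}
for degree, score in scores.items():
    print(f"degree {degree}: cross-validated MSE = {score:.1f}")
```

Because every score comes from houses the model did not see during fitting, a lower number here is better evidence that one model generalizes better than the other.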

I won’t go through each of the decisions I made to arrive at my final model in this blog, but I found that a linear model was in fact not the best type of model to fit my data. Some of the variables I chose to use did not meet the assumption of homoscedasticity. I also could have chosen to use stricter thresholds to remove a greater percentage of outliers, thereby causing the residuals of the data to be more normally distributed (another important assumption for linear regression). Nonetheless, I was able to use polynomial regression to build a model that I believed to be sufficient. Specifically, it was a second-degree polynomial that fit my data the best. Fortunately, even with a nonlinear model, the magnitude of the weight associated with a variable will still tell us which variable is most influential.

Let’s look at an example formula for polynomial regression:
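In the same m and b notation used for the line earlier, a second degree polynomial in a single input x can be written as (the subscripts are my own labeling):

$$
y = b + m_1 x + m_2 x^2
$$

The model is still a weighted sum, but the squared term lets the fitted curve bend instead of staying a straight line.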

If we apply the formula above to our earlier example using price and square footage then it looks like this:
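With square footage as the single input, and again using my own subscripts for the two weights, a second degree polynomial for price would read:

$$
\text{price} = b + m_1 \cdot \text{sqft} + m_2 \cdot \text{sqft}^2
$$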

Once we’ve determined what the coefficients are for each of these input variables we will know whether square footage or square footage raised to the second power is more influential in determining price!

Let’s first look at the code we would use to instantiate and fit a 2nd degree polynomial to the data: