Towards a Revamped Real Estate Index | by Will Fried | Dec, 2020
Harvard Data Science Capstone Project, Fall 2020
People interested in investing in the stock market or choosing players for their fantasy teams can find a wealth of in-depth analyses with a simple Google search. Unfortunately, the same actionable insights are not readily available when it comes to real estate. For example, prospective home sellers in Cambridge, MA trying to learn more about their local market would be lucky to find a chart like this:
And if they were hoping to optimize the process and determine the best time to put their houses on the market, they might come across forecasts like this:
The sources that provide this type of information are collectively known as real estate indices. The graphs above are taken from the National Association of Realtors (NAR) Confidence Index. Another prominent index is the Case-Shiller Home Price Index, which tracks home prices over time and is shown below for the Boston metro area:
The graphs above get at two of the main shortcomings of existing real estate indices that limit their usefulness. First, these analyses cover very broad geographical areas and thus end up averaging over the highly diverse neighborhoods that make up the market area of interest. Second, these indices are inherently descriptive in nature, rather than predictive, in that their goal is to explain how different metrics have evolved over the past months and years, not how they will change over the coming months. Moreover, any forecasts that are presented tend not to be particularly rigorous, as indicated by the forecast shown in Figure 2, which is based on a survey of Realtors.
This semester we worked with REX, a real estate technology company that is trying to bring innovation to an industry that hasn’t seen much of it over the past 50+ years. In the spirit of REX’s mission, our goal was to address these two weaknesses of traditional real estate indices. First, we sought to predict the market conditions in any given market area one to six months into the future. In our analysis, we defined monthly market conditions to be the number of new home listings and the number of home sales. Second, to make these predictions as targeted as possible, we tailored our forecast to each census tract (CT) in the given market area. To keep things simple, we focused exclusively on the Denver metro area with the idea that the methodology developed there would generalize to any other market area in the country.
We approached the task in two stages: first we sought to forecast the average market conditions across the entire market area. Then we planned to adjust this forecast for each individual census tract based on the characteristics of each census tract. Throughout this process, our priority was to build interpretable models that could shine light on the various factors that appear to influence market conditions. In addition to offering more insights, we aimed to provide visibility into the inner workings of the forecasting model so potential users would have faith in the forecast.
Our first task was to collect features that could help predict the number of listings and/or sales across the Denver metro area.
We incorporated three types of data into our forecasting models. First, we used the raw count of the monthly number of listings and sales in Denver since March 2016. Second, we made use of the fact that most home buyers and sellers conduct months of research online before they actually enter the market. This means, for example, that a surge of people searching for homes is likely indicative of a spike in the number of active buyers over the following several months. We measured this online activity using Google Trends, which allowed us to track the popularity of real estate related search terms such as “Zillow”, “home for sale” and “selling a house”. Figure 4 shows how the relative popularity of several of these search terms has varied over the years in the state of Colorado.
Finally, we speculated that housing market indicators could help us capture the real estate conditions in the Denver metro area. To that end, we collected indicators such as the median days on market and average listing price in Denver from the Federal Reserve Economic Data records. In addition, we gathered economic indicators such as the 30-year mortgage rate and unemployment rate to reflect macro-level economic conditions over time.
Figure 5 presents a high-level overview of our forecasting methodology. Each component of the final forecasting model is described below in detail. For concreteness, the discussion focuses on home sales, but the methodology is identical for new home listings.
1. Time Series Model
First, we built a time series model to predict the total number of monthly home sales in the Denver metro area using the historical home sales data since March 2016. We used a Bayesian structural time series (BSTS) model, which is described in detail in this blog post. After fitting the model, we sampled from the posterior predictive distribution to compute both the point forecast and the predictive interval of the number of sales for the month of interest. Figure 6 shows that the model captures the annual real estate cycle reasonably well but, to no surprise, loses its effectiveness after COVID-19 hit in March 2020.
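The last step above can be made concrete with a minimal sketch: given an array of posterior predictive draws (here simulated stand-ins rather than output from the actual fitted BSTS model), the point forecast and 95% predictive interval fall out of simple summary statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: draws from the posterior predictive distribution of
# next month's sales. In practice these come from the fitted BSTS model.
posterior_draws = rng.normal(loc=4200, scale=300, size=5000)

# Point forecast: posterior predictive mean
point_forecast = posterior_draws.mean()

# 95% predictive interval: central 95% of the draws
lower, upper = np.percentile(posterior_draws, [2.5, 97.5])
print(f"forecast: {point_forecast:.0f}, 95% interval: [{lower:.0f}, {upper:.0f}]")
```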
2. Google Trends Model
Because time series models are relatively restrictive and difficult to work with, we reformulated this model to fit the simpler machine learning setting in which observations are treated as independent of one another. This meant that the first step was to construct the design matrix.
Our exploratory data analysis revealed that there tends to be a three-month lag between online activity and home sales. Therefore, we constructed the feature set by creating lagged versions of the Google Trends data and the raw time series data up to three months in the past. Figure 7 below illustrates this setup: first, let’s assume that the time of prediction is right at the end of month T. Then, to predict the number of home sales in month T+1, T+2 or T+3, we use the historical number of home sales and the Google Trends scores in months T, T-1 and T-2 as predictors. (To be clear, we trained separate models to forecast 1, 2 and 3 months ahead.) In addition, we added features measuring the absolute and percentage differences between each of our collected features in months T and T-1 and in months T-1 and T-2. Lastly, we accounted for seasonality by one-hot encoding the month of the year we were trying to forecast.
Next, we tried to make the time series of the target variable as stationary as possible, as this typically makes it easier for a forecasting model to make accurate predictions. To that end, rather than directly predicting the number of sales, we applied differencing, which means that the goal was to predict the difference between the number of sales in consecutive months. We later confirmed that this approach does in fact yield better forecasts.
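The design matrix and differenced target described above can be sketched in pandas. The data and column names below are synthetic stand-ins (only one hypothetical Google Trends term is shown for brevity), and the sketch covers the one-month-ahead case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 56  # months of data since March 2016
df = pd.DataFrame({
    "month": pd.date_range("2016-03-01", periods=n, freq="MS"),
    "sales": rng.integers(3000, 5000, n).astype(float),
    "trends_zillow": rng.uniform(40, 100, n),  # hypothetical Google Trends score
})

horizon = 1  # separate models are trained for horizons of 1, 2 and 3 months
feats = pd.DataFrame(index=df.index)
for col in ["sales", "trends_zillow"]:
    # lagged values: months T, T-1 and T-2 relative to the forecast month
    for lag in range(3):
        feats[f"{col}_lag{lag}"] = df[col].shift(horizon + lag)
    # absolute and percentage changes between T and T-1, and between T-1 and T-2
    for a in (horizon, horizon + 1):
        diff = df[col].shift(a) - df[col].shift(a + 1)
        feats[f"{col}_diff{a}"] = diff
        feats[f"{col}_pct{a}"] = diff / df[col].shift(a + 1)

# seasonality: one-hot encode the calendar month being forecast
feats = feats.join(pd.get_dummies(df["month"].dt.month, prefix="m"))

# differenced target: change in sales relative to the previous month
target = df["sales"].diff()

valid = feats.dropna().index.intersection(target.dropna().index)
X, y = feats.loc[valid], target.loc[valid]
```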
Having specified the design matrix and the target variable, we were ready to build the model. First, we tried several feature selection/dimensionality reduction methods, such as lasso, PCA, and the feature importance output by tree-based models. Then, we tried different machine learning methods to make the actual prediction. Every combination of a dimensionality reduction technique and a machine learning method was evaluated by performing cross validation with a mean squared error loss function across the entire dataset. As shown below in Figure 8, the best performing model was ridge regression after feature selection by lasso.
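A minimal scikit-learn sketch of the winning combination, lasso-based feature selection followed by ridge regression, evaluated with cross-validated MSE. The data here are synthetic, and the hyperparameter values are illustrative placeholders rather than the ones used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(56, 26))  # stand-in for the design matrix built above
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=56)  # synthetic target

# lasso zeroes out weak features; ridge then fits on the survivors
model = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1)),
    Ridge(alpha=1.0),
)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(f"CV MSE: {-scores.mean():.3f}")
```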
Figure 9 shows the 3-month forecast produced by this optimal model. Just as with the BSTS model, this forecast has struggled during the pandemic.
3. Ensemble Model
Having built two independent forecasting models, we next used stacking to combine them into an ensemble. Specifically, we built a linear regression model to learn the optimal linear combination of the predictions of the BSTS and Google Trends models. Finally, we tried to further improve the forecast by using the housing market indicators and economic indicators to build a boosting model that is fit to the residuals of this stacking model. Although this boosting step does not have a significant impact on the model performance in Denver, it may still be useful in other market areas.
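The two-stage ensemble can be sketched as follows. The base-model predictions and indicator features below are synthetic stand-ins, and in practice the base predictions fed to the stacker should come from held-out folds to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 56
y = rng.normal(4000, 400, n)              # actual monthly sales
pred_bsts = y + rng.normal(0, 200, n)     # stand-in for BSTS model forecasts
pred_trends = y + rng.normal(0, 250, n)   # stand-in for Google Trends model forecasts
indicators = rng.normal(size=(n, 4))      # stand-in housing/economic indicators

# Stage 1: learn the optimal linear combination of the two base forecasts
base = np.column_stack([pred_bsts, pred_trends])
stack = LinearRegression().fit(base, y)
stacked = stack.predict(base)

# Stage 2: fit a boosting model to the stacking residuals using the indicators
residuals = y - stacked
booster = GradientBoostingRegressor(n_estimators=100, max_depth=2).fit(indicators, residuals)
final = stacked + booster.predict(indicators)
```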
Overall, the ensemble model performs well in forecasting 3 months ahead. In order to predict 6 months ahead, we first forecast 1, 2 and 3 months ahead using this ensemble model and incorporate these predictions into the time series. Next, we train a new BSTS model on this augmented time series and predict the remaining 3 months. The results are shown below:
To reiterate, the goal of the second modeling phase is to adjust this market-wide forecast to each individual census tract based on the unique characteristics of the census tracts. This first involved collecting these census tract-level features.
Census Tract Features
We collected the features from two different sources. The first source is NeighborhoodScout, which is a company that sells a wide range of census tract summary statistics. These statistics cover many different topics, including area type (i.e. urban vs. suburban vs. rural vs. remote), suitability scores (e.g. family friendly score, young single professional score), length of commute, crime rate, and the quality of schools. We supplemented these statistics by combing through hundreds of publicly available census datasets at the census tract level. Some of the additional features we thought might be relevant include age demographics (i.e. fraction of population in different age brackets), population growth rate, workforce participation rate, mode of transportation to work, household income, fraction of homes with a mortgage, and vacancy rate. In all, we collected close to 100 features.
The goal of this model is to predict the number of sales in any given month and census tract. This model considers two types of features: the census-tract statistics as described above and the output of the first model, which predicts the total number of sales in the Denver market area. Initially we planned on modeling the number of sales directly by building a model that outputs a count value, such as a Poisson regression model or a gradient boosting regression model with a Poisson objective function. But we eventually realized there’s a simpler and more elegant approach.
The objective of this alternative model is to predict how the rate of sales in each census tract compares to the overall rate of sales in the Denver market area. In particular, the target variable is the ratio of the rate of sales in a given census tract to the rate of sales in the Denver market area. Here’s how the model works: for each month of training data, we calculate the rate of sales in the Denver market area (i.e. the total number of sales divided by the total number of households across all the census tracts) as well as the rate of sales in each census tract (i.e. number of sales in a given census tract divided by the number of households in the given census tract). Next, we compute the ratio of the two quantities for each month. This gives us a distribution of 56 ratios for each census tract, one for each month since March 2016. This distribution is displayed below for 15 arbitrary census tracts with the empirical mean represented by the vertical line.
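The ratio computation can be sketched in pandas; the tract names, household counts and sales figures below are synthetic stand-ins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
tracts = [f"tract_{i}" for i in range(5)]
months = pd.date_range("2016-03-01", periods=56, freq="MS")
households = pd.Series(rng.integers(800, 3000, len(tracts)), index=tracts)

# hypothetical monthly sales counts per census tract
sales = pd.DataFrame(rng.integers(5, 60, (len(months), len(tracts))),
                     index=months, columns=tracts)

# market-wide rate of sales: total sales / total households, per month
market_rate = sales.sum(axis=1) / households.sum()

# per-tract rate of sales, then its ratio to the market-wide rate
tract_rate = sales / households            # broadcasts households across columns
ratios = tract_rate.div(market_rate, axis=0)

# one fitted mean per tract, i.e. the vertical lines in the plots
mean_ratio = ratios.mean()
```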
As shown in the plots above, the distribution of the ratios for each census tract tends to be roughly Gaussian distributed. This makes sense because there are likely many latent factors that influence each census tract’s ratio in an additive way, so by the central limit theorem, the sampling distribution of the ratio should be close to a Gaussian distribution. This means that we can estimate the distribution of the ratio for each census tract by simply fitting a Gaussian distribution to the data via maximum likelihood estimation. Importantly, there is no discernible trend or autocorrelation in the time series of the monthly ratios, which means we can treat the observed ratio for any given month as an independent and identically distributed sample from the fitted Gaussian distribution.
At this point, we’re able to make predictions about future months as follows: first, we forecast the rate of sales for the market area as a whole using the market-level model. Next, we multiply the forecasted sales rate by the mean of the ratio distribution for each census tract. Finally, we multiply this rate by the number of households in each census tract to produce our final estimate. Of course, we want a full predictive distribution, not just a best-guess estimate; to achieve this, we perform Monte Carlo simulation where we sample from the market-wide forecast distribution and then sample a ratio from the Gaussian distribution fitted to each census tract.
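A minimal sketch of the fitting and Monte Carlo steps for a single census tract, with made-up numbers standing in for the observed ratios and the market-wide forecast draws:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical inputs: 56 observed monthly ratios for one census tract,
# posterior draws of next month's market-wide sales rate, and household count
observed_ratios = rng.normal(1.2, 0.15, 56)
market_rate_draws = rng.normal(0.004, 0.0004, 10000)
households = 1500

# Gaussian MLE: sample mean and (biased) sample standard deviation
mu, sigma = observed_ratios.mean(), observed_ratios.std()

# Point estimate: market rate forecast * mean ratio * number of households
point = market_rate_draws.mean() * mu * households

# Monte Carlo: pair each market-rate draw with a fresh ratio draw
ratio_draws = rng.normal(mu, sigma, market_rate_draws.size)
sales_draws = market_rate_draws * ratio_draws * households
lower, upper = np.percentile(sales_draws, [2.5, 97.5])
```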
While this model performs well from a predictive standpoint, it isn’t particularly illuminating because it doesn’t make use of the census tract level features. Therefore, the next step was to use these features to explain why particular census tracts have historically listed and sold homes at a higher rate than others, and vice versa. We considered two interpretable machine learning methods: linear regression and gradient boosting.
1. Linear Regression
Unfortunately, we ran into a lot of multicollinearity issues with the linear regression model. There are two contributing factors: the first is that many of the census tract-level features are highly correlated with each other. One such set of correlated features includes the urban score, population density, fraction of the workforce that commutes by public transportation, and commute times. The second issue has to do with the fact that many of the statistics we collected, such as age demographics and home value, come in the form of a distribution. This means that the probabilities of all the brackets add up to 1, which, in turn, means that the features are linearly dependent. Even if we were to specify a baseline bracket and ignore some of the less common brackets, the remaining features would still be close to being linearly dependent.
Overall, these issues with the design matrix signaled that it would be challenging to build a linear model without ignoring the majority of the collected features. Therefore, it made more sense to go with a non-parametric model, where multicollinearity is much less of a concern.
2. Gradient Boosting
For this part, we trained both an XGBoost and a LightGBM model. As with most tree-based models, we can extract the feature importance, which represents the overall influence of each predictor across the decision trees that comprise the ensemble. Figure 12 shows the feature importance plot for the XGBoost model:
According to the feature importance plot, of the top 10 features, two are related to the change in population over the past five to ten years and three are related to the fraction of homes that were built after a specified year, which of course is positively correlated with the population growth rate. Together, these features indicate that the rate of home sales in a given census tract is strongly related to the interest in living in the census tract in the first place. This is quite intuitive and thus boosts the credibility of the model interpretation.
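A small sketch of extracting feature importance from a gradient boosting model. Scikit-learn's GradientBoostingRegressor stands in here for the XGBoost and LightGBM models we actually trained, and the feature names and data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
feature_names = ["pop_growth_5yr", "pct_built_after_2000", "rental_rate", "median_income"]
X = rng.normal(size=(300, len(feature_names)))  # stand-in tract features
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.3, size=300)  # sales ratio

model = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, y)

# importances are normalized to sum to 1 across all features
importance = pd.Series(model.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))
```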
These feature importance plots, however, have two main drawbacks. First, they tell us nothing about the direction of the relationship between the feature and the response variable. And second, the importance of each feature is averaged across the entire training set and thus doesn’t necessarily capture the feature’s influence on the sales ratio of any particular census tract.
Inspired by the xgboostExplainer package in R, we gathered this additional information by plotting the contribution of each feature to the prediction associated with a given census tract. Figure 13 below shows how the first 30 features contribute to the sales ratio that is predicted for two different census tracts. (Note that the plot only shows the contribution of the first 30 features; if we were to extend this plot to include all 111 features, we’d see that the bars would add up to the final prediction denoted by the dashed line).
The first thing to notice is that the intercept is almost exactly 1.0 for both census tracts. This makes sense because the average census tract has a sales rate that is very similar to the sales rate across the entire market area (hence a ratio of roughly 1.0). Now let’s analyze the rental rate, which is the most influential feature for these two census tracts and represents the fraction of households in the given census tract that are rented (as opposed to owned). Figure 13 indicates that for the first census tract, the value for the rental rate contributes positively to the predicted sales ratio and vice versa for the second census tract. It turns out we can provide an intuitive explanation for this result: the rental rate in the first census tract is relatively low (~21.65%), while it is relatively high for the second census tract (~63.25%). Because homes that are owned are much more likely to be listed and sold than homes that are rented, it makes sense that the model has learned an inverse correlation between rental rate and the rate of home sales.
The diagram below summarizes how all the different data sources and models fit together:
Lastly, we applied the exact same modeling procedure to the Atlanta market area in order to confirm the generalizability of our methodology. We were encouraged to find that our models achieved a similar level of performance even though the Atlanta real estate market differs substantially from the Denver real estate market in many regards.
Apart from applying this methodology to even more market areas, there are several natural extensions to our real estate index. First, in addition to focusing on the number of listings and the number of sales, we can consider other dimensions that contribute to the overall market conditions, such as median days on market and the fraction of listings that didn’t end up selling. It would be particularly interesting to see if there are any features that account for the variation in buyer demand across census tracts. Second, while the current analysis lumps together all Multiple Listing Service (MLS) listings, we can easily filter by listing attributes (e.g. number of bedrooms, price range) to make the models as customizable as possible. Finally, we can make the forecasts even more accurate and interpretable by supplementing Google Trends data with other forms of nontraditional data, such as digital advertising data, which captures people’s online activity on an even more granular level.
We would like to thank the following people: our mentor, Zona Kostic, for her guidance throughout the semester, expertise in real estate and constant enthusiasm; our course instructor, Chris Tanner, for putting together such a rewarding research experience; and our partners at REX for giving us the opportunity to work on this exciting project and for always making time to meet with us to answer our questions and provide feedback.