Pricing Airbnb Listings Optimally | by Tony Ng | Jan, 2021



This section is the crux of the entire project.

I shall skip discussing the Neural Network, as it was included in the Colab notebook only as a negative example for this project. The key point is that a Neural Network is often referred to as a 'black box': it is not entirely possible to explain its prediction outcomes. For example, we could tell a loan applicant that he or she was rejected by our Neural Network model, but we could not explain what led to that final decision. This makes the method ill-suited to our project, since we want the final result to be highly explainable and interpretable by Airbnb hosts so that they can make informed pricing decisions. The code for the ANN is nonetheless still available in the notebook.

It is also worth discussing the train-test split (TTS) strategy, as this project differs slightly from most machine learning projects. Usually, one would perform the TTS randomly (time series aside). In our project, however, it is important to ascertain that the target variable price is actually optimal before a model is trained on that instance. Performing a 'blind TTS' would therefore produce a 'garbage-in, garbage-out' final model. Hence, we need to scrutinise and filter the train-set from which the model derives the underlying rules of optimal pricing.

Following from the above, the unfiltered dataset contains a mix of optimally and less optimally priced listings, and the difficulty is that there is no objective indicator in the dataset to distinguish one from the other. The only borderline plausible variable is the number of reviews per listing. The reasoning: assuming the free market is reasonably efficient, a higher count of stays implies higher demand or engagement and hence a higher likelihood of optimal pricing, even without knowing the sentiment of the underlying reviews, since both parties were willing to engage in voluntary trade in the first place. Because a reviewer must first be a patron, the number of reviews is an indirect indicator of how attractive a listing is to book, and thus of optimal pricing. Hence, we should train our models on the top 80% of the dataset by review count.
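One way to implement this filter with pandas is sketched below; the file and column names are assumptions for illustration, not the notebook's actual identifiers.

import pandas as pd

# Assumptions: 'listings_clean.csv' is the cleaned dataset and
# 'number_of_reviews' is the review-count column; adjust to the actual schema.
df = pd.read_csv('listings_clean.csv')

cutoff = df['number_of_reviews'].quantile(0.20)      # 20th percentile of review counts
train_df = df[df['number_of_reviews'] >= cutoff]     # keep the top ~80% by review count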

Machine learning is actually not as difficult as non-practitioners might believe. At a practical level, you need not understand every intricate computational detail behind a given model and code it from scratch. Furthermore, it is not good coding practice to "reinvent the wheel" when others have already made it available. For instance, I can test out 9 different untuned models from scikit-learn (and XGBoost) with only a few lines of code:

Code:

%%time
# Imports shown here for completeness; they may sit in an earlier cell of the notebook.
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from xgboost import XGBRegressor

nameList = []
cvMeanList = []
cvStdList = []
for Model in [LinearRegression, Ridge, Lasso,
              DecisionTreeRegressor, RandomForestRegressor, ExtraTreesRegressor,
              AdaBoostRegressor, GradientBoostingRegressor, XGBRegressor]:
    # rmse_cv() is the cross-validation helper defined earlier in the notebook
    if Model == XGBRegressor:
        cv_res = rmse_cv(XGBRegressor(objective='reg:squarederror', eval_metric='mae'))
    else:
        cv_res = rmse_cv(Model())
    print('{}: {:.5f} +/- {:5f}'.format(Model.__name__, -cv_res.mean(), cv_res.std()))
    nameList.append(Model.__name__)
    cvMeanList.append(-cv_res.mean())
    cvStdList.append(cv_res.std())

Output:

LinearRegression: 79.72456 +/- 10.095378
Ridge: 79.75446 +/- 10.114177
Lasso: 81.44520 +/- 8.418724
DecisionTreeRegressor: 103.70623 +/- 13.965223
RandomForestRegressor: 77.86522 +/- 13.281151
ExtraTreesRegressor: 78.39075 +/- 14.291264
AdaBoostRegressor: 120.35514 +/- 23.933491
GradientBoostingRegressor: 76.78751 +/- 11.726186
XGBRegressor: 76.69236 +/- 11.640701
CPU times: user 1min 38s, sys: 614 ms, total: 1min 38s
Wall time: 1min 38s
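These scores come from the rmse_cv helper defined earlier in the notebook (not shown in this section). A minimal sketch of such a helper, assuming it simply wraps scikit-learn's cross_val_score with 10-fold negative-RMSE scoring over the training data X_train and y_train:

from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # Returns 10 negative RMSE scores, which is why the mean is negated when printed.
    return cross_val_score(model, X_train, y_train,
                           scoring='neg_root_mean_squared_error', cv=10)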

Next to each model, the first number is the mean of the 10-fold cross-validation error (from rmse_cv) and the second is its standard deviation.

  • We can see that models such as DecisionTreeRegressor and AdaBoostRegressor were not able to outperform the simple baseline of LinearRegression.
  • However, both GradientBoostingRegressor and XGBRegressor had lower CV errors relative to the rest of the list. We can attempt to tune both models further in our solution.

A Gradient Boosting (GB) regressor is used when the target variable is continuous, and a GB classifier when it is categorical. AdaBoost, GB, and XGBoost all use a similar boosting approach to increase model performance. To paraphrase Analytics Vidhya: a spam detector that only checks for the presence of links, or one that only checks whether the email comes from an unknown source, is a weak model on its own; by combining both rules during training, the resulting model is more robust and generalises better. In short, GB is an ensemble of multiple decision trees that yields decent predictions.

However, tuning a model is largely trial and error. For instance, I can search for the minimum mean absolute error (MAE) by testing a few points across a supposedly large parameter space for the number of estimators (n_estimators). The following plot was obtained by testing 7 points:

n_estimators = [5, 25, 50, 100, 250, 500, 1000]

The local minimum appears to be at approximately 220 estimators.
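A sweep of this kind can be written as a simple for-loop. The sketch below assumes the filtered training data X_train, y_train from earlier and scores each candidate with 10-fold cross-validated MAE:

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

n_estimators_list = [5, 25, 50, 100, 250, 500, 1000]
mae_list = []
for n in n_estimators_list:
    scores = cross_val_score(GradientBoostingRegressor(n_estimators=n),
                             X_train, y_train,
                             scoring='neg_mean_absolute_error', cv=10)
    mae_list.append(-scores.mean())   # negate back to a positive MAE

plt.plot(n_estimators_list, mae_list, marker='o')
plt.xlabel('n_estimators')
plt.ylabel('10-fold CV MAE')
plt.show()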

Unlike Neural Networks, both GB and XGBoost can be highly explainable. For instance, we can tell which variables were important in explaining the predictions for price. In general, there are two ways to measure feature importance in tree-based models:

  1. Feature Importance from Mean Decrease in Impurity (MDI)
  • Impurity is quantified by the splitting criterion of the decision trees (Gini, entropy or mean squared error).
  • However, this method can give high importance to features that may not be predictive on unseen data when the model is overfitting.

  2. Permutation Importance
  • Permutation-based feature importance, on the other hand, avoids this issue, since it can be computed on unseen data (a sketch of both methods follows this list).
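Both can be obtained with a few lines of scikit-learn. The sketch below assumes gb_model is the tuned GradientBoostingRegressor and X_train, X_test, y_test are the splits discussed earlier:

import pandas as pd
from sklearn.inspection import permutation_importance

# 1. MDI importance comes directly from the fitted tree ensemble.
mdi = pd.Series(gb_model.feature_importances_,
                index=X_train.columns).sort_values(ascending=False)

# 2. Permutation importance is computed on held-out data, so it reflects
#    predictive power on unseen instances.
perm = permutation_importance(gb_model, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_imp = pd.Series(perm.importances_mean,
                     index=X_test.columns).sort_values(ascending=False)

print(mdi.head(10))
print(perm_imp.head(10))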

From this, we can see that both room_type_Private room and calculated_host_listings_counts are consistently ranked at the top as most important in explaining price.

6.3 eXtreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) is a fairly recent machine learning method (considering that Neural Networks were conceptualised in the 1940s and the SVM was introduced by Vapnik and Chervonenkis in the 1960s) that is not only fast and efficient but also amongst the best-performing models currently available.

xgboost is available for import in Colab. Furthermore, the XGB model can utilise Colab's free GPU to be tuned more efficiently.
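A minimal sketch of fitting on the GPU, assuming a GPU runtime is enabled in Colab and using XGBoost's gpu_hist tree method (fall back to 'hist' on a CPU-only runtime):

from xgboost import XGBRegressor

xgb_model = XGBRegressor(objective='reg:squarederror',
                         eval_metric='mae',
                         tree_method='gpu_hist')   # trains on the GPU
xgb_model.fit(X_train, y_train)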

The same kind of parameter search used for n_estimators in GB was then applied to the XGB model for the learning rate.

Although a single parameter can be searched with a simple for-loop, a more comprehensive approach is to specify a parameter grid and use either GridSearchCV or RandomizedSearchCV.

  • GridSearchCV iterates through every possible combination (pros: very thorough parameter search; cons: potentially very long running time, e.g. 5 hyperparameters with 5 candidate values each = 5⁵ = 3,125 models).
  • RandomizedSearchCV's run time is instead controlled by n_iter, which specifies how many combinations to sample.

In this case, RandomizedSearchCV with n_iter=50 was specified for the following parameter grid:

param_grid = {
    "learning_rate": [0.032, 0.033, 0.034],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "subsample": [0.6, 0.8, 1.0],
    "max_depth": [2, 3, 4],
    "n_estimators": [100, 500, 1000, 2000],
    "reg_lambda": [1, 1.5, 2],
    "gamma": [0, 0.1, 0.3],
}
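A sketch of wiring this grid into scikit-learn's RandomizedSearchCV follows; the scoring metric and cv value are assumptions carried over from the earlier MAE discussion, and xgb_model is the regressor sketched above:

from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=50,
    scoring='neg_mean_absolute_error',
    cv=10,
    random_state=42,
    verbose=1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(-search.best_score_)   # best cross-validated MAE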

Returning to the earlier point about the TTS, the test-set result should not be our focus: since we train mainly on rows with optimal prices, we should not expect the model to generalise well to a test set containing less optimally priced instances. The more important aim of the project is to explain the numerical value behind each prediction.

We can visualize predictions in detail using the SHapley Additive exPlanations (SHAP) library for both the GB and XGBoost models. To avoid redundancy, I will discuss only the XGBoost model here. Variables 'pushing' a prediction higher are shown in red and those pushing it lower in blue.

For instance, the first-row prediction is explained using SHAP. The base value of 135.5 is the mean of all price values and is the same for every instance. The variables in red increased the predicted price, while the sole variable room_type_Private room decreased it, resulting in a final prediction of 146.52. From this chart, we can probably interpret that a shared room, compared with a private room, cannot justify a higher price than the latter.

Additionally, if we rotate the figure above anti-clockwise by 90 degrees so that the first row sits at the start of the x-axis, we can stack the remaining predictions side by side to obtain the figure below. Most prediction values hover around 135 from roughly index 0 to 1,600, where the upward and downward pushes are evenly matched. From index 1,600 to 2,800, a few variables switch sides and push the predictions lower. In the final slice, the blue variables drive the prediction values downwards with little resistance.

SHAP variable importance is determined not by impurity values, as discussed for GB, but simply by how much a variable has overall pushed (explained) the predictions across all instances. We can see that room_type_Private room is consistently the most important variable in predicting price (matching the GB MDI and permutation importance results), where a value of 0 pushes a prediction downwards and a value of 1 pushes it upwards.

The above plot is less useful if you are uninterested in the direction each variable pushes and care only about overall importance; in that case, plotting the mean absolute SHAP value is more helpful.
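The plots discussed above can be produced with a few calls to the shap library. A sketch, assuming best_xgb is the tuned XGBoost model from the search and X_test is the held-out feature matrix:

import shap

shap.initjs()   # enables the interactive force plots in a notebook
explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_test)

# Force plot for a single prediction (the first row discussed above).
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])

# Stacked force plot: all predictions rotated and placed side by side.
shap.force_plot(explainer.expected_value, shap_values, X_test)

# Global importance: signed summary plot and mean-absolute-value bar plot.
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type='bar')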

We can easily create a web application (app) with streamlit to share our SHAP results and deploy it using Heroku, a PaaS provider. The app can be deployed with the Heroku CLI:

heroku login
git init
heroku create airbnb-sg          # creates the app and adds the "heroku" git remote
heroku git:remote -a airbnb-sg   # only needed if the remote was not added automatically
git add .
git commit -m "Add changes to both Heroku and Github"
git push heroku HEAD:master


Note that the app serves cached predictions, as free-tier Heroku dynos are extremely limited.
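A minimal sketch of such a streamlit app is shown below; it assumes the predictions and per-listing SHAP values were precomputed and saved to disk, and the file and column names are hypothetical:

# app.py
import pandas as pd
import streamlit as st

@st.cache
def load_results():
    # Precomputed offline so the free dyno only serves static results.
    preds = pd.read_csv('cached_predictions.csv')
    shap_df = pd.read_csv('cached_shap_values.csv')
    return preds, shap_df

preds, shap_df = load_results()

st.title('Pricing Airbnb Listings Optimally')
idx = int(st.number_input('Listing index', min_value=0, max_value=len(preds) - 1, value=0))
st.write('Predicted price:', preds.iloc[idx]['prediction'])
st.bar_chart(shap_df.iloc[idx])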



