Animals see the LightGBM: Predicting Shelter Outcomes for SoCo County Animals | by Amulya Saridey | Dec, 2020


Cleaning, exploring, and engineering data in order to build a model with practical use for predicting outcome types of animals in the Sonoma County Animal Shelter data.

This article was written for MIS 382N: Business Data Science. Our team consists of Ella Akl, Jonathan Garcia, Riley Moynihan, & Amulya Saridey who all worked together to write this article!

Photo by Andriyko Podilnyk on Unsplash

Animal shelters provide a much-needed service to animals and humans alike. They help stray animals become healthy and trained in order to find a “forever home”, help owners track down and recover their lost pets, and let terminally ill or severely injured animals find peace in the next life.

However, animal shelters suffer from a variety of logistical problems. Limited resources, inadequate volunteer training, and overcrowding are major issues that can severely affect the quality of life for both the animals and the shelter staff. Therefore, we thought it would be constructive to use data science to help shelters better understand and predict the outcomes of the animals in their care, so that they can optimize their resources and training.

One of the coolest things about working with data is the freedom to explore, visualize, and engineer features for models. Given the autonomy to work with any dataset, we naturally gravitated toward one about animals: the Sonoma County Animal Shelter data.

Sonoma County, CA, collects data about the animals processed in its county shelters and publishes it as a CSV file in its open data catalog. The dataset includes over 20,000 animal records with 24 features, covering demographic information about each animal, the context of the animal’s intake into the shelter, and the outcome that resulted in the animal leaving the shelter. In our case, we built our model to predict ‘Outcome Type’, the high-level reason for an animal leaving the shelter. These range from adoption and return-to-owner to euthanasia (the distribution of outcomes is analyzed further on).

Upon downloading the dataset, it was clear that a fair amount of cleaning and preprocessing would be required to make it useful for our purposes.

Before we could start deriving any sort of insight from the data, we needed to make sure that it was clean and not missing any crucial information. To do that, we loaded things up in Pandas.

# Import statements
import pandas as pd

The most obvious thing to check for was missing values, and Pandas makes that a pretty simple thing to do. Running the missing-value count gave us the following output:
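As a minimal sketch of this check (the column names and values below are hypothetical stand-ins for the real dataset), the standard pandas idiom counts missing values per column:

```python
import pandas as pd

# Toy frame mimicking the shelter data (hypothetical columns and values)
df = pd.DataFrame({
    "Name": ["Rex", None, "Milo", None],
    "Size": ["MED", "SMALL", None, "LARGE"],
    "Outcome Type": ["ADOPTION", "RETURN TO OWNER", None, "TRANSFER"],
})

# Count missing values in each column
na_counts = df.isna().sum()
print(na_counts)
```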

Image by Author

At first we were somewhat dismayed with the sheer number of NAs in some features, but we quickly realized that the features with the highest NA counts were either irrelevant (such as Outcome Jurisdiction, which we would not know in the context of using this model to predict Outcome Type), or optional (such as Name).

The only feature we deemed a candidate for imputation was “Size”. Given the low number here, we opted to just perform a forward fill for this feature.

Once that was filled, there was only one other feature that could not tolerate any NAs, that being “Outcome Type” itself. As there were only 120 records with missing outcome types, we felt confident that simply dropping these records would not adversely affect our project.
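A minimal sketch of both steps, assuming columns named “Size” and “Outcome Type” as in the shelter data:

```python
import pandas as pd

# Toy frame with gaps in the imputable feature and in the target
df = pd.DataFrame({
    "Size": ["MED", None, "SMALL", None],
    "Outcome Type": ["ADOPTION", "TRANSFER", None, "RETURN TO OWNER"],
})

# Forward-fill the sparse "Size" feature
df["Size"] = df["Size"].ffill()

# Drop the few records missing the target itself
df = df.dropna(subset=["Outcome Type"])
```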

The final preprocessing we did going into data exploration was to extract and reformat the geolocation data for the Outcome Jurisdiction for each record. We thought that this might provide some interesting visualizations in terms of where these animals were going after being released from the shelter.

Speaking of visualizations, the first thing we did with our cleaned-up dataset was to explore the various features and how they are distributed within our data.

First things first, we import the necessary Python packages to load and explore the data. Here, we are importing Pandas and NumPy, two very popular packages in the world of data science, rightfully so, with immense power to manipulate and view data in different ways. In addition, we also imported a couple of other packages which will be useful for our visualizations later on.

# Import statements
import pandas as pd
import numpy as np

# Packages for data visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from chart_studio import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

After doing this and loading the data into a Pandas DataFrame from a CSV file, we were able to see the sprawl of data that we had to work with.

Image by Author

Before doing a deep dive of exploration, it is always good practice to do some data clean up. The steps that we took were to drop any columns that weren’t needed, fill N/A values, and standardize columns that might have inconsistent data (e.g. different data types or formatting). A great approach to standardizing columns is to convert them to categorical values where possible. Most models require numeric input, so encoding the data this way during cleaning helps produce more accurate models.

For this particular dataset, we dropped a few columns that we didn’t feel would benefit our analysis, such as Animal ID, Impound Number, Intake Subtype, etc. Beyond dropping columns, our biggest modifications were splitting the location column into separate latitude and longitude columns, and calculating an age column by subtracting the birth date from the intake date. We ended up with this new DataFrame.
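A rough sketch of these two transformations, assuming the location strings contain a “(lat, lon)” pair and the date columns are named “Intake Date” and “Date Of Birth” (the exact format in the published CSV may differ):

```python
import pandas as pd

# Toy rows with a hypothetical "(lat, lon)" location format
df = pd.DataFrame({
    "Location": ["(38.44, -122.71)", "(38.29, -122.46)"],
    "Intake Date": pd.to_datetime(["2020-01-15", "2020-03-02"]),
    "Date Of Birth": pd.to_datetime(["2015-01-15", "2019-03-02"]),
})

# Split the location string into numeric latitude/longitude columns
coords = df["Location"].str.extract(r"\(([-\d.]+),\s*([-\d.]+)\)")
df["Latitude"] = coords[0].astype(float)
df["Longitude"] = coords[1].astype(float)

# Age in years = intake date minus date of birth
df["Age"] = (df["Intake Date"] - df["Date Of Birth"]).dt.days / 365.25
```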

Image by Author

Once familiar with our cleaned-up dataset, we were able to start exploring it through visualizations. We started off by creating a couple of simple charts to begin visually understanding our data. The first was a distribution of animals’ ages using a distplot from Seaborn.

Age distribution of animals at SoCo Animal Shelter (Image by Author)

Then, we used a histogram to display the distribution of sex of the animals.

Image by Author

Here we used Matplotlib to do a very basic value counts bar plot for the outcome types of the animals at the shelter.

Outcome types for animals at SoCo during data exploration (Image by Author)

After doing this visualization, we can see that most of the animals that ended up at the animal shelter were either returned to their owner or adopted. These are both favorable outcomes! To further this idea, we wanted to visualize where these animals were adopted or outplaced to. We plotted the locations as an overlay on a map using Plotly and color coded them based on the outcome type.

Animals being adopted all over the country from SoCo Animal Shelter! (Image by Author)
A closer look of Sonoma County (Image by Author)

After viewing the data on a map, it was quite interesting to discover that Sonoma County Animal Shelter animals are adopted all over the country.

Next, we combined sex of the animal with outcome type to analyze any potential correlations between these two features.

Outcome types based on gender for animals (Image by Author)

This sparked an additional visualization that we could do with the data. Instead of comparing outcome types to sex, we thought: why not compare them to the types of animals at the shelter? This would give us a more granular understanding of the dataset. This first chart shows that there tend to be more dogs than cats or other animals.

Distribution of dogs, cats, and other animals at the SoCo Animal Shelter (Image by Author)

We then added outcome type to this to get the following:

Outcome types based off of animal type (Image by Author)

The results of this graph give us a lot of great insights! If you take a look at the blue bar for the outcome type of “Return To Owner”, the count is significantly higher for dogs than for any other animal. This makes a lot of sense, as dogs can easily run away from their homes and owners.

We also derived a feature identifying which animals were mixed breed versus those that were not, which yielded this plot:

Counts for breed type (Image by Author)

After this, we mapped the outcome types based on breed to look for any connections or correlations. Visualizing the data yielded the following:

Outcome types based off of animal breeds (Image by Author)

Then, another factor we wanted to visualize was how having a name affected the animal’s outcome type. First we visualized the spread for animals who had a name versus those that did not have a name.

Count of animals with name versus without name (Image by Author)

Then, we visualized the outcome types based off of the above data.

Outcome type based on name versus no name (Image by Author)

Lastly, after converting all the times and dates into a standard format, we visualized which days animals tend to get adopted the most!

Days animals are most likely to get adopted (Image by Author)

Before we could begin training and evaluating any models, the data still needed to undergo a few more transformations in order to be the most effective at predicting Outcome Type.

The first step was to ditch all of the features that were irrelevant in the context of our problem. Most of these were features completely dependent on the Outcome Type itself (i.e. Outcome Subtype), or identification features that exist simply to ID the animal or case.

The next step was to convert columns into a more useful and understandable form for any given model. First, some columns like “Name” were converted into a boolean (e.g. “Has Name”). The assumption here is that the actual name of the animal isn’t relevant, but whether or not it has a name carries some predictive power.
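As a sketch, this kind of conversion (with “Has Name” as the hypothetical flag column) is a one-liner in pandas:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Rex", None, "Milo", None]})

# Replace the name itself with a boolean "has a name" flag
df["Has Name"] = df["Name"].notna()
df = df.drop(columns=["Name"])
```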

Another type of conversion was binning some features, such as the animal’s age. First, we had to calculate the age by taking the difference of “Intake Date” and “Date of Birth”. From there, we created a categorical variable with the bins 0–3, 3–6, 6–9, and 9+. In addition, since some animals did not have a known DOB, we created the final category “UNK”.
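With pandas, this binning can be sketched as follows (ages in years, with NaN standing in for an unknown date of birth):

```python
import pandas as pd
import numpy as np

age = pd.Series([1.5, 4.0, 7.2, 12.0, np.nan])

# Bin ages into the categories 0-3, 3-6, 6-9, and 9+
binned = pd.cut(age, bins=[0, 3, 6, 9, np.inf], labels=["0-3", "3-6", "6-9", "9+"])

# Animals with no known DOB fall into a final "UNK" category
binned = binned.cat.add_categories("UNK").fillna("UNK")
```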

The last thing we did was take any features that existed in a format unreadable to the classifier and convert them to be readable. For date columns, this included changing the format from MM-DD-YYYY to epoch time, and then for all other features, running them through either Scikit-Learn’s LabelEncoder or OneHotEncoder classes.
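A hedged sketch of both conversions (the column names here are illustrative, and we use LabelEncoder as described; one-hot encoding would replace the last step for non-tree models):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Intake Date": ["01-15-2020", "03-02-2020"],
    "Type": ["DOG", "CAT"],
})

# MM-DD-YYYY strings -> epoch seconds
dates = pd.to_datetime(df["Intake Date"], format="%m-%d-%Y")
df["Intake Date"] = dates.astype("int64") // 10**9

# Label-encode remaining categoricals for tree-based models
# (use OneHotEncoder / pd.get_dummies instead for linear models)
df["Type"] = LabelEncoder().fit_transform(df["Type"])
```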

The following workflow was used to train and tune almost all classification algorithms that were tested. It allowed for a standardized, efficient, and highly effective method of optimizing model performance. The same technique yielded great results on the Kaggle competition for a group member, and was easily applied here for similar increases in performance.

  1. Import standard libraries (numpy, pandas, etc.).
  2. Set np.random.seed(42) to ensure any numpy-based RNGs were starting on the same seed.
  3. Load in the dataset. Depending on the model, this was either the LabelEncoded version (for tree-based models) or the OneHotEncoded version (for any algorithm that might otherwise infer an ordinal relationship).
  4. Create a baseline model with all default parameters and score with 3-fold cross validation. This is so that we have a frame of reference as we begin hyperparameter tuning.
  5. Run a RandomizedSearchCV with 3-fold cross validation with a large range of possible parameter values for 500 iterations. RandomizedSearch, unlike GridSearch, does not try every single combination of parameter values, but rather randomly tries a set number of combinations. This allows us to get a narrower idea of what our optimal hyperparameters will be without needing a supercomputer to brute force every possible combination. The ranges of possible parameter values were empirically derived through internet searches, Kaggle notebooks, documentation, and common sense.
  6. Using the best parameters found from the RandomizedSearchCV, run a GridSearchCV with only three possible parameter values for each hyperparameter. The middle value of this range will be the best value found from the RandomizedSearch. Then, include a step down and a step up from this value. For example, if the RandomizedSearch for LogisticRegression told us that the best parameter for C was C: 0.1, the range for C in the GridSearch will be C: [0.01, 0.1, 1.0]. Do this for all tested hyperparameters.
  7. Run the GridSearchCV with 3-fold cross validation and observe which hyperparameters changed in the reported best parameters. For any values that changed, shift the range so that the new best value is in the middle once again. For example, if after running GridSearch with the above three possible values for C and the new best parameters now reported C: 0.01, then our new range for C will be C: [0.001, 0.01, 0.1].
  8. Keep running GridSearchCV, updating the three range values each time, until the best parameters always report the middle value for each tested hyperparameter (i.e., it settles).
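As an illustration of steps 4 through 6 (using LogisticRegression on synthetic data, and far fewer search iterations than the 500 used in the project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score

np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Step 4: baseline score with default parameters, 3-fold CV
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

# Step 5: broad randomized search over a wide range of C values
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": np.logspace(-3, 3, 50)},
    n_iter=10, cv=3, random_state=42,
).fit(X, y)
best_C = rand.best_params_["C"]

# Step 6: narrow grid centered on the best value, one step down and one up
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [best_C / 10, best_C, best_C * 10]},
    cv=3,
).fit(X, y)
```

Steps 7 and 8 would then repeat the grid search, re-centering the three values whenever the reported best parameter moves off the middle.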

The main benefit of using this method to do hyperparameter tuning is that interactions between the hyperparameters are taken into account. If one was to tune each hyperparameter individually, it’s conceivable that the change in one parameter might interact with a previously tuned one in a way that makes the tuning suboptimal. Keeping multiple values open for each parameter with every GridSearchCV iteration will show if any “tuned” parameters suddenly become suboptimal again due to the shift in another (and in fact, this did occur on several occasions).

Scoring was done with 3-fold cross validation using Scikit-Learn’s roc_auc_score with a multiclass (one-vs-rest) strategy. AUC was chosen due to the imbalance of the various Outcome Types in the dataset.
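Wired into cross-validation, this scoring might look like the following (illustrated on synthetic data with a small random forest, not the actual shelter model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem standing in for the Outcome Type target
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=6, random_state=42
)

# 3-fold CV scored with one-vs-rest multiclass AUC
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, cv=3, scoring="roc_auc_ovr",
)
```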

After trying a variety of algorithms, we found that tree-based methods were consistently outperforming linear-based methods or neural networks (the former due to underfitting, and the latter due to overfitting).

After following the training and tuning procedure outlined above for several tree-based algorithms, we found that LightGBM performed best, achieving an AUC of 0.948. The baseline model (no hyperparameter tuning) for this algorithm scored 0.912 AUC, so our method of hyperparameter tuning was able to increase the performance by over 0.03, which we were quite happy with. The final parameters for LightGBM were as follows:

Params for best score: {‘colsample_bytree’: 1, ‘learning_rate’: 0.2, ‘max_depth’: 2, ‘min_child_weight’: 0.01, ‘min_split_gain’: 0.001, ‘objective’: ‘auc’, ‘random_state’: 42, ‘subsample’: 0.25}

XGBoost and CatBoost both achieved a final AUC score of at least 0.94 as well, while random forests achieved a final score of only 0.923.

After training LightGBM, we used its built-in method to display the feature importances of the final model. Those results were as follows:

Image by Author

Interestingly, “Kennel Number” had the highest predictive power. This feature identifies which shelter the record belongs to. There are several reasons this might be the case: perhaps some shelters have more resources, shelters specialize in different intake conditions, or the location around the shelter has some sort of lurking variable.

Next was “Intake Date”, which was also interesting. We speculate that the model is finding patterns in either the day of the week or the time of day an animal was admitted (e.g., animals confiscated due to neglect may arrive at different times than lost pets taken to the shelter after being found by someone else).

In addition, we were also surprised at some features which were placed fairly low in the importances. Age is a great example of this, as we expected the age of the animal to have a strong impact on the outcome (e.g., older dogs get euthanized more frequently), but it seems that this may not necessarily be the case, or if it is, it’s a weaker relationship than expected.

Overall, we had a great time exercising our data science knowledge in a context that we all felt is important. It’s useful to see how these tools and techniques can be applied to real world problems, and we’re excited to expand that toolkit even further next semester!
