Predicting Crime in San Francisco – Algobeans


Editor’s Note: Because of the potential biases in machine learning models and the data that they are trained on, the blog owners Annalyn and Kenneth do not support the use of machine learning for predictive policing. This post was meant to teach random forest concepts, not to encourage its application in law enforcement. A new post with a different application is being developed.

Can several wrongs make a right? While it may seem counter-intuitive, this is possible, sometimes even preferable, in designing predictive models for complex problems such as crime prediction.

In the film Minority Report, police officers were able to predict and prevent murders before they happened. While current technology is nowhere near that capability, predictive policing has been implemented in some cities to identify locations with high crime. Location-based crime records could be coupled with other data sources, such as income levels of residents, or even the weather, to forecast crime occurrence. In this chapter we build a simple random forest to forecast crime in San Francisco, California, USA.

Random forests combine the predictions of multiple decision trees. Recall from our previous chapter that in constructing a decision tree, the dataset is repeatedly divided into subtrees, guided by the best combination of variables. However, finding the right combination of variables can be difficult. For instance, a decision tree constructed from a small sample might not generalize to future, larger samples. To overcome this, multiple decision trees could be constructed by randomizing the combination and order of variables used. The aggregated result from this forest of trees would form an ensemble, known as a random forest.

Random forest predictions are often better than those from individual decision trees. The chart below compares the accuracy of a random forest to that of its 1000 constituent decision trees. Only 12 out of 1000 individual trees yielded an accuracy better than the random forest. In other words, there is roughly a 99% chance that the random forest's predictions are at least as good as those of any single decision tree drawn from it.


Histogram showing the accuracy of 1000 decision trees. While the average accuracy of decision trees is 67.1%, the random forest model has an accuracy of 72.4%, which is better than 99% of the decision trees.

Random forests are widely used because they are easy to implement and fast to compute. Unlike most other models, a random forest can be made more complex (by increasing the number of trees) to improve prediction accuracy without increasing the risk of overfitting.

Past research shows that crime tends to occur on hotter days. Open data from the San Francisco Police Department (SFPD) and National Oceanic and Atmospheric Administration (NOAA) were used to test this hypothesis. The SFPD data contains information on crimes, including location, date, and crime category. The NOAA data provides information on daily temperature and precipitation in the city.


A heat map of crime levels in San Francisco. Colors indicate crime severity, which can be very low (gray), low (yellow), moderate (orange), or high (red).

From the heat map, we can see that crime occurs mainly in the boxed area north-west of the city, so we further examine this area by dividing it into smaller rectangles measuring 900ft by 700ft (260m by 220m).

Realistically, SFPD can only afford to concentrate extra patrols in certain areas due to limited manpower. Hence, the model is tasked to select about 30% of the rectangles each day that it predicts to have the highest probability of a violent crime occurring, so that SFPD can increase patrols in these areas. Data from 2014 to 2015 was used to train the model, while data from 2016 (Jan – Aug) was used to test the model’s accuracy.
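The selection step can be pictured as ranking the rectangles by predicted probability and flagging the top 30%. The sketch below is only an illustration of that idea, not the code behind this post: the function name `select_patrol_cells`, the rectangle ids and the probabilities are all made up.

```python
def select_patrol_cells(probabilities, fraction=0.3):
    """Return the rectangle ids with the highest predicted probability.

    `probabilities` maps a (hypothetical) rectangle id to the model's
    predicted probability of a violent crime in that rectangle today.
    """
    # Number of rectangles to flag: about `fraction` of them, at least one.
    k = max(1, round(len(probabilities) * fraction))
    # Highest probability first; ties are broken by id so the result is stable.
    ranked = sorted(probabilities, key=lambda cell: (-probabilities[cell], cell))
    return ranked[:k]


# Made-up probabilities for ten rectangles, for illustration only.
example = {
    "r01": 0.05, "r02": 0.40, "r03": 0.12, "r04": 0.33, "r05": 0.08,
    "r06": 0.51, "r07": 0.02, "r08": 0.27, "r09": 0.19, "r10": 0.10,
}
flagged = select_patrol_cells(example)
print(flagged)  # ['r06', 'r02', 'r04']
```

With ten rectangles, three are flagged each day; every other rectangle is treated as a "no crime predicted" location for that day.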

A random forest of 1000 decision trees successfully predicted 72.4% of all the violent crimes that happened in 2016 (Jan – Aug). A sample of the predictions can be seen below:


Crime predictions for 7 consecutive days in 2016. Circles denote locations where a violent crime is predicted to happen. Solid circles denote correct predictions. Crosses denote locations where a violent crime happened, but was not predicted by the model.

Based on the predictions illustrated above, SFPD should allocate more resources to areas coded red, and fewer to areas coded gray. While it may seem obvious that we need more patrols in areas with historically high crime, the model goes further to pinpoint crime likelihood in non-red areas. For instance, on Day 4, a crime in a gray area (lower right) was correctly predicted despite no violent crimes occurring there in the prior 3 days.

Random forest also allows us to see which variables contribute most to its prediction accuracy. Based on the chart below, crime appears to be best forecasted using crime history, location, day of the year and maximum temperature of the day.


Top 12 variables contributing to the random forest’s accuracy in predicting crime.

A random forest is an example of an ensemble, which is a combination of predictions from different models. In an ensemble, predictions could be combined either by majority-voting or by taking averages. Below is an illustration of how an ensemble formed by majority-voting yields more accurate predictions than the individual models it is based on:


Example of three individual models attempting to predict 10 outputs of either Blue or Red. The correct predictions are Blue for all 10 outputs. An ensemble formed by majority voting based on the three individual models yields the highest prediction accuracy.
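The same effect can be reproduced in a few lines. The predictions below are invented for this sketch (they are not the values from the figure): each of the three models gets 7 of the 10 outputs right, but they err on different outputs, so the majority vote is wrong only where two models err together.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most of the models predicted for one output."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of outputs predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


B, R = "Blue", "Red"
truth = [B] * 10  # the correct answer is Blue for all 10 outputs

# Invented predictions: every model is right on 7 of the 10 outputs.
model_1 = [B, B, B, B, B, B, B, R, R, R]
model_2 = [R, B, B, R, B, B, B, B, B, R]
model_3 = [B, R, R, B, B, B, R, B, B, B]

# Vote output by output across the three models.
ensemble = [majority_vote(votes) for votes in zip(model_1, model_2, model_3)]

for name, preds in [("model 1", model_1), ("model 2", model_2),
                    ("model 3", model_3), ("ensemble", ensemble)]:
    print(f"{name}: {accuracy(preds, truth):.0%}")
# model 1: 70%, model 2: 70%, model 3: 70%, ensemble: 90%
```

The ensemble is wrong only on the last output, where two of the three models happen to make the same mistake.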

As a random forest is an ensemble of multiple decision trees, it leverages “wisdom of the crowd”, and is often more accurate than any individual decision tree. This is because each individual model has its own strengths and weaknesses in predicting certain outputs. As there is only one correct prediction but many possible wrong predictions, individual models that yield correct predictions tend to reinforce each other, while wrong predictions cancel each other out.

For this effect to work, however, models included in the ensemble must not make the same kind of mistakes. In other words, the models must be uncorrelated. This is achieved via a technique called bootstrap aggregating (bagging).
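To see why uncorrelated mistakes matter, consider an idealized case that we assume purely for illustration: a two-class problem in which every model is correct with the same probability and the models err independently of each other. The probability that the majority is right then follows from the binomial distribution. The function name and the 60% accuracy figure are our own choices for this sketch.

```python
from math import comb


def majority_correct_prob(n_models, p_correct):
    """Probability that more than half of n independent models are right.

    Assumes a two-class problem, an odd number of models (no ties) and
    fully independent errors, which real decision trees only approximate.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


for n in (1, 3, 11, 101):
    print(n, round(majority_correct_prob(n, 0.6), 3))
```

With three independent models that are each right 60% of the time, the majority is right 64.8% of the time, and the figure keeps rising as more models are added. If the models all made identical mistakes instead, the vote would be no better than a single model, stuck at 60%.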

In random forest, bagging is used to create thousands of decision trees with minimal correlation. (See a recap on How Decision Trees Work.) In bagging, a random subset of the training data is selected to train each tree. Furthermore, the model randomly restricts the variables that may be used at the splits of each tree. Hence, the trees grown are dissimilar, but they still retain certain predictive power.
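The data-sampling half of bagging can be sketched as follows. We assume the usual bootstrap convention of drawing rows with replacement until the sample is as large as the original training set; the integers 0 to 9 simply stand in for ten training records.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap sample."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
training_rows = list(range(10))  # stand-ins for ten training records

# Each tree gets its own sample, so each tree sees slightly different data.
samples = [bootstrap_sample(training_rows, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    never_seen = sorted(set(training_rows) - set(sample))
    print(f"tree {i}: trained on {sorted(sample)}, never saw {never_seen}")
```

Because rows are drawn with replacement, some records appear several times in a sample and others not at all, which is one reason the resulting trees differ from each other.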

The diagram below shows how variables are restricted at each split:


How a tree is created in a random forest

In the above example, there are 9 variables represented by 9 colors. At each split, a subset of variables is randomly sampled from the original 9. Within this subset, the algorithm chooses the best variable for the split. The size of the subset was set to the square root of the original number of variables. Hence, in our example, this number is 3.
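A single split under this rule might look like the sketch below. The variable names and their "split quality" scores are invented; in a real tree the score would be recomputed from the data at every split (for example, as the reduction in impurity), which we skip here to keep the focus on the random restriction.

```python
import math
import random


def candidate_variables(variables, rng):
    """Randomly pick sqrt(n) variables (rounded down) to consider at a split."""
    subset_size = math.isqrt(len(variables))  # 9 variables -> 3 candidates
    return rng.sample(variables, subset_size)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
variables = [f"var_{i}" for i in range(1, 10)]  # nine hypothetical variables

# Invented scores standing in for how well each variable splits the data.
split_quality = dict(zip(variables, [0.12, 0.30, 0.05, 0.22, 0.41,
                                     0.08, 0.17, 0.36, 0.02]))

candidates = candidate_variables(variables, rng)
best = max(candidates, key=split_quality.get)
print("considered:", candidates)
print("chosen for this split:", best)
```

The overall best variable (`var_5` in this made-up table) is chosen only at splits where it happens to be among the three candidates, so different trees end up relying on different variables.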

Black box. Random forests are considered “black-boxes”, because they comprise randomly generated decision trees, and are not guided by explicit guidelines in predictions. We do not know exactly how the model came to the conclusion that a violent crime would occur at a specific location; instead, we only know that a majority of the 1000 decision trees thought so. This may lead to ethical concerns when used in areas like medical diagnosis.

Extrapolation. Random forests are also unable to extrapolate predictions for cases that have not been previously encountered. For example, given that a pen costs $2, 2 pens cost $4, and 3 pens cost $6, how much would 10 pens cost? A random forest would not know the answer if it had not encountered a situation with 10 pens, but a linear regression model would be able to extrapolate a trend and deduce the answer of $20.
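The pen example can be worked through directly. The sketch below fits a straight line by ordinary least squares and compares it with `tree_like_predict`, a simplified piecewise-constant predictor that we use as a stand-in for a single fully grown regression tree (it is not a full random forest). Both function names are our own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def tree_like_predict(xs, ys, x_new):
    """Piecewise-constant prediction, as a fully grown tree would give.

    Split points sit midway between neighbouring training values, and
    each leaf simply returns the cost seen in training for that leaf.
    """
    pairs = sorted(zip(xs, ys))
    prediction = pairs[0][1]
    for (x_left, _), (x_right, y_right) in zip(pairs, pairs[1:]):
        if x_new > (x_left + x_right) / 2:
            prediction = y_right
    return prediction


pens = [1, 2, 3]
cost = [2, 4, 6]

slope, intercept = fit_line(pens, cost)
print("linear regression, 10 pens:", slope * 10 + intercept)       # 20.0
print("tree-like model, 10 pens:", tree_like_predict(pens, cost, 10))  # 6
```

The tree-like model places 10 pens in the same leaf as 3 pens and answers $6. Averaging many such trees, as a random forest does, cannot produce a value above the largest cost seen in training either.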


Copyright © 2015-Present. All rights reserved. Be a cool bean.

