How to create useful features for Machine Learning
Recently, a member of Data School Insiders asked the following question in our private forum:
I’m new to Machine Learning. When I want to create a predictive model, what are the techniques I should use to do “feature engineering”?
Great question! Let’s start with the basics:
What is feature engineering?
Feature engineering is the process of creating features (also called “attributes”) that don’t already exist in the dataset. This means that if your dataset already contains enough “useful” features, you don’t necessarily need to engineer additional features.
But what is a “useful” feature? It’s a feature that your Machine Learning model can learn from in order to more accurately predict the value of your target variable. In other words, it’s a feature that helps your model to make better predictions!
Let’s pretend you have a dataset with a “datetime” column:
- If your goal is to predict the temperature, you might use the datetime column to engineer an integer hour feature (0-23), since the hour of the day is a useful predictor of the temperature.
- If your goal is to predict the number of cars on the road, you might use the datetime column to engineer a boolean is_holiday feature (True/False), since it is a useful predictor of traffic.
Thus when creating features, you should take the target variable (“temperature” or “number of cars”) into account, because different features will be useful for different targets.
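Here’s a minimal sketch of those two engineered features using pandas. The sample timestamps and the holiday list are made up for illustration; only the “datetime” column and the hour and is_holiday feature names come from the example above.

```python
import pandas as pd

# Hypothetical data: four made-up timestamps in a "datetime" column.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-07-03 08:00",
        "2023-07-04 14:30",
        "2023-07-04 23:15",
        "2023-12-25 06:45",
    ])
})

# Assumed holiday calendar; a real project would load one for its region.
holidays = pd.to_datetime(["2023-07-04", "2023-12-25"])

# Integer hour of the day (0-23), a candidate feature for temperature.
df["hour"] = df["datetime"].dt.hour

# Boolean flag: does the date part of the timestamp fall on a listed holiday?
# normalize() resets the time to midnight so it can be compared to the dates.
df["is_holiday"] = df["datetime"].dt.normalize().isin(holidays)

print(df)
```

Both features come from the same column, but which one you’d actually keep depends on the target you’re predicting.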
Considering model type
Ideally, you should also take into account the type of Machine Learning model you’re using:
- If you’re using a linear model (such as linear regression), the hour feature might not be useful for predicting temperature since there’s a non-linear relationship between hour (0-23) and temperature. Instead, you might create an is_night boolean feature that represents the hours 0-4 and 20-23.
- If you’re using a non-linear model (such as decision trees), the hour feature might work well for predicting temperature since decision trees can learn non-linear relationships.
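To make this concrete, here’s a rough sketch with scikit-learn that compares the two model types on synthetic data. The temperature curve (a daily cycle peaking at 14:00 plus random noise) is an assumption I invented for the demo, not real weather data, and the helper name mean_r2 is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): 30 days of hourly readings whose temperature
# follows a daily cycle that peaks at 14:00, plus a little random noise.
rng = np.random.default_rng(0)
hour = np.tile(np.arange(24), 30)
temperature = (
    15
    + 8 * np.sin((hour - 8) * 2 * np.pi / 24)
    + rng.normal(0, 1, hour.size)
)

# Feature 1: the raw integer hour (0-23).
X_hour = hour.reshape(-1, 1)

# Feature 2: is_night, covering hours 0-4 and 20-23 as described above.
is_night = (hour <= 4) | (hour >= 20)
X_night = is_night.astype(int).reshape(-1, 1)


def mean_r2(model, X):
    """Average R^2 across 5 cross-validation folds."""
    return cross_val_score(model, X, temperature, cv=5, scoring="r2").mean()


r2_linear_hour = mean_r2(LinearRegression(), X_hour)
r2_linear_night = mean_r2(LinearRegression(), X_night)
r2_tree_hour = mean_r2(DecisionTreeRegressor(random_state=0), X_hour)

print(f"linear regression, hour:     {r2_linear_hour:.2f}")
print(f"linear regression, is_night: {r2_linear_night:.2f}")
print(f"decision tree, hour:         {r2_tree_hour:.2f}")
```

On this made-up data, the linear model scores better with is_night than with the raw hour, while the decision tree scores well with the raw hour. Real data won’t be this tidy, so treat the comparison as an illustration rather than a rule.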
Art versus science?
There were some insightful comments in the forum about the process of feature engineering:
Feature engineering is more of an art than a formal process. You need to know the domain you’re working with to come up with reasonable features. If you’re trying to predict credit card fraud, for example, study the most common fraud schemes, the main vulnerabilities in the credit card system, which behaviors are considered suspicious in an online transaction, etc. Then, look for data to represent those features, and test which combinations of features lead to the best results.
Also, keep in mind that feature engineering uses different methods on different data types: categorical, numerical, spatial, text, audio, images, missing data, etc.
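As a small illustration of how the method changes with the data type, here’s a pandas sketch covering three of those types. The column names and values are hypothetical, and each technique shown (one-hot encoding, a missing-value indicator with median fill, a word count) is just one common option among many.

```python
import pandas as pd

# Hypothetical dataset with one column per data type.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],            # categorical
    "income": [52000.0, None, 61000.0],         # numerical, one value missing
    "review": ["great product", "bad", "works as expected"],  # text
})

# Categorical: one-hot encode into one indicator column per category.
features = pd.get_dummies(df["color"], prefix="color")

# Missing data: record that the value was missing, then fill the gap.
# (In a real project, compute the median from the training data only.)
features["income_missing"] = df["income"].isna()
features["income"] = df["income"].fillna(df["income"].median())

# Text: a very simple numeric summary, the number of words.
features["review_word_count"] = df["review"].str.split().str.len()

print(features)
```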
If you’d like to learn more about feature engineering, here are a few resources I recommend:
Non-Mathematical Feature Engineering Techniques for Data Science (blog post) includes a handful of examples of basic feature engineering techniques.
A Few Useful Things to Know about Machine Learning is a highly readable paper by Pedro Domingos (author of The Master Algorithm) about feature engineering, overfitting, the curse of dimensionality, and other important Machine Learning topics.
Feature Engineering Made Easy (book) covers the feature engineering workflow in-depth and includes a ton of Python and scikit-learn code. (It was written by Sinan Ozdemir, one of my former data science co-instructors!)
Machine Learning with Text in Python is my online course that gives you hands-on experience with feature engineering, Natural Language Processing, ensembling, model evaluation, and much more to help you master Machine Learning and extract value from your text-based data.
Do you have questions about feature engineering or resources to suggest? Let me know in the comments below!
P.S. Are you new to Machine Learning? Check out my free video series, Introduction to Machine Learning in Python with scikit-learn.