How to encode categorical features for ML with scikit-learn


In order to include categorical features in your Machine Learning model, you have to encode them numerically using “dummy” or “one-hot” encoding. But how do you do this correctly using scikit-learn?

In this 28-minute video, you'll learn:

  • How to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
  • How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously
  • Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
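
To give a rough idea of what the first two points look like in code, here is a minimal sketch. It is not the notebook from the video: the toy dataset, the column names, and the choice of LogisticRegression are all assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy dataset, made up for illustration: two categorical columns and one numeric column
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female",
            "male", "female", "male", "male", "female"],
    "embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "C", "S"],
    "fare": [7.25, 71.28, 7.92, 8.46, 53.10, 30.07, 7.75, 8.05, 27.72, 16.70],
    "survived": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})
X = df.drop(columns="survived")
y = df["survived"]

# One-hot encode the categorical columns and pass the numeric column through unchanged.
# handle_unknown="ignore" avoids an error when a category is missing from a training fold.
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"])],
    remainder="passthrough",
)

# Chain preprocessing and model so both are refit inside every cross-validation fold
pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validate the preprocessing and the model together (cv=2 only because the toy data is tiny)
scores = cross_val_score(pipe, X, y, cv=2, scoring="accuracy")
print(scores.mean())

# Fit on all the data, then predict on new data given in the same raw column layout
pipe.fit(X, y)
X_new = pd.DataFrame({"sex": ["female"], "embarked": ["S"], "fare": [20.0]})
print(pipe.predict(X_new))
```

Because the encoder lives inside the Pipeline, new data can be passed in its raw form and receives the same encoding that was learned during fitting.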

If you want to follow along with the code, you can download the Jupyter notebook from GitHub.

Click on a timestamp below to jump to a particular section:

0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?

P.S. Want to master Machine Learning in Python? Enroll in my online course, Machine Learning with Text in Python!

