Feature Ranking with Recursive Feature Elimination in Scikit-Learn


Feature selection is an important task for any machine learning application. This is especially true when the data in question has many features. The right number of features also leads to improved model accuracy. The most important features and the optimal number of features can be obtained via feature importance or feature ranking. In this piece, we’ll explore feature ranking.

 

Recursive Feature Elimination

 
The first item needed for recursive feature elimination is an estimator; for example, a linear model or a decision tree model.

Linear models have coefficients, and decision tree models have feature importances. To select the optimal number of features, the estimator is trained and the features are scored via the coefficients or the feature importances. The least important features are then removed. This process is repeated recursively until the optimal number of features is obtained.
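To make the idea concrete, here is a minimal hand-written sketch of that elimination loop on a synthetic dataset; the decision tree estimator, the generated feature names, and the stopping point of three features are all arbitrary choices for illustration:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X_arr, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

features = list(X.columns)
while len(features) > 3:  # stop once an arbitrary target of 3 features remains
    tree = DecisionTreeClassifier(random_state=0).fit(X[features], y)
    # Pair each remaining feature with its importance and drop the weakest one
    importances = dict(zip(features, tree.feature_importances_))
    features.remove(min(importances, key=importances.get))

print(features)  # the features that survived the elimination rounds

Scikit-learn’s RFE class, described next, automates this loop and keeps track of when each feature was dropped.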

 

Application in Sklearn

 
Scikit-learn makes it possible to implement recursive feature elimination via the sklearn.feature_selection.RFE class. The class takes the following parameters:

  • estimator — a machine learning estimator that can provide feature importances via the coef_ or feature_importances_ attributes.
  • n_features_to_select — the number of features to select. Half of the features are selected if it isn’t specified.
  • step — an integer that indicates the number of features to remove at each iteration, or a number between 0 and 1 to indicate the percentage of features to remove at each iteration.

Once fitted, the following attributes can be obtained (a short standalone sketch follows the list):

  • ranking_ — the ranking of the features.
  • n_features_ — the number of features that have been selected.
  • support_ — an array that indicates whether or not a feature was selected.
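Here is a quick, self-contained sketch of those parameters and attributes on a synthetic dataset; the logistic regression estimator and the choice of four features are assumptions made purely for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000),  # provides coef_
          n_features_to_select=4,                       # keep 4 features
          step=1)                                       # drop one feature per iteration
rfe.fit(X, y)

print(rfe.n_features_)  # number of features selected
print(rfe.support_)     # boolean mask: True for selected features
print(rfe.ranking_)     # rank 1 = selected; higher ranks were eliminated earlier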

 

Application

 
As noted earlier, we’ll need to work with an estimator that offers a feature_importances_ or coef_ attribute. Let’s work through a quick example. The dataset has 13 features, and we’ll work on getting the optimal number of features.

import pandas as pd

df = pd.read_csv('heart.csv')
df.head()


Let’s obtain the features X and the target y.

X = df.drop(['target'], axis=1)
y = df['target']

We’ll split it into a training and testing set to prepare for modeling:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)

Let’s get a few imports out of the way:

  • Pipeline — since we’ll perform some cross-validation, it’s best practice to use a pipeline so as to avoid data leakage.
  • RepeatedStratifiedKFold — for repeated stratified cross-validation.
  • cross_val_score — for evaluating the score with cross-validation.
  • GradientBoostingClassifier — the estimator we’ll use.
  • numpy — so that we can compute the mean of the scores.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

The first step is to create an instance of the RFE class while specifying the estimator and the number of features you’d like to select. In this case, we’re selecting 6:

rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=6)

Next, we create an instance of the model we’d like to use:

model = GradientBoostingClassifier()

We’ll use a Pipeline to transform the data. In the pipeline, we specify rfe for the feature selection step and the model that’ll be used in the next step.

We then specify a RepeatedStratifiedKFold with 10 splits and 5 repeats. The stratified K-fold ensures that the number of samples from each class is well balanced in each fold. RepeatedStratifiedKFold repeats the stratified K-fold the specified number of times, with a different randomization in each repetition.

pipe = Pipeline([('Feature Selection', rfe), ('Model', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)
n_scores = cross_val_score(pipe, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
np.mean(n_scores)

The next step is to fit this pipeline to the dataset.

pipe.fit(X_train, y_train)

With that in place, we can check the support and the ranking. The support indicates whether or not a feature was selected.

rfe.support_
array([ True, False,  True, False,  True, False, False,  True, False,  True, False,  True,  True])

We can put that into a dataframe and check the result.

pd.DataFrame(rfe.support_, index=X.columns, columns=['Rank'])


We can also check the relative rankings.

rf_df = pd.DataFrame(rfe.ranking_, index=X.columns, columns=['Rank']).sort_values(by='Rank', ascending=True)
rf_df.head()


 

Automatic Feature Selection

 
Instead of manually configuring the number of features, it would be nice if we could select them automatically. This can be achieved via recursive feature elimination and cross-validation, which is implemented in the sklearn.feature_selection.RFECV class. The class takes the following parameters:

  • estimator — similar to the RFE class.
  • min_features_to_select — the minimum number of features to be selected.
  • cv — the cross-validation splitting strategy.

The attributes returned are (a short sketch follows the list):

  • n_features_ — the optimal number of features selected via cross-validation.
  • support_ — the array containing information on the selection of a feature.
  • ranking_ — the ranking of the features.
  • grid_scores_ — the scores obtained from cross-validation.
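As a quick illustration of these parameters and attributes, here is a minimal sketch on a synthetic dataset; the estimator, fold count, scoring metric, and the rfecv_demo name are all assumptions chosen for demonstration:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=13, n_informative=5, random_state=0)

rfecv_demo = RFECV(estimator=GradientBoostingClassifier(),
                   min_features_to_select=3,        # never go below 3 features
                   cv=StratifiedKFold(n_splits=5),  # the CV splitting strategy
                   scoring='accuracy')
rfecv_demo.fit(X, y)

print(rfecv_demo.n_features_)  # optimal number of features found by cross-validation
print(rfecv_demo.support_)     # which features were kept
print(rfecv_demo.ranking_)     # feature ranking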

The first step is to import the class and create an instance of it.

from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator=GradientBoostingClassifier())

The next step is to specify the pipeline and the cv. In this pipeline we use the just-created rfecv.

pipeline = Pipeline([('Feature Selection', rfecv), ('Model', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)
n_scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
np.mean(n_scores)

Let’s fit the pipeline and then obtain the optimal number of features.

pipeline.fit(X_train, y_train)

The optimal number of features can be obtained via the n_features_ attribute.

print("Optimal number of features : %d" % rfecv.n_features_)
Optimal number of features : 7

The rankings and support can be obtained just like last time.

rfecv.support_
rfecv_df = pd.DataFrame(rfecv.ranking_, index=X.columns, columns=['Rank']).sort_values(by='Rank', ascending=True)
rfecv_df.head()

With grid_scores_ we can plot a graph showing the cross-validated scores.

import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
Figure: Number of features against the accuracy plot
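Note that newer scikit-learn releases deprecated and then removed grid_scores_. If your installed version no longer has it, the equivalent per-step scores are exposed through the cv_results_ mapping; here is a minimal sketch under that assumption, reusing the fitted rfecv from above:

import matplotlib.pyplot as plt

# cv_results_['mean_test_score'] holds one mean CV score per number of features,
# starting from min_features_to_select (1 by default, as used above)
mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure(figsize=(12, 6))
plt.xlabel("Number of features selected")
plt.ylabel("Mean cross validation score")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()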

 

Final Thoughts

 
The process for applying this to a regression problem is the same; just make sure to use regression metrics instead of accuracy. I hope this piece has given you some insight into selecting the optimal number of features for your machine learning problems.
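As a rough illustration of that point, here is a sketch of the same pipeline structure on a regression problem; the synthetic dataset, the gradient boosting regressor, and the negative mean absolute error metric are all assumptions chosen for demonstration:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data purely for illustration
X_reg, y_reg = make_regression(n_samples=500, n_features=13, n_informative=6,
                               noise=10, random_state=0)

# Same structure as before: a feature selection step followed by the model
reg_pipe = Pipeline([('Feature Selection', RFECV(estimator=GradientBoostingRegressor())),
                     ('Model', GradientBoostingRegressor())])

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=36851234)
scores = cross_val_score(reg_pipe, X_reg, y_reg,
                         scoring='neg_mean_absolute_error',  # a regression metric
                         cv=cv, n_jobs=-1)
print(np.mean(scores))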

Code: mwitiderrick/Feature-Ranking-with-Recursive-Feature-Elimination (GitHub)
 

 
Bio: Derrick Mwiti is a data analyst, a writer, and a mentor. He is driven by delivering great results in every task, and is a mentor at Lapid Leaders Africa.

Original. Reposted with permission.
