## MLxtend: A Library with Interesting Tools for Data Science Tasks | by Esmaeil Alizadeh | Dec, 2020


The MLxtend library was developed by Sebastian Raschka (a professor of statistics at the University of Wisconsin-Madison). It has good API documentation as well as many examples.

You can install the MLxtend package from the Python Package Index (PyPI) by running `pip install mlxtend`.

In this post, I’m using the wine data set obtained from Kaggle. The data contains 13 attributes for three types of wine. This is a multiclass classification dataset, and you can find the description of the dataset here.

First, let’s import the data and prepare the input variables X (feature set) and the output variable y (target).
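The loading step can be sketched as follows. The Kaggle file used in the post is not reproduced here, so this sketch assumes that scikit-learn's bundled copy of the classic wine data (13 features, three classes) is an acceptable stand-in; the variable names are illustrative.

```python
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled wine data stands in for the Kaggle CSV.
wine = load_wine()

X = wine.data                        # feature set, one row per wine sample
y = wine.target                      # target: integer class labels 0, 1, 2
feature_names = wine.feature_names   # names of the 13 attributes

print(X.shape, y.shape)
print(feature_names)
```

If you work from a CSV file instead, `pandas.read_csv()` followed by splitting off the target column gives the same `X` and `y`.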

To create counterfactual records (in the context of machine learning), we modify the features of some records from the training set in order to change the model prediction [2]. This can help explain the behavior of a trained model. The algorithm the library uses to create counterfactual records was developed by Wachter *et al.* [3].

You can create counterfactual records using *create_counterfactual()* from the library. Note that this implementation works with any scikit-learn estimator that supports the `predict()` function. Below is an example of creating a counterfactual record for an ML model. The counterfactual record is highlighted as a red dot within the classifier’s decision regions (we will go over how to draw decision regions of classifiers later in the post).

An interesting and different way to look at PCA results is through a correlation circle that can be plotted using *plot_pca_correlation_graph()*. We basically compute the correlation between the original dataset columns and the PCs (principal components). Then, these correlations are plotted as vectors on a unit circle. The axes of the circle are the selected dimensions (*a.k.a.* PCs). You can specify the PCs you’re interested in by passing them as a tuple to the `dimensions` function argument. The correlation circle axes labels show the percentage of the explained variance for the corresponding PC [1].

Remember that normalization is important in PCA because PCA projects the original data onto the directions that maximize the variance, so features measured on larger scales would otherwise dominate the components.

You often hear about the bias-variance tradeoff in discussions of model performance. In supervised learning, the goal often is to minimize both the bias error (to prevent underfitting) and the variance (to prevent overfitting) so that our model can generalize beyond the training set [4]. Balancing the two is known as the bias-variance tradeoff.

Note that we cannot calculate the actual bias and variance for a predictive model. The bias-variance tradeoff is a concept that an ML engineer should always keep in mind while trying to find a sweet spot between the two.

Having said that, we can still study the model’s expected generalization error for certain problems. In particular, we can use the bias-variance decomposition to decompose the generalization error into a sum of 1) bias, 2) variance, and 3) *irreducible error* [4, 5].
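For reference, the textbook form of this decomposition for squared-error loss is written out below. It assumes data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and expectations taken over training sets and noise; these symbols are introduced here for illustration and do not appear in the original post.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The last term does not depend on the model, which is why it is called irreducible.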

The bias-variance decomposition can be computed with *bias_variance_decomp()* from the library. An example for a decision tree classifier is given below.

`>>> Average expected loss: 0.108`

`>>> Average bias: 0.032`

`>>> Average variance: 0.076`

The MLxtend library has an out-of-the-box function, *plot_decision_regions()*, to draw a classifier’s decision regions in 1 or 2 dimensions.

Here, I will draw decision regions for several scikit-learn as well as MLxtend models. Let’s first import the models and initialize them.

Now that we have initialized all the classifiers, let’s train the models and draw decision boundaries using *plot_decision_regions()* from the MLxtend library.

Another useful tool from MLxtend is the ability to draw a matrix of scatter plots for features (using *scatterplotmatrix()*). In order to add another dimension to the scatter plots, we can also assign different colors for different target classes.

By the way, for plotting similar scatter plots, you can also use Pandas’ *scatter_matrix()* or seaborn’s *pairplot()* function.

The bootstrap is an easy way to estimate a sample statistic and generate the corresponding confidence interval by drawing random samples with replacement. For this, you can use the function *bootstrap()* from the library. Note that you can pass a custom statistic to the bootstrap function through the `func` argument. The custom function must return a scalar value.
