How to update your scikit-learn code for 2018


July 4, 2018 · Python machine learning

In 2015, I created a 4-hour video series called Introduction to machine learning in Python with scikit-learn. In the years since, hundreds of thousands of students have watched these videos, and thousands continue to do so every month.

At the time of the recording, I was using Python 2.7 and scikit-learn 0.16. Although the video content remains fully relevant, some of the code is now outdated due to changes in Python and scikit-learn.

I recently updated the Jupyter notebooks shown in the videos to use Python 3.6 and scikit-learn 0.19.1 in order to take advantage of the latest language features. (You can download the updated notebooks from GitHub.) During this process, I documented my changes (below) so that others can have an easier time updating their own code.

Of course, this isn't an exhaustive list of all scikit-learn changes; rather, it only includes the changes that affected my code. The only way to truly keep up with changes to the library is to read the detailed scikit-learn release notes.

I hope this is helpful to you. Please let me know in the comments section below if you have any questions!

Contents

Part 1: scikit-learn changes

Part 2: Python changes

Part 3: Other changes


Model evaluation classes and functions have been moved

What changed: In scikit-learn 0.18, the classes and functions from the cross_validation, grid_search, and learning_curve modules were moved into a new model_selection module.

How to update your code: You need to update the import statements.

Before:

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV

After:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

Further reading: Model Selection Enhancements and API Changes


Grid search and randomized search have changed how they report results

What changed: Starting in scikit-learn 0.18, the results of a grid search or randomized search are accessed via the cv_results_ attribute, replacing the grid_scores_ attribute.

Explanation: The grid_scores_ attribute was a list of named tuples, in which each tuple represented the results of testing a single set of parameters. The cv_results_ attribute, on the other hand, is a dictionary of 1D arrays, in which each array represents a single metric (such as mean_test_score) across all sets of parameters. The structure was changed so that the results can easily be converted into a pandas DataFrame, which is especially useful since cv_results_ provides significantly more information about the search results than grid_scores_ did.
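
For example, on a fitted GridSearchCV object (assumed here to be called grid, matching the code below), you can list the dictionary keys to see which metrics are available before converting them to a DataFrame:

# each key is a metric name, and each value is a 1D array with one entry per parameter set
sorted(grid.cv_results_.keys())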

How to update your code: You should convert cv_results_ to a DataFrame (as shown below) before exploring the results.

Before:

# view the mean and standard deviation of the test scores for each set of parameters
grid.grid_scores_

# examine the results of the first set of parameters
grid.grid_scores_[0].parameters
grid.grid_scores_[0].mean_validation_score

# list all of the mean test scores
[result.mean_validation_score for result in grid.grid_scores_]

After:

# convert the search results into a pandas DataFrame
import pandas as pd
results = pd.DataFrame(grid.cv_results_)

# view the mean and standard deviation of the test scores for each set of parameters
results[['mean_test_score', 'std_test_score', 'params']]

# examine the results of the first set of parameters
results['params'][0]
results['mean_test_score'][0]

# list all of the mean test scores
results['mean_test_score']

Note: The best_estimator_, best_score_, and best_params_ attributes are still available and did not change.
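
For example, after fitting the grid object above, the single best result is still accessed the same way (a quick sketch, assuming the same fitted grid):

# best mean cross-validated score, the parameters that produced it, and the refit estimator
grid.best_score_
grid.best_params_
grid.best_estimator_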

Further reading: Model Selection Enhancements and API Changes

Related notebook: Efficiently searching for optimal tuning parameters


Grid search and randomized search can return training scores

What changed: Starting in scikit-learn 0.18, grid search and randomized search can optionally calculate the training scores for each cross-validation split by setting return_train_score=True. Starting in scikit-learn 0.19.1, the default value of return_train_score was changed from True to 'warn' to alert users that calculating training scores may slow down the search considerably.

Explanation: Calculating the training scores is not required in order to select the best set of parameters, and is only useful for gaining insight into how different parameter settings affect the overfitting/underfitting trade-off.

How to update your code: You should explicitly set return_train_score=False unless you specifically need to calculate the training scores.

Before:

grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=20)

After:

grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=20, return_train_score=False)
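
If you do want the training scores in order to diagnose overfitting, one possible approach (a sketch, assuming the grid object above was instead fit with return_train_score=True) is to compare the train and test columns of cv_results_:

# compare the mean training and testing scores for each set of parameters
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
results[['params', 'mean_train_score', 'mean_test_score']]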

Further reading: scikit-learn 0.19.1 release notes

Related notebook: Efficiently searching for optimal tuning parameters


Scoring parameters for loss functions have been renamed

What changed: Starting in scikit-learn 0.18, the names of scoring parameters for which "lower is better" are now prefixed by 'neg_', such as 'neg_mean_squared_error'.

Explanation: Some model evaluation metrics (known as "reward functions") have the property that higher values are better than lower values, such as accuracy, precision, and recall. Other metrics (known as "loss functions") have the property that lower values are better, such as log loss, mean absolute error, and mean squared error. Because optimization tools such as GridSearchCV are built to maximize the evaluation metric (meaning they always treat higher values as better than lower values), scikit-learn automatically negates the scores any time a loss function is chosen as the scoring parameter. The negation of scores still takes place in scikit-learn 0.18 (and beyond), but the affected scoring parameters have been renamed in order to reduce confusion.

How to update your code: Any time you're using a loss function as a scoring parameter, you need to add the 'neg_' prefix to the parameter name. Currently, this includes: 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', and 'neg_median_absolute_error'.

Before:

cross_val_score(linreg, X, y, cv=10, scoring='mean_squared_error')

After:

cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')
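
Keep in mind that the returned scores will be negative. One common pattern (a sketch, assuming the linreg estimator and the X and y used above) is to flip the sign and take the square root to obtain RMSE values:

import numpy as np

# scores are negative MSE values, one per cross-validation fold
scores = cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')

# negate the scores to get MSE, then take the square root to get RMSE
rmse_scores = np.sqrt(-scores)
rmse_scores.mean()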

Note: This change only affects classes and functions with a scoring parameter, such as cross_val_score and GridSearchCV. The functions in the metrics module, such as metrics.mean_squared_error, have not been renamed because they continue to output positive scores.

Further reading: The scoring parameter: defining model evaluation rules

Related notebook: Cross-validation for parameter tuning, model selection, and feature selection


Only 2D data arrays can be passed to models

What changed: Starting in scikit-learn 0.17, only 2D data arrays can be passed to models as input. 1D data arrays are no longer accepted.

Explanation: When you pass input data to a model (to fit or predict, for example), the data must now be explicitly shaped (n_samples, n_features). In other words, each row of the array should represent a sample, and each column should represent a feature. Prior to scikit-learn 0.17, you could pass a 1D data array to a model, and it would infer how that array should be interpreted. That is no longer allowed because it can cause confusion about whether the array elements should be interpreted as samples or as features.

How to update your code: If you try to pass a list such as [3, 5, 4, 2] to a model, it will be interpreted as a 1D array of shape (4,) and will not be accepted. If you meant for it to be interpreted as 1 sample with 4 features, then its shape needs to be changed to (1, 4); three options for accomplishing this are shown below. If you meant for it to be interpreted as 4 samples with 1 feature, then its shape needs to be changed to (4, 1), as in the sketch that follows the three options.

Before:

knn.predict([3, 5, 4, 2])

After:

# option 1: pass the data as a nested list, which will be interpreted as having shape (1, 4)
knn.predict([[3, 5, 4, 2]])

# option 2: explicitly change the shape to be (1, 4)
import numpy as np
knn.predict(np.reshape([3, 5, 4, 2], (1, 4)))

# option 3: explicitly change the first dimension to be 1, and let NumPy infer that the second dimension should be 4
knn.predict(np.reshape([3, 5, 4, 2], (1, -1)))
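
Similarly, if you meant for [3, 5, 4, 2] to be interpreted as 4 samples with 1 feature, a sketch (using a hypothetical model that was trained on a single feature) would be:

# explicitly change the shape to be (4, 1)
model.predict(np.reshape([3, 5, 4, 2], (4, 1)))

# or let NumPy infer that the first dimension should be 4
model.predict(np.reshape([3, 5, 4, 2], (-1, 1)))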

Related notebook: Training a machine learning model with scikit-learn


Print is no longer a statement

What changed: Starting in Python 3, print is a function rather than a statement.

How to update your code: You need to convert your print statements to function calls.

Before:

print X.shape

After:

print(X.shape)

Further reading: What's New In Python 3.0


Many Python 3 functions output iterators instead of lists

What changed: Starting in Python 3, the range and zip functions (among others) return iterators instead of lists.

How to update your code: If you need an actual list, you can explicitly convert the output of range and zip using the list function.

Before:

k_range = range(1, 26)
print(zip(feature_cols, linreg.coef_))

After:

k_range = list(range(1, 26))
print(list(zip(feature_cols, linreg.coef_)))

Further reading: Python 3's range is more powerful than Python 2's xrange


IPython Notebook is now called Jupyter Notebook

What changed: Starting in late 2015, the official name of the "IPython Notebook" was changed to "Jupyter Notebook".

Explanation: Originally, IPython was an interactive Python shell, and the IPython Notebook was a browser-based interactive environment that used IPython as its "kernel" (execution environment). Over time, the IPython Notebook gained support for other kernels (such as Julia and R) and thus became language agnostic. The name was changed from "IPython Notebook" to "Jupyter Notebook" to avoid implying that it only supported the Python programming language, though IPython is still the default kernel for the Notebook.

How to update your code: Assuming you have the Jupyter Notebook installed, you should type jupyter notebook at the command line (instead of ipython notebook) to open the Notebook dashboard.

Further reading: The Big Split

Related notebook: Setting up Python for machine learning: scikit-learn and Jupyter Notebook


External datasets have been moved to the GitHub repository

What changed: The code from the video series relied on two external datasets, which have now been moved to the GitHub repository.

Explanation: In the video series, I used two external datasets as examples, and read the files into pandas via URL. One of those files has since been taken offline, and the other file has since been modified, which broke my code. To guard against these problems occurring again, I located the original files, moved them to the GitHub repository, and now refer to them in the code using relative paths.

How to update your code: When reading in the files, refer to them using relative paths (as shown below). Note that this will only work if the data files are on your local machine in a data subdirectory, which can be achieved by cloning or downloading the GitHub repository.

Before:

# read the advertising dataset via URL
url = 'http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv'
data = pd.read_csv(url, index_col=0)

# read the diabetes dataset via URL
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
pima = pd.read_csv(url, header=None, names=col_names)

After:

# read the advertising dataset via a relative path
path = 'data/Advertising.csv'
data = pd.read_csv(path, index_col=0)

# read the diabetes dataset via a relative path
path = 'data/pima-indians-diabetes.data'
pima = pd.read_csv(path, header=None, names=col_names)


