Selecting the best model in scikit-learn using cross-validation

In this video, we’ll learn about K-fold cross-validation and how it can be used for selecting optimal tuning parameters, choosing between models, and selecting features. We’ll compare cross-validation with the train/test split procedure, and we’ll also discuss some variations of cross-validation that can result in more accurate estimates of model performance.
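
For example, here is a minimal sketch of K-fold cross-validation in scikit-learn (the dataset and model below are placeholders, not necessarily the ones used in the video):

# a minimal sketch of 10-fold cross-validation for model evaluation and parameter tuning
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# evaluate KNN with K=5 using 10-fold cross-validation and classification accuracy
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())

# tuning: try a range of K values and keep the one with the best mean accuracy
k_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10,
                            scoring='accuracy').mean() for k in range(1, 31)]
print(max(k_scores))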

Download the notebook: https://github.com/justmarkham/scikit-learn-videos
Documentation on cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html
Documentation on model evaluation: http://scikit-learn.org/stable/modules/model_evaluation.html
GitHub issue on negative mean squared error: https://github.com/scikit-learn/scikit-learn/issues/2439
An Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/
K-fold and leave-one-out cross-validation: https://www.youtube.com/watch?v=nZAM5OXrktY
Cross-validation the right and wrong ways: https://www.youtube.com/watch?v=S06JpVoNaA0
Accurately Measuring Model Prediction Error: http://scott.fortmann-roe.com/docs/MeasuringError.html
An Introduction to Feature Selection: http://machinelearningmastery.com/an-introduction-to-feature-selection/
Harvard CS109: https://github.com/cs109/content/blob/master/lec_10_cross_val.ipynb
Cross-validation pitfalls: http://www.jcheminf.com/content/pdf/1758-2946-6-10.pdf

WANT TO GET BETTER AT MACHINE LEARNING? HERE ARE YOUR NEXT STEPS:

1) WATCH my scikit-learn video series:
https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A

2) SUBSCRIBE for more videos:
https://www.youtube.com/dataschool?sub_confirmation=1

3) JOIN “Data School Insiders” to access bonus content:
https://www.patreon.com/dataschool

4) ENROLL in my Machine Learning course:
https://www.dataschool.io/learn/

5) LET’S CONNECT!
– Newsletter: https://www.dataschool.io/subscribe/
– Twitter: https://twitter.com/justmarkham
– Facebook: https://www.facebook.com/DataScienceSchool/
– LinkedIn: https://www.linkedin.com/in/justmarkham/


Comment List

  • Data School
    November 29, 2020

    Note: This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: https://github.com/justmarkham/scikit-learn-videos

  • Data School
    November 29, 2020

    Liking something has always been driven by spontaneity, but never without reasons. I watched all your videos with great enthusiasm, and the simple reason is that they are just BRILLIANT!

  • Data School
    November 29, 2020

    Another great class!!!!

  • Data School
    November 29, 2020

    Clear and neat presentation!

  • Data School
    November 29, 2020

    Great video! The only line of code I needed to update was reshaping the data before passing it to the binarize function, and then flattening the returned ndarray:

    y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()

  • Data School
    November 29, 2020

    Great videos

  • Data School
    November 29, 2020

    really good. thank you.

  • Data School
    November 29, 2020

    Thank you for your super clear and thorough videos. Do you talk about the validation/dev set in your workflow course? I have a hard time with that concept, but my teachers require that we use it, and it gets confusing when I try to select a model.

    I've heard you should tune your models on the validation set and then get an "unbiased" score on the test set, but they also say you're supposed to select your model according to the validation score. I don't understand why, because I thought that score was biased, since we used it to tune the models. Is the validation score still the better of the two for selecting the model, because if you use the test set to select the model, you're biased toward the test set and then have no unbiased score left?

    I almost feel like there should be one more split in the data, to create a second test set: the first test set would be used to choose the model, and the second would give an unbiased score. Honestly, there is a lot of conflicting information about this entire topic on the internet. I've seen people say the model should be chosen based on the test set, and that the validation set is only there to help tune the model. And cross-validation is not always possible, because it can be very slow with big datasets.
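
    One common pattern, shown here only as a sketch and not necessarily the workflow recommended in the course: hold out a test set once, use cross-validation on the remaining data in place of a fixed validation set to tune and select the model, then touch the test set a single time for the final score.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)  # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # cross-validation on the training set plays the role of the validation set
    grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 31))},
                        cv=10, scoring='accuracy')
    grid.fit(X_train, y_train)

    # the held-out test set is used only once, for the final estimate
    print(grid.best_params_, grid.score(X_test, y_test))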

  • Data School
    November 29, 2020

    I just want to know: how do we use cross-validation with a non-negative matrix factorization model?
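
    One common approach, sketched here under the assumption that NMF is being used as a feature-extraction step: put it in a Pipeline in front of a supervised estimator and cross-validate the whole pipeline.

    from sklearn.datasets import load_digits
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)  # non-negative pixel values, so NMF applies

    # NMF learns the components on each fold's training data only, avoiding leakage
    pipe = make_pipeline(NMF(n_components=16, max_iter=500), LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean())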

  • Data School
    November 29, 2020

    best tutorial in youtube

  • Data School
    November 29, 2020

    For those who are getting an import error with "cross_validation": use "from sklearn.model_selection import …" instead.
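
    For example, the updated imports look like this (for scikit-learn 0.18 and later):

    # old (scikit-learn < 0.18): from sklearn.cross_validation import cross_val_score, train_test_split
    # new (scikit-learn >= 0.18):
    from sklearn.model_selection import cross_val_score, train_test_split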

  • Data School
    November 29, 2020

    The best tutorials ever!!

  • Data School
    November 29, 2020

    Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? The previous one was fitted using TV, Radio, and Newspaper, but this new one only involves TV and Radio.
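
    For context, cross_val_score clones and refits the estimator inside every fold, so the earlier fit is never reused; a sketch, assuming a local copy of the Advertising dataset with the usual column names:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # 'Advertising.csv' is assumed to be a local copy of the dataset used in the video
    data = pd.read_csv('Advertising.csv', index_col=0)
    lm = LinearRegression()
    y = data['Sales']

    # cross_val_score refits a fresh clone of lm on each fold, so no manual refit is needed
    mse_all = -cross_val_score(lm, data[['TV', 'Radio', 'Newspaper']], y, cv=10,
                               scoring='neg_mean_squared_error')
    mse_two = -cross_val_score(lm, data[['TV', 'Radio']], y, cv=10,
                               scoring='neg_mean_squared_error')
    print(mse_all.mean(), mse_two.mean())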

  • Data School
    November 29, 2020

    All your videos W.I.N – you’re the best

  • Data School
    November 29, 2020

    Thank you for the clear explanation. So helpful

  • Data School
    November 29, 2020

    First of all, let me thank you for the extraordinary work you are doing. You explain extraordinary things in a very ordinary way.

    I had a query regarding the fit with cross-validation. When we do linear regression with a single train/test split, we get a single fitted model to predict on the test data. Whereas when we do cross-validation (e.g. cv=10) in linear regression, we have 10 training datasets. Do we also have 10 fitted models, one for each training set, or do we have an average of all 10 fitted models?

    I am able to get the coefficients and intercept of the fitted model from a single training dataset:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
    lm = LinearRegression()
    fitting = lm.fit(X_train, y_train)
    fitting.coef_
    fitting.intercept_

    How do I get the intercept and coefficients for the fitted model via cross-validation?

    Also, what is the significance of cross_val_predict? Does it have any relation to my query?
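
    A sketch of one way to look at this (not from the video): cross_val_score fits a separate model in each fold and does not average them, and the per-fold scores are all it returns. To inspect each fold's coefficients you can loop over KFold splits yourself; the dataset below is a placeholder.

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = load_diabetes(return_X_y=True)  # placeholder regression data
    lm = LinearRegression()

    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
        lm.fit(X[train_idx], y[train_idx])  # a fresh fit for this fold
        print(lm.intercept_, lm.coef_[:3])  # each fold has its own parameters

    cross_val_predict, by contrast, collects each fold's out-of-sample predictions into a single array; it does not average the 10 models either.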

  • Data School
    November 29, 2020

    When we use cross_val_score with X and y, is the model trained there, or do we need to train it again?
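
    A short sketch of the usual pattern (an assumption, not stated in the video): cross_val_score trains temporary copies of the model for evaluation only, so you still fit once on all the data before making real predictions.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)  # placeholder data
    knn = KNeighborsClassifier(n_neighbors=5)

    print(cross_val_score(knn, X, y, cv=10).mean())  # estimates out-of-sample accuracy
    knn.fit(X, y)                                    # the model you actually use for predictions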

  • Data School
    November 29, 2020

    THE best tutorial ever! Thank u

  • Data School
    November 29, 2020

    another hit.. Bamm!!!!!!

  • Data School
    November 29, 2020

    Hi Kevin, how are you?! First of all, thanks for your videos; you are awesome for posting them for us, and you are also a great teacher and very good at explaining.

    I have two doubts:
    1. Why is cross-validation better than fitting a model (or training it) with all the data?
    2. Is cross-validation a useful method for time series? (See the sketch after this list.)
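
    On question 2, a sketch of what scikit-learn offers for time series (an aside, not covered in the video): TimeSeriesSplit keeps the folds in chronological order, so the model is never trained on data from the future. The data below is a made-up placeholder.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    rng = np.random.RandomState(0)
    X = np.arange(100).reshape(-1, 1)           # placeholder: time index as the only feature
    y = 0.5 * X.ravel() + rng.normal(size=100)  # placeholder: trend plus noise

    scores = cross_val_score(LinearRegression(), X, y,
                             cv=TimeSeriesSplit(n_splits=5),
                             scoring='neg_mean_squared_error')
    print(-scores)  # MSE on each successive, later test window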

  • Data School
    November 29, 2020

    Awesome, sir. Thank you!

  • Data School
    November 29, 2020

    Hey, I have a doubt. If I use R² as the scoring metric in cross-validation and get negative results, should I flip the signs the same way as with MSE scoring?

  • Data School
    November 29, 2020

    Hi! Keep up the good work! I would suggest that a future video also cover nested cross-validation, for two reasons: 1) estimating the true generalization error, and 2) comparing between models. Thank you!

  • Data School
    November 29, 2020

    @Data School: Hi sir, I have been working with a customer churn dataset. It contains 21 features, including the target variable churn, which is an integer type. While doing missing-value treatment, I tried mean imputation on the dependents and city columns (both numeric in nature); the mean value was a fraction (i.e. float64). This results in an error in train_test_split(): "TypeError: Singleton array cannot be considered a valid collection". I searched on Stack Overflow, and they said the error might be caused by the float value. If I go with mean imputation, what would be a possible solution? I would be grateful for your response.

  • Data School
    November 29, 2020

    Thank you so much for your effort; your videos are very helpful. I am doing neural network regression modelling using the Keras library. If you could, please do a video that shows us how to best tune the parameters in the network, and which parameters effectively require tuning. Thank you so much!

  • Data School
    November 29, 2020

    Hi, thank you so much for this video, but I still have one question: after all that, do we still have to use train_test_split to split our data, and will random_state still cause variance? If you post a video of how you build a model from scratch, I would definitely pay for it, lol.

  • Data School
    November 29, 2020

    Thank you so much. Your tutorials and style of explaining are exceptional.

  • Data School
    November 29, 2020

    Newbie question here:

    After the whole cross-validation process, do fitting the data and making predictions require a different process? Or do you just fit the training data as normal and then make predictions on your test data?

  • Data School
    November 29, 2020

    You are charismatic, thank you!

  • Data School
    November 29, 2020

    Outstanding work once again Kevin. A treasure to newcomers in the area.
