## Data science in Python: pandas, seaborn, scikit-learn

In this video, we’ll cover the data science pipeline from data ingestion (with pandas) to data visualization (with seaborn) to machine learning (with scikit-learn). We’ll learn how to train and interpret a linear regression model, and then compare three possible evaluation metrics for regression problems. Finally, we’ll apply the train/test split procedure to decide which features to include in our model.
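
As a rough sketch of the pipeline described above (not the video's own code): the example below builds a small synthetic DataFrame as a stand-in for the advertising data, fits a linear regression on a training split, and compares MAE, MSE and RMSE on the test split for two candidate feature sets. The column names, coefficients and noise level are assumptions made for illustration, and the seaborn visualization step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an advertising dataset (all values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 200)


def evaluate(feature_cols):
    """Fit on a training split and return MAE, MSE and RMSE on the test split."""
    X, y = data[feature_cols], data["Sales"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {"MAE": mean_absolute_error(y_test, y_pred), "MSE": mse, "RMSE": np.sqrt(mse)}


# Compare two candidate feature sets on the same split.
with_newspaper = evaluate(["TV", "Radio", "Newspaper"])
without_newspaper = evaluate(["TV", "Radio"])
print(with_newspaper)
print(without_newspaper)
```

A lower test-set RMSE for one feature set is the kind of evidence the train/test split procedure uses to decide whether a feature is worth keeping.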

pandas installation instructions: http://pandas.pydata.org/pandas-docs/stable/install.html
seaborn installation instructions: http://seaborn.pydata.org/installing.html
Longer linear regression notebook: https://github.com/justmarkham/DAT5/blob/master/notebooks/09_linear_regression.ipynb
Chapter 3 of Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/
Videos related to Chapter 3: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
Quick reference guide to linear regression: https://www.dataschool.io/applying-and-interpreting-linear-regression/
Introduction to linear regression: http://people.duke.edu/~rnau/regintro.htm
pandas Q&A video series: https://www.dataschool.io/easier-data-analysis-with-pandas/
pandas 3-part tutorial: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
seaborn tutorial: http://seaborn.pydata.org/tutorial.html
seaborn example gallery: http://seaborn.pydata.org/examples/index.html

WANT TO GET BETTER AT MACHINE LEARNING? HERE ARE YOUR NEXT STEPS:

1) WATCH my scikit-learn video series:

2) SUBSCRIBE for more videos:

3) JOIN “Data School Insiders” to access bonus content:
https://www.patreon.com/dataschool

4) ENROLL in my Machine Learning course:
https://www.dataschool.io/learn/


### Comment List

• Data School
November 16, 2020

Note: This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: https://github.com/justmarkham/scikit-learn-videos

• Data School
November 16, 2020

When I use seaborn's pairplot on the data, it doesn't show data for the first column, i.e. 'TV'.

• Data School
November 16, 2020

Dude, you're the best professor ever! Thanks a lot

• Data School
November 16, 2020

Do you have a tutorial that covers sklearn.datasets?
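
For readers with the same question, a minimal sketch of two `sklearn.datasets` functions: `load_iris` returns a small bundled dataset, and `make_regression` generates synthetic regression data. The sample sizes and noise level below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris, make_regression

# A bundled toy dataset: feature matrix and target vector.
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)
print(iris.feature_names)

# Synthetic regression data with a chosen number of samples and features.
X_syn, y_syn = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=1)
print(X_syn.shape, y_syn.shape)
```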

• Data School
November 16, 2020

a nice video

• Data School
November 16, 2020

Hi, the file URL isn't valid. Can you please share it?

• Data School
November 16, 2020

If I start from hundreds of features, is there a way to automatically test combinations?
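
One possible approach to the question above, sketched under assumptions: scikit-learn's `SequentialFeatureSelector` (available from version 0.24) adds features greedily, scoring each candidate with cross-validation. It does not test every combination, which becomes infeasible with hundreds of features. The synthetic data here, with six features of which only the first two matter, is made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 1 influence y (an assumption for the demo).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Greedy forward selection, scored by 5-fold cross-validated MSE.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="neg_mean_squared_error",
    cv=5,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())
print(selected)
```

Because the search is greedy, it can miss a combination that only works well together; it trades that risk for a much smaller number of model fits.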

• Data School
November 16, 2020

Hi Kevin, I'm new to both Python and machine learning. Your tutorials are great learning materials. I understand this is a 5-year-old presentation, and I'm wondering if you would still answer a question I have related to this tutorial. Specifically, when I was trying to get the pairplots you demonstrated, I got the following error: KeyError: "['Sales'] not in index", and I got three blank boxes. What was wrong? Many thanks for your help. FYI, I also tried Googling and haven't been able to find any answers that work.
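
A `KeyError` like the one quoted above usually means a requested column name is not in the DataFrame exactly as typed. One plausible cause, offered as an assumption rather than a diagnosis, is that the copy of the data being loaded uses different capitalization (for example `sales` instead of `Sales`). The sketch below uses a small hypothetical DataFrame to show how to check for missing names and rename columns before plotting; `plot_pairs` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical copy of the data whose column names are partly lowercase.
data = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2],
    "radio": [37.8, 39.3, 45.9],
    "newspaper": [69.2, 45.1, 69.3],
    "sales": [22.1, 10.4, 9.3],
})

# Names the plotting code expects.
wanted = ["TV", "Radio", "Newspaper", "Sales"]
missing = [col for col in wanted if col not in data.columns]
print(missing)

# Rename case-insensitively (and ignoring stray whitespace) to the expected names.
lookup = {col.lower(): col for col in wanted}
data = data.rename(columns=lambda col: lookup.get(col.strip().lower(), col))
print(list(data.columns))


def plot_pairs(df):
    # seaborn is imported here so the column check above runs without it.
    import seaborn as sns
    return sns.pairplot(df, x_vars=["TV", "Radio", "Newspaper"], y_vars=["Sales"], kind="reg")
```

Printing `data.columns` right after loading the file is the quickest way to see which names are actually present.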

• Data School
November 16, 2020

dude you're one of the best

• Data School
November 16, 2020

Impressive teacher!

• Data School
November 16, 2020

thank you! very clear and helpful

• Data School
November 16, 2020

I am getting a parser error for reading the csv file from the website. (3:00)
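
A parser error at this step often means the URL no longer returns CSV text (for example, it may return an HTML page instead), which would be consistent with other comments here reporting that the link is no longer accessible. As a sketch, assuming you have obtained a copy of the file some other way, the same `read_csv` call works on a local path or, as below, on in-memory text. The rows shown are only illustrative of the layout.

```python
import io

import pandas as pd

# Stand-in for a locally saved copy of the file; rows are for illustration only.
csv_text = """,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# index_col=0 uses the unnamed first column as the row index.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(data.head())
print(data.shape)
```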

• Data School
November 16, 2020

Really appreciate that you also explain the algorithms and how to find the coefficient governing the equations. Thank you so much!

• Data School
November 16, 2020

Is the url still valid?

• Data School
November 16, 2020

I don't see that http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv is accessible now. Is there any alternative for that link? Thanks.

• Data School
November 16, 2020

Thank you so much for putting together this amazing series. I have a quick question, though: we get an RMSE of ~1.4, and you say that it's good "given that the Sales range from 5 to 25". Could you elaborate a bit on this, please? Thanks once again!
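
One way to read that remark, using only the numbers quoted in the comment: RMSE is expressed in the same units as Sales, so it can be compared directly with the spread of the Sales values. The sketch below expresses the error as a fraction of the quoted range; whether that fraction counts as "good" still depends on the application.

```python
rmse = 1.4                      # approximate RMSE quoted in the comment
sales_min, sales_max = 5, 25    # approximate range of Sales quoted in the comment

# Typical prediction error as a fraction of the range of the response.
relative_error = rmse / (sales_max - sales_min)
print(f"RMSE is about {relative_error:.0%} of the range of Sales")
```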

• Data School
November 16, 2020

I am answering your question 5 years later, but I would love to see more video tutorials from you about scikit-learn (e.g. neural network models (supervised)) or scikit-multilearn if you want!! 🙂 Thanks a lot, Kevin!

• Data School
November 16, 2020

I have changed the code as follows, but the problem still exists:

model = LogisticRegression(solver='lbfgs', multi_class='auto')

/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:947: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
"of iterations.", ConvergenceWarning)

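The warning quoted above says the lbfgs solver stopped before converging. Two commonly suggested remedies are raising `max_iter` and standardizing the features so the optimizer converges in fewer steps. The sketch below combines both; the iris dataset is an assumed stand-in, since the comment does not say which data was used, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption; the original data is not stated).
X, y = load_iris(return_X_y=True)

# Scale features, then fit with a higher iteration limit than the default.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
n_iter = model.named_steps["logisticregression"].n_iter_
print(n_iter, model.score(X, y))
```
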
• Data School
November 16, 2020

Thanks a lot. Shift+Tab is no longer available; what else could I do?

• Data School
November 16, 2020

I hate the limited functionality scikit-learn provides – if you do linear regression you most certainly want to (and should) look at confidence intervals – why aren’t they implemented!?
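
scikit-learn's `LinearRegression` does not report confidence intervals; the statsmodels library is commonly used when they are needed. As a sketch of a workaround, the intervals can be computed from a fitted model with the standard ordinary least squares formulas, assuming the usual OLS conditions (independent errors with constant variance). The synthetic data and true coefficients below are made up for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
params = np.append(model.intercept_, model.coef_)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones(len(X)), X])
residuals = y - model.predict(X)
dof = len(y) - design.shape[1]
sigma2 = residuals @ residuals / dof   # unbiased estimate of the error variance
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

# Two-sided 95% intervals from the t distribution.
t_crit = stats.t.ppf(0.975, dof)
lower = params - t_crit * std_err
upper = params + t_crit * std_err
for name, low, high in zip(["intercept", "x0", "x1"], lower, upper):
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```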

• Data School
November 16, 2020

Thanks a lot!!

• Data School
November 16, 2020

Great video. That said, I'm still having some trouble understanding the 0.046 result for TV; I guess I'm having trouble seeing how useful the result is. What would be an actionable insight based on the result? So if a company spent, say, 5k each on TV, newspaper, and radio ads, can they expect a 233 increase in sales numbers?
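
A worked version of the interpretation asked about above, under stated assumptions: a linear regression coefficient is the predicted change in the response for a one-unit increase in that feature, holding the other features fixed. The sketch assumes TV spend is measured in thousands of dollars and Sales in thousands of items. Only the TV coefficient is given in the comment, so Radio and Newspaper are left out.

```python
tv_coef = 0.046        # TV coefficient quoted in the comment
extra_tv_spend = 5     # assumed: an extra $5,000 on TV, in thousands of dollars

# Predicted change in Sales with Radio and Newspaper spend held fixed.
delta_sales = tv_coef * extra_tv_spend
print(f"{delta_sales:.2f} thousand items, i.e. about {delta_sales * 1000:.0f} items")
```

The combined effect of spending on all three channels would be the sum of each coefficient times its own change in spend.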

• Data School
November 16, 2020

To be candid, this is the best video I've ever watched on scikit-learn. Thumbs up!!!

• Data School
November 16, 2020

A pretty complete one, putting it all together at once! The best I have watched so far!

• Data School
November 16, 2020

Thank you very much.
Your teaching methodology is awesome, making things crystal clear.

• Data School
November 16, 2020

Can I use a heatmap to see the relationships?
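
A heatmap of the correlation matrix is one common way to do this, sketched below on synthetic stand-in data (the column names and values are assumptions). Correlation only captures linear, pairwise relationships, so it complements rather than replaces scatter plots. `plot_corr` is a hypothetical helper wrapping `seaborn.heatmap`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (all values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 100),
    "Radio": rng.uniform(0, 50, 100),
})
data["Sales"] = 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 100)

# Pairwise Pearson correlations between all numeric columns.
corr = data.corr()
print(corr.round(2))


def plot_corr(corr_matrix):
    # Imported lazily so the correlation table above works without seaborn.
    import seaborn as sns
    return sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```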

• Data School
November 16, 2020

Your video tutorial is outstanding! You can simplify complex concepts in an elegant manner. And unlike other instructors you don't show-off on how smart you are. That's why we know that you're really a smart guy 🙂

• Data School
November 16, 2020

Hi, Kevin! Thank you for your videos about pandas and scikit-learn. They have helped me learn data science very well. But at this point I have an issue accessing the CSV file from this link: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv. I think the URL has already expired and there is no CSV file there. Could you please re-upload the file at bit.ly or something? Thank you, and hopefully you read this comment.

• Data School
November 16, 2020

Thanks a lot for this great material you've put together. Very very helpful!

• Data School
November 16, 2020

Thanks for the wonderful video. I have one question: we can predict for a test dataset using the train/test approach and the predict function. How do I predict test data using cross_val_score / cross_val_predict? Here both X and y are needed, but the test data does not have y; it has only X.
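
A sketch of the usual workflow for the situation described above: cross-validation is used to estimate how well a model generalizes, using only labeled data, and predictions for unlabeled data come from a model refit on all of the labeled data. The synthetic arrays below are assumptions for illustration; `X_new` plays the role of test data without `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 2))      # labeled data
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120)
X_new = rng.uniform(0, 10, size=(5, 2))    # unlabeled data: no y available

model = LinearRegression()

# Step 1: estimate out-of-sample error using only the labeled data.
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
rmse_estimate = np.sqrt(-scores).mean()

# Step 2: refit on all labeled data, then predict for the unlabeled rows.
model.fit(X, y)
y_new_pred = model.predict(X_new)
print(rmse_estimate, y_new_pred)
```

`cross_val_predict` likewise needs `y`, because it returns out-of-fold predictions for the labeled rows themselves.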

• Data School
November 16, 2020

the best tutorial on watsapp

• Data School
November 16, 2020

Your teaching methodology is the best. Your step-by-step teaching method is very helpful for my understanding. You are the best.

• Data School
November 16, 2020

…also, can you tell us more about 'random_state=1' parameter you used to split the data into test and train. Thanks a lot!
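
A short sketch of what `random_state` does in `train_test_split`: it seeds the shuffling, so the same value reproduces the same split on every run. The value 1 has no special meaning; any fixed integer works. The tiny arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state: the shuffle, and therefore the split, is identical.
_, X_test_a, _, y_test_a = train_test_split(X, y, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, random_state=1)
print(np.array_equal(y_test_a, y_test_b))

# No random_state: the split is likely to differ from call to call.
_, _, _, y_test_c = train_test_split(X, y)
print(y_test_c)
```

Fixing the seed makes results reproducible, which matters when comparing models on the same split.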

• Data School
November 16, 2020

Thank you for the great tutorials, Kevin! I have a problem importing the data: I tried the URL mentioned in the video as well as the file cited in your GitHub. Could you please help with that? Many thanks, Sarnai

• Data School
November 16, 2020

Sir, your videos are very good and help me a lot. Thanks a lot for such wonderful lectures.
