Data Science Best Practices with pandas (PyCon 2019)





The pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the pandas library make it challenging to discover the best way to accomplish any given task.

In this tutorial, you’ll use pandas to answer questions about a real-world dataset. Through each exercise, you’ll learn important data science skills as well as “best practices” for using pandas. By the end of the tutorial, you’ll be more fluent at using pandas to correctly and efficiently answer your own data science questions.

EXERCISES:
05:14 1. Introduction to the TED Talks dataset
10:45 2. Which talks provoke the most online discussion?
18:58 3. Visualize the distribution of comments
34:20 4. Plot the number of talks that took place each year
50:30 5. What were the “best” events in TED history to attend?
1:01:28 6. Unpack the ratings data
1:13:36 7. Count the total number of ratings received by each talk
1:22:55 8. Which occupations deliver the funniest TED talks on average?

DOWNLOAD the dataset and Jupyter notebook:
https://github.com/justmarkham/pycon-2019-tutorial

WATCH my introductory series, Data Analysis with pandas:
https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y

JOIN the “Data School Insiders” community:
https://www.patreon.com/dataschool

LET’S CONNECT!
– Email Newsletter: https://www.dataschool.io/subscribe/
– LinkedIn: https://www.linkedin.com/in/justmarkham/
– Twitter: https://twitter.com/justmarkham
– Facebook: https://www.facebook.com/DataScienceSchool/
– YouTube: https://www.youtube.com/dataschool?sub_confirmation=1


Comment List

  • Data School
    November 26, 2020

    Want to skip the introduction and get right to the code? Start watching here: 5:14

  • Data School
    November 26, 2020

    Excellent lesson again. How do I plot the "talks per year" as bars instead of a line plot? Maybe even with the line following the tops of the bars.
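
One possible way to get bars with a line tracing their tops, sketched with matplotlib. The six dates are made-up stand-ins for the real dataset; only the `ted` and `film_datetime` names come from the comments on this page, and the colors and labels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the tutorial's DataFrame: one row per talk
ted = pd.DataFrame({"film_datetime": pd.to_datetime(
    ["2015-03-01", "2015-06-10", "2016-02-20",
     "2016-03-05", "2016-11-11", "2017-04-24"]
)})

# Count talks per year, then order by year instead of by count
talks_per_year = ted["film_datetime"].dt.year.value_counts().sort_index()

# String labels give the bars and the line the same categorical x-axis
years = [str(year) for year in talks_per_year.index]
counts = talks_per_year.tolist()

fig, ax = plt.subplots()
ax.bar(years, counts, color="lightsteelblue")   # the bars
ax.plot(years, counts, color="black", marker="o")  # line over the bar tops
ax.set_xlabel("year")
ax.set_ylabel("number of talks")
```

In a notebook the figure displays on its own; in a script, `fig.savefig(...)` or `plt.show()` would be needed.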

  • Data School
    November 26, 2020

    thanks for ur work

  • Data School
    November 26, 2020

    At 46:06, why do you say the data is incomplete?
    I tried to check the number of rows per year with the code below:

    ted[ted.film_datetime.dt.year == 2015].shape

    ted[ted.film_datetime.dt.year == 2016].shape

    ted[ted.film_datetime.dt.year == 2017].shape

    For 2015 and 2016 there are around 200 rows each, but for 2017 there are only around 100.
    Does that mean the sharp decline in the plot isn't a real decline?

    Thank you for uploading this video :))
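
A hedged sketch of one way to check whether a year is only partly covered: count the rows per year in a single call, then look at the latest filming date within each year. The dates below are invented for illustration; on the real data, a final year whose latest date falls well before December would suggest the collection simply stopped partway through that year.

```python
import pandas as pd

# Invented dates standing in for the real film_datetime column
ted = pd.DataFrame({"film_datetime": pd.to_datetime(
    ["2015-02-01", "2015-11-20", "2016-03-03",
     "2016-12-01", "2017-02-14", "2017-08-27"]
)})

years = ted["film_datetime"].dt.year

# Rows per year in one call instead of one filter per year
rows_per_year = years.value_counts().sort_index()

# Latest filming date seen in each year
last_date_per_year = ted.groupby(years)["film_datetime"].max()

print(rows_per_year)
print(last_date_per_year)
```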

  • Data School
    November 26, 2020

    Thanks Kevin, I'm learning a lot from your videos 😀
    Hope u have a great day!

  • Data School
    November 26, 2020

    Thanks Kevin for the classes, very well done and helpful, as all of your videos 😊😎. Cheers for that!!

  • Data School
    November 26, 2020

    Hi, thanks for the ast.literal_eval trick. I am not in data but in architecture, learning on my own with free Python material. I was stuck on a string problem and my brain just told me, "You have seen something about a string turned into the correct data type, go look for that video with the blue frame!"
    Thanks 😉
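
For readers who have not seen the trick, a minimal sketch: `ast.literal_eval` turns a string that looks like a Python literal into the actual object. The two strings below are made up and only imitate a stringified list of dictionaries; the key names are assumptions for the example.

```python
import ast
import pandas as pd

# Made-up strings that merely look like lists of dictionaries
ratings = pd.Series([
    "[{'name': 'Funny', 'count': 10}, {'name': 'Beautiful', 'count': 4}]",
    "[{'name': 'Funny', 'count': 2}]",
])

# literal_eval safely parses Python literals (no arbitrary code execution)
parsed = ratings.apply(ast.literal_eval)

# Once parsed, ordinary list and dict operations work on each element
total_counts = parsed.apply(lambda items: sum(d["count"] for d in items))

print(type(ratings[0]), type(parsed[0]))
print(total_counts.tolist())
```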

  • Data School
    November 26, 2020

    Why are we ignoring the occupations that have a count of less than 5? How do they affect the result?
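
A small illustration of why a minimum count can matter, using invented numbers: an occupation with a single talk can top a ranking by average on the strength of one data point. The column names `speaker_occupation` and `funny_rate` are assumptions for this sketch.

```python
import pandas as pd

# Invented data: five talks by comedians, a single talk by a juggler
talks = pd.DataFrame({
    "speaker_occupation": ["Comedian"] * 5 + ["Juggler"],
    "funny_rate": [0.5, 0.6, 0.4, 0.5, 0.5, 0.9],
})

# Count and mean per occupation, side by side
stats = talks.groupby("speaker_occupation")["funny_rate"].agg(["count", "mean"])

# Without a threshold, the one-talk occupation ranks first
ranked_all = stats.sort_values("mean", ascending=False)

# Keeping only occupations with at least 5 talks removes that single-sample mean
reliable = stats[stats["count"] >= 5].sort_values("mean", ascending=False)

print(ranked_all)
print(reliable)
```

The threshold of 5 is itself a judgment call; a different cutoff would be equally defensible.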

  • Data School
    November 26, 2020

    ted.loc[ ted[ 'views_per_comment' ].argmin() , : ] to get the min row. Hope it's useful.
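
One caveat worth sketching: `argmin()` returns an integer position while `.loc` looks up index labels, so the two only line up when the index is the default 0, 1, 2, and so on. Pairing `idxmin()` with `.loc`, or `argmin()` with `.iloc`, works for any index. The values and the non-default index below are made up.

```python
import pandas as pd

# Made-up values with a non-default index to expose the difference
ted = pd.DataFrame(
    {"views_per_comment": [300.0, 120.0, 450.0]},
    index=[10, 20, 30],
)

pos = ted["views_per_comment"].argmin()    # integer position of the minimum
label = ted["views_per_comment"].idxmin()  # index label of the minimum

row_by_label = ted.loc[label]  # label-based lookup
row_by_pos = ted.iloc[pos]     # position-based lookup

print(pos, label)
print(row_by_label)
```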

  • Data School
    November 26, 2020

    It was a great content for beginners like me. Thanks for sharing.

  • Data School
    November 26, 2020

    Hi, can anyone help? I am trying to write a loop that copies and classifies my images according to the category and image index number from a CSV file, but the loop keeps running on the same spot. This is the code. I tried glob, index, os.walk, iterrows, itertuples, everything. Can someone please tell me what I'm doing wrong? Please help, I am getting so frustrated and annoyed at this.

    import glob
    import shutil
    import pandas as pd
    import numpy as np
    import os

    df=pd.read_csv('Data_entry_2017.csv')
    #df.head(12)

    #a
    a=(df.iloc[0,0])

    i=1
    #b
    b=(df.iloc[0,i])

    #Cardiomegaly
    #print(type(new_str2))

    src = (r'L:/xrayChestImages/images_001/images/') #00000001_000
    dest = (r"L:/chest_xray/")

    for ind in df.index:
        if b == 'Cardiomegaly':
            new_str=str(a)
            new_str2=str(b)
            dest2=os.path.join(dest, new_str)
            src2=os.path.join(src, new_str)
            print(new_str)
            print (src2)
            print (dest2)
            shutil.copy2(src2, dest2) #file #dest_dir3 1st row
        i=i+1
        b=(df.iloc[0,i])
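
A likely reason the loop above stays in one spot: `a` and `b` are read from row 0 before the loop starts, the loop variable `ind` is never used, and `i` moves `df.iloc[0,i]` across the columns of row 0 rather than down the rows. The sketch below is one possible rewrite, not a tested fix for the real files. It assumes the first column holds the image file name and the second holds the category, mirroring the `iloc` calls above; the function name and the demo files are hypothetical.

```python
import os
import shutil
import tempfile

import pandas as pd


def copy_images_with_label(df, src_dir, dest_dir, label="Cardiomegaly"):
    """Copy each file whose row has `label` in the second column.

    Assumes column 0 is the image file name and column 1 the category.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    # itertuples yields a new row on every iteration, so name and
    # category change each time through the loop
    for row in df.itertuples(index=False):
        name, category = str(row[0]), str(row[1])
        if category == label:
            src_path = os.path.join(src_dir, name)
            if os.path.exists(src_path):  # skip images kept in another folder
                shutil.copy2(src_path, os.path.join(dest_dir, name))
                copied.append(name)
    return copied


# Tiny demo with throwaway files instead of the real image folders
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "images")
    dest = os.path.join(tmp, "cardiomegaly")
    os.makedirs(src)
    demo = pd.DataFrame({
        "image": ["a.png", "b.png", "c.png"],
        "label": ["Cardiomegaly", "No Finding", "Cardiomegaly"],
    })
    for name in demo["image"]:
        open(os.path.join(src, name), "w").close()
    copied = copy_images_with_label(demo, src, dest)
    copied_files = sorted(os.listdir(dest))

print(copied)
```

If a row can carry several categories in one field, the equality test would need to become a substring or split-based check.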

  • Data School
    November 26, 2020

    Hi Kevin, how are you? After answering Question 4, I tried the bonus exercise myself: calculate the average delay between filming and publishing. I found 10 observations where days_between_filming_publishing is negative, which does not make sense. I have a feeling that for 7 of those 10, a typo is the likely cause of the published_date being well ahead of the filming_date. A publishing date cannot come before the filming date, let alone 335 days before it. My question is: how can I replace those 7 observations? Or should I simply filter them out of the dataset? Please kindly advise, thank you. Angela
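
One cautious option, sketched on invented dates: leave the original rows untouched, inspect the negative delays separately, and exclude them only from the average. The column name `published_datetime` is an assumption for this example.

```python
import pandas as pd

# Invented dates; the second row is "published" before it was filmed
ted = pd.DataFrame({
    "film_datetime": pd.to_datetime(["2016-01-10", "2016-05-01", "2017-03-15"]),
    "published_datetime": pd.to_datetime(["2016-03-10", "2015-06-01", "2017-04-14"]),
})

delay = ted["published_datetime"] - ted["film_datetime"]

# Rows worth inspecting by hand before deciding what to do with them
suspicious = ted[delay < pd.Timedelta(0)]

# Average computed only over delays that are not negative
valid_delay = delay[delay >= pd.Timedelta(0)]
average_delay = valid_delay.mean()

print(suspicious)
print(average_delay)
```

Replacing the suspicious dates would require knowing the true values, so filtering (and reporting how many rows were dropped) is usually the safer default.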

  • Data School
    November 26, 2020

    Thank you so much Kevin!! Your tutorials and videos really save me.

  • Data School
    November 26, 2020

    I want to better more efficient pandas code. How do I go to there…lol. Sorry, couldn't resist. I'm sure this is a great video and the author is very knowledgeable.

  • Data School
    November 26, 2020

    Do we have any function similar to "complete" in R?
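
Assuming the question refers to `complete()` from R's tidyr package, which adds rows for missing combinations of values, pandas has no function of that name. A common way to get a similar result is to reindex against the full product of the key columns. The data and column names below are invented.

```python
import pandas as pd

# Invented data: the (2016, "B") combination is missing
talks = pd.DataFrame({
    "year": [2015, 2015, 2016],
    "event": ["A", "B", "A"],
    "talks": [3, 2, 4],
})

# Every combination of the observed years and events
full_index = pd.MultiIndex.from_product(
    [talks["year"].unique(), talks["event"].unique()],
    names=["year", "event"],
)

# Reindexing adds the missing combination with NaN in the value column
completed = (
    talks.set_index(["year", "event"])
    .reindex(full_index)
    .reset_index()
)

print(completed)
```

Passing `fill_value=0` to `reindex` would fill the new rows with zeros instead of NaN.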

  • Data School
    November 26, 2020

    Such an amazing tutorial. Thank you!

  • Data School
    November 26, 2020

    Perfffffffffffffffectttt!!!!

  • Data School
    November 26, 2020

    I learned more from you than in 4 years at NYU

  • Data School
    November 26, 2020

    Kevin is our supreme Leader!

  • Data School
    November 26, 2020

    In the "Unpack the ratings data" section, you wrote a function but did not use it to unpack the ratings series; you used a lambda function instead. Is there a reason for that? Thank you for your excellent videos. Corey Schafer (another born teacher like you) recommended your pandas videos and you did not disappoint.
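
As a general note on pandas rather than a statement about the video: a named function, a lambda wrapping the same call, and the function passed directly to `apply` all give the same result, so the choice is mostly about readability. The helper name `str_to_list` and the sample strings are hypothetical.

```python
import ast
import pandas as pd

# Made-up stringified lists
ratings = pd.Series(["[{'name': 'Funny', 'count': 3}]", "[]"])


def str_to_list(text):
    """Hypothetical helper: parse a string into the Python object it spells out."""
    return ast.literal_eval(text)


via_function = ratings.apply(str_to_list)                        # named function
via_lambda = ratings.apply(lambda text: ast.literal_eval(text))  # lambda
via_direct = ratings.apply(ast.literal_eval)                     # function passed directly

print(via_function.tolist() == via_lambda.tolist() == via_direct.tolist())
```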

  • Data School
    November 26, 2020

    Awesomeeeeeeeeeeeee

  • Data School
    November 26, 2020

    50:30
    5. What were the best events in TED history to attend?

  • Data School
    November 26, 2020

    45:15
    .sort_index()
