Python Pandas Tutorial 5: Handle Missing Data: fillna, dropna, interpolate





In this tutorial we’ll learn how to handle missing data in pandas using the fillna, interpolate and dropna methods. You can fill missing values with a single value, with a dictionary that gives a different value for each column, or with one of the interpolation methods.
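
As a rough illustration of those options, here is a minimal sketch. The column names and numbers are made up for this example and are not taken from the tutorial's CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical weather data with gaps (illustrative values only)
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, 28.0, np.nan, 40.0],
        "windspeed": [6.0, 9.0, np.nan, 7.0, 8.0],
        "event": ["Rain", "Sunny", None, "Snow", None],
    }
)

# Fill with a different value for each column by passing a dictionary
filled = df.fillna({"temperature": 0, "windspeed": 0, "event": "no event"})

# Estimate missing numbers from their neighbours (linear by default);
# only the numeric columns are selected, since text cannot be interpolated
interp = df[["temperature", "windspeed"]].interpolate()

print(filled)
print(interp)
```

Filling with a constant such as 0 is simple but can distort the data, which is why the interpolation route is often a better guess for numeric measurements.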

Topics that are covered in this Python Pandas Video:
0:00 Introduction
2:30 Convert string column into the date type
3:15 Use date as the index of the dataframe using the set_index() method
4:10 Use fillna() method in dataframe
7:35 Use fillna(method="ffill") method in dataframe
8:57 Use fillna(method="bfill") method in dataframe
9:56 “axis” parameter in fillna() method in dataframe
11:18 “limit” parameter in fillna() method in dataframe
13:46 interpolate() to do interpolation in dataframe
15:34 interpolate() with method="time"
16:50 dropna() method: drop all the rows that have “na” values in the dataframe
17:50 “how” parameter in dropna() method
18:33 “thresh” parameter in dropna() method
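
The topics above can be sketched roughly as follows. This is only an illustration under assumptions: the dates and values are invented, and it uses df.ffill() and df.bfill(), which do the same job as the fillna(method="ffill") and fillna(method="bfill") calls named in the list, because recent pandas releases deprecate the method argument of fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical data with unevenly spaced dates as the index
dates = pd.to_datetime(
    ["2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-10"]
)
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, 28.0, np.nan, 40.0],
        "windspeed": [6.0, np.nan, np.nan, 7.0, 8.0],
    },
    index=dates,
)

ffilled = df.ffill()               # carry the previous row's value forward
bfilled = df.bfill()               # pull the next row's value backward
across = df.ffill(axis="columns")  # fill from the column to the left instead
limited = df.ffill(limit=1)        # fill at most one consecutive gap

# Weight the estimate by the time elapsed between index dates
by_time = df.interpolate(method="time")

complete_rows = df.dropna()         # drop rows with any missing value
not_empty = df.dropna(how="all")    # drop rows only if every value is missing
at_least_one = df.dropna(thresh=1)  # keep rows with at least 1 valid value

print(by_time)
```

With method="time", the gap on 2017-01-04 is filled with a value closer to the 2017-01-05 reading than to the 2017-01-01 one, because it is only one day away from the former. Note that the parameter is spelled thresh, not tresh or threshold.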

Code link: https://github.com/codebasics/py/tree/master/pandas/5_handling_missing_data_fillna_dropna_interpolate

Popular Playlist:
Complete python course: https://www.youtube.com/playlist?list=PLeo1K3hjS3uv5U-Lmlnucd7gqF-3ehIh0

Data science course: https://www.youtube.com/playlist?list=PLeo1K3hjS3us_ELKYSj_Fth2tIEkdKXvV

Machine learning tutorials: https://www.youtube.com/playlist?list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw

Pandas tutorials: https://www.youtube.com/playlist?list=PLeo1K3hjS3uuASpe-1LjfG5f14Bnozjwy

Git github tutorials: https://www.youtube.com/playlist?list=PLeo1K3hjS3usJuxZZUBdjAcilgfQHkRzW

Matplotlib course: https://www.youtube.com/playlist?list=PLeo1K3hjS3uu4Lr8_kro2AqaO6CFYgKOl

Data structures course: https://www.youtube.com/playlist?list=PLeo1K3hjS3uu_n_a__MI_KktGTLYopZ12

Data Science Project – Real Estate Price Prediction: https://www.youtube.com/watch?v=rdfbcdP75KI&list=PLeo1K3hjS3uu7clOTtwsp94PcHbzqpAdg

To download the csv files and code for all tutorials: go to https://github.com/codebasics/py, click on the green button to clone or download the entire repository, and then go to the relevant folder to get access to that specific file.

Website: http://codebasicshub.com/
Facebook: https://www.facebook.com/codebasicshub
Twitter: https://twitter.com/codebasicshub


Comment List

  • codebasics
    December 8, 2020

    Step by step roadmap to learn data science in 6 months: https://www.youtube.com/watch?v=H4YcqULY1-Q

  • codebasics
    December 8, 2020

    the threshold parameter is so useful! never heard about this before! thank you!

  • codebasics
    December 8, 2020

    This is a very useful video! Thank you🙂

  • codebasics
    December 8, 2020

    hi,
    i face this type of error. kindly suggest me to handle

    df.set_index('Date', inplace=True)

    df
    KeyError: "None of ['Date'] are in the columns"

  • codebasics
    December 8, 2020

    Thank you for step by step explanation. Good job!

  • codebasics
    December 8, 2020

    How can we implement predictive mean matching ?

  • codebasics
    December 8, 2020

    I think your playlist has more rich content than paid courses

  • codebasics
    December 8, 2020

    This is the best rich in content best for free you are a man of god

  • codebasics
    December 8, 2020

    Hi, thank you for the tutorial. I am stuck on a problem, i have a data set with several fields and one column named 'rating'. This column is empty there is no data in it, i have over 2000 records in that dataset and i need to fill the column 'rating' with random numbers 1,2,3,4,5. How do i do that? Many thanks

  • codebasics
    December 8, 2020

    Thank you so much! So helpful!

  • codebasics
    December 8, 2020

    Will the time method not work in interpolation if the index is not timestamp

  • codebasics
    December 8, 2020

    import pandas as pd

    import numpy as np

    file=pd.read_csv('titanic.csv')

    x=file.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked','Survived'],axis='columns')

    y=file[['Survived']]

    x=x.interpolate()

    y=y.interpolate()

    from sklearn.preprocessing import LabelEncoder

    le_Sex=LabelEncoder()

    x['Sex_n']=le_Sex.fit_transform(x['Sex'])

    x1=x.drop('Sex',axis='columns')

    x1=x1.ffill()  # assign the result back; the fill returns a new DataFrame

    from sklearn.model_selection import train_test_split

    x1_train,x1_test,y_train,y_test=train_test_split(x1,y,test_size=0.3)

    from sklearn.tree import DecisionTreeClassifier

    titanic=DecisionTreeClassifier()

    titanic.fit(x1_train,y_train)

    y_pred=titanic.predict(x1_test)

    #print(y_test.head())

    #print(y_pred[0:5])

    #pclass,sex,age,fare

    #1-male

    #0-female

    z=titanic.predict([[1,0,59,60]])

    print(z)

    z1=titanic.score(x1_test,y_test)

    print(z1)

    #score – 79

  • codebasics
    December 8, 2020

    Interpolate will definitely boost my kaggle score! Thanks so Much!

  • codebasics
    December 8, 2020

    what about interpolation's effect on categorical column , still it is null….?

  • codebasics
    December 8, 2020

    thank you sir

  • codebasics
    December 8, 2020

    can I use 'parse_date' after creating my dataframe as you used it during reading the data . I want to parse_date after creating my df object . can I do it?

  • codebasics
    December 8, 2020

    When I use: new_df = df.dropna(tresh=1) I get: dropna() got an unexpected keyword argument 'tresh'. Anyone else struggling with the same issue?

  • codebasics
    December 8, 2020

    As always very well and understandably explained! Many thanks for that!

  • codebasics
    December 8, 2020

    Didn't use subset attribute of dropna

  • codebasics
    December 8, 2020

    Hello sir amazing video. I need help plz can we discuss?? Plz

  • codebasics
    December 8, 2020

    completeness of the material is commendable. keep it up thanks a lot 😀

  • codebasics
    December 8, 2020

    Assuming from your data that you have all the events, how can you fill in the temperature based on the event, eg if the event was "sunny" fill in 32. etc ?

  • codebasics
    December 8, 2020

    Nice.. particular that interpolation method.. i noticed tho if the missing data is the first data, it wont fill in.. can we somehow "back-interpolate" it?

  • codebasics
    December 8, 2020

    why interpolation is not done for event column? how to make interpolation for categorical vales?

  • codebasics
    December 8, 2020

    Sir i am working in speech to text with rnn project…my dataset contains speech and corresponding texts. It is basically bengali text. Only in 9 pair of data,corresponding bengali text is showing in csv file.but if i read it with pandas dataframe, shows NaN.how can i handle this missing value problem….will i skip this 9 data?

  • codebasics
    December 8, 2020

    BTW – i have also subscribed. Thank you once again.

    Wow. Thank you for uploading series on pandas. Currently going through each and every video and it seems to be a better video.
    Could you please help me to understand below scenario –
    16:45 – Lets assume, we have two dates…Eg. Invoice Pay date, Invoice rec date..is it possible to specify particular date for guessing using interpolate ?

  • codebasics
    December 8, 2020

    Very well explained sir. I appreciate that you suggested those little tricks rather than just sticking to the concept.

  • codebasics
    December 8, 2020

    ffill and bfill is still giving nan values if the very first and very last value of the dataframe has nan as original values. how to fix this?

  • codebasics
    December 8, 2020

    brilliant!

  • codebasics
    December 8, 2020

    Excellent work!!

  • codebasics
    December 8, 2020

    What if its not a date?

  • codebasics
    December 8, 2020

    At 20:20 , you passed dt in DatetimeIndex() to make it DateIndex type. But when we will create a date range from pd.date_range it itself is DatetimeIndex type and we can skip the pd.DatetimeIndex function part.

  • codebasics
    December 8, 2020

    brilliant ! thanks

  • codebasics
    December 8, 2020

    Hi,
    Excellent examples and explanation.
    I am facing an issue , after using dictionary with fillna method for replacing 0 values in 'event' column , the df still has only 0s.
    Krish

  • codebasics
    December 8, 2020

    Really a life saver bro!! Thanks a ton!!

  • codebasics
    December 8, 2020

    Thank brader good

  • codebasics
    December 8, 2020

    brother what kind of method do you use to record this video so clear and professional please let me know i am trying to videos for data analysis. Thanks.

  • codebasics
    December 8, 2020

    Jiro?
    Thanks a lot!

  • codebasics
    December 8, 2020

    what if the first value of the column is NAN, then how does the forward fill works?

  • codebasics
    December 8, 2020

    Well explanation than most paid courses. Thanks a lot.

  • codebasics
    December 8, 2020

    the perfect tutorial thanks a lot

  • codebasics
    December 8, 2020

    Thalaiva you are great

  • codebasics
    December 8, 2020

    you are noble soul !!

  • codebasics
    December 8, 2020

    How to learn coding for beginners | Learn coding for free: https://www.youtube.com/watch?v=CptrlyD0LJ8

  • codebasics
    December 8, 2020

    Wow ,learned a lot to handle datasets. Thank you Sir

  • codebasics
    December 8, 2020

    16:10 – Interpolate with method='time' is indeed quite powerful.
