How do I find and remove duplicate rows in pandas?


During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I’ll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.

SUBSCRIBE to learn data science with Python:
https://www.youtube.com/dataschool?sub_confirmation=1

JOIN the “Data School Insiders” community and receive exclusive rewards:
https://www.patreon.com/dataschool

== RESOURCES ==
GitHub repository for the series: https://github.com/justmarkham/pandas-videos
“duplicated” documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
“drop_duplicates” documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

== LET’S CONNECT! ==
Newsletter: https://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
LinkedIn: https://www.linkedin.com/in/justmarkham/


Comment List

  • Data School
    November 26, 2020

    Great video. This helped me tremendously.
    How would you go about finding duplicates case-insensitively on a certain field?
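
    A hedged sketch of one way to do this, assuming a DataFrame df with a string column 'name': lowercase the field into a throwaway key and run duplicated() on that.

    import pandas as pd

    df = pd.DataFrame({'name': ['Alice', 'alice', 'Bob']})

    # mark rows whose lowercased 'name' has already appeared
    dupe_mask = df['name'].str.lower().duplicated()
    print(df[dupe_mask])    # the second 'alice'

    # or drop them while comparing case-insensitively
    deduped = df[~dupe_mask]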

  • Data School
    November 26, 2020

    Really, your teaching method is very good, and your videos give a lot of knowledge. Thanks, Data School!

  • Data School
    November 26, 2020

    Love you, brother. You are changing so many lives, thank you… the best teacher award goes to Data School.

  • Data School
    November 26, 2020

    Wow! You were already teaching data science in 2014, when it wasn't even popular! By the way, your videos are really good; you speak slowly and clearly, which makes them easy to understand and follow. Kudos to you!

  • Data School
    November 26, 2020

    At the end, are you saying that "age" + "zip code" must TOGETHER be duplicated? Or are you saying "age" duplicates and "zip code" duplicates must each be removed from their respective columns? Thanks
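
    For what it's worth, a small sketch of the subset behavior (assuming a users DataFrame with 'age' and 'zip_code' columns): the pair must repeat together for a row to count as a duplicate.

    import pandas as pd

    users = pd.DataFrame({'age': [25, 25, 30],
                          'zip_code': ['10001', '10001', '10001']})

    # only row 1 is a duplicate: its (age, zip_code) PAIR was seen before
    print(users.duplicated(subset=['age', 'zip_code']))

    # row 2 shares a zip_code but not an age, so it survives
    print(users.drop_duplicates(subset=['age', 'zip_code']))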

  • Data School
    November 26, 2020

    Thanks for the awesome videos on pandas. I was able to automate a few Excel reports at my work, but I'm stuck with something very complex (it's complex for me!). Could you please help with some complex Excel calculations using Python?
    For example, suppose I have data in the format below:
    db_instance  Hostname  Disk_group  disk_path         disk_size  disk_used  header_status
    abc_cr       host1     data01      dev/mapper/asm01  240        90         Member
    abc_cr       host1     data01      dev/mapper/asm02  240        100        Member
    abc_cr       host1     data01      dev/mapper/asm03  240        60         Member
    abc_xy       host1     data01      dev/mapper/asm01  240        90         Member
    abc_xy       host1     data01      dev/mapper/asm02  240        100        Member
    abc_xy       host1     data01      dev/mapper/asm03  240        60         Member
    abc_cr       host1     acfs01      dev/mapper/asm04  90         30         Member
    abc_cr       host1     acfs01      dev/mapper/asm05  90         60         Member
    abc_xy       host1     acfs01      dev/mapper/asm04  90         30         Member
    abc_xy       host1     acfs01      dev/mapper/asm05  90         60         Member
                 host1     unassigned  dev/mapper/asm06  180        0          Candidate
                 host1     unassigned  dev/mapper/asm07  180        0          Former
    res_du       host2     data01      dev/mapper/asm01  240        90         Member
    res_du       host2     data01      dev/mapper/asm02  240        100        Member
    res_du       host2     data01      dev/mapper/asm03  240        60         Member
    res_hg       host2     data01      dev/mapper/asm01  240        90         Member
    res_hg       host2     data01      dev/mapper/asm02  240        100        Member
    res_hg       host2     data01      dev/mapper/asm03  240        60         Member
    res_pq       host2     acfs01      dev/mapper/asm04  90         30         Member
    res_pq       host2     acfs01      dev/mapper/asm05  90         60         Member
    res_mn       host2     acfs01      dev/mapper/asm04  90         30         Member
    res_mn       host2     acfs01      dev/mapper/asm05  90         60         Member
                 host2     unassigned  dev/mapper/asm06  180        0          Candidate
                 host2     unassigned  dev/mapper/asm07  180        0          Former

    As you can see, disk_path is duplicated for each host because of the multiple db_instance values. (Even though you see similar disk_paths for host1 & host2, they are actually different disks on the storage end; admins follow similar naming conventions when configuring disks on the host side, resulting in similar disk_paths across hosts.)
    My questions are how:
    1. to remove duplicate disk_paths for each host? (Considering only the two columns Hostname & disk_path, which is how I remove duplicates in Excel; I am not worried about db_instance.)
    2. once we remove duplicates, to calculate the total size of 'Member' disks, and also the total size of 'Candidate' and 'Former' disks combined?
    3. to add another column 'Percent used', which is the result of 'disk_used'/'disk_size'*100 for each row?

    Thanks in advance!
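
    One possible approach to the three steps, as a hedged sketch (assuming the table above is loaded into a DataFrame df with the column names shown; 'disks.csv' is a hypothetical file):

    import pandas as pd

    df = pd.read_csv('disks.csv')  # hypothetical file holding the table above

    # 1. keep one row per (Hostname, disk_path), ignoring db_instance
    deduped = df.drop_duplicates(subset=['Hostname', 'disk_path'])

    # 2. total size of 'Member' disks, then 'Candidate' + 'Former' combined
    member_total = deduped.loc[deduped['header_status'] == 'Member', 'disk_size'].sum()
    spare_total = deduped.loc[deduped['header_status'].isin(['Candidate', 'Former']), 'disk_size'].sum()

    # 3. per-row percentage used
    deduped = deduped.assign(percent_used=deduped['disk_used'] / deduped['disk_size'] * 100)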

  • Data School
    November 26, 2020

    Live long and prosper!

  • Data School
    November 26, 2020

    How do you remove leading and trailing spaces in a DataFrame?
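
    A minimal sketch of one way, assuming the whitespace lives in string columns: select the object columns and apply Series.str.strip to each.

    import pandas as pd

    df = pd.DataFrame({'city': ['  New York ', 'Boston  '], 'pop': [8, 7]})

    # strip leading/trailing whitespace from every object (string) column
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip()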

  • Data School
    November 26, 2020

    I have watched a lot of your videos, and I must say that the way you explain is really good. Just to let you know, I am new to programming, let alone Python.
    I want to learn a new thing from you. Let me give you a brief: I am working on a dataset to predict app ratings from the Google Play Store. There is an attribute named "Rating" which has a lot of null values. I want to replace those null values using a median based on another attribute named "Reviews". But I want to categorize the attribute "Reviews" into multiple categories, like:
    1st category for reviews less than 100,000,
    2nd category for reviews between 100,001 and 1,000,000,
    3rd category for reviews between 1,000,001 and 5,000,000, and
    4th category for reviews of anything more than 5,000,000.
    Although I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the command below:
    gps['Reviews Group'] = [1 if x <= 1000000 else 2 for x in gps['Reviews']]
    gps is the data set.
    I replaced the null values using the command below:
    gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(lambda x: x.fillna(x.median()))

    Please help me create multiple categories for "Reviews" as mentioned above and replace all the null values in "Rating".
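
    A sketch of one way to get all four categories, assuming gps['Reviews'] is numeric: pd.cut with explicit bin edges replaces the two-way list comprehension, after which the same transform line fills the nulls. (The gps frame below is a small hypothetical stand-in for the real dataset.)

    import pandas as pd

    gps = pd.DataFrame({'Reviews': [50_000, 80_000, 500_000, 700_000, 2_000_000, 9_000_000],
                        'Rating': [4.1, None, 4.0, None, 3.9, 4.5]})

    # four labeled bins: <=100k, 100,001-1M, 1,000,001-5M, >5M
    gps['Reviews Group'] = pd.cut(gps['Reviews'],
                                  bins=[0, 100_000, 1_000_000, 5_000_000, float('inf')],
                                  labels=[1, 2, 3, 4])

    # fill null ratings with the median rating of each reviews group
    gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(
        lambda x: x.fillna(x.median()))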

  • Data School
    November 26, 2020

    lol, just when I thought you wouldn't cover the exact subject I was looking for, there came the bonus! Thanks!

  • Data School
    November 26, 2020

    You are the greatest teacher in the world

  • Data School
    November 26, 2020

    I was able to remove the duplicate data from my CSV file~~~ Thank you.
    However, I suggest you could do more in this video. I think you could show the resulting list after the deletion, such as:
    >> new_data = df.drop_duplicates(keep='first')
    >> new_data.head(24898)
    If you add that, I think this video will be even better~~~

  • Data School
    November 26, 2020

    You're amazing! We need more videos on your channel.

  • Data School
    November 26, 2020

    Very useful videos. Can you please tell me how to find the duplicates of just one specific row?
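
    A small sketch of one reading of this (find every row identical to the row at a given index label): compare the whole frame against that row and keep full matches.

    import pandas as pd

    df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

    # all rows equal to the row labeled 0, across every column
    matches = df[df.eq(df.loc[0]).all(axis=1)]
    print(matches)  # rows 0 and 1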

  • Data School
    November 26, 2020

    Thank you! Love it.

  • Data School
    November 26, 2020

    Yo! You are a superb teacher!

  • Data School
    November 26, 2020

    I get an error when I run users.drop_duplicates(subset=['age','zip_code']).shape: "'bool' object is not callable". I even get the same error when I run users.duplicated().sum().
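
    One possible cause, purely a guess since the full session isn't shown: the method name was shadowed earlier, e.g. by assigning a boolean to users.duplicated, so the name now refers to a bool instead of a method. A sketch of the failure mode and recovery:

    import pandas as pd

    users = pd.DataFrame({'age': [25, 25], 'zip_code': ['10001', '10001']})

    users.duplicated = users.duplicated().any()  # oops: the method is now shadowed by a bool
    # users.duplicated()  # would raise TypeError: 'bool' object is not callable

    del users.duplicated             # remove the shadowing attribute
    print(users.duplicated().sum())  # works again

    Re-reading the data into a fresh DataFrame would also clear it.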

  • Data School
    November 26, 2020

    Awesome videos, Kevin. Thanks a ton for the knowledge share.

  • Data School
    November 26, 2020

    Hi, I am wondering whether you could identify an issue that I am having whilst cleaning a dataset with the help of your tutorials. I will post the commands that I have used below:

    df["is_duplicate"]= df.duplicated() # make a new column with a mark of if row is a duplicate or not

    df.is_duplicate.value_counts()
    -> False 25804
    True 1591

    df.drop_duplicates(keep='first', inplace=True) #attempt to drop all duplicates, other than the first instance

    df.is_duplicate.value_counts() #
    -> False 25804
    True 728

    I am struggling to identify why there are still some duplicates that are marked 'True'?

    Kind regards,
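
    A likely explanation, inferred from the commands shown: the is_duplicate column was added before drop_duplicates, so it takes part in the row comparison. Within a group of identical rows, the first occurrence is flagged False and the rest True; those later occurrences are identical to each other (including the True flag), so drop_duplicates keeps one of them, and that survivor still reads True. A sketch of a fix: compare only the real data columns, then recompute or drop the flag.

    import pandas as pd

    df = pd.DataFrame({'a': [1, 1, 1, 2]})
    df['is_duplicate'] = df.duplicated()

    # ignore the helper flag when deciding what counts as a duplicate
    real_cols = [c for c in df.columns if c != 'is_duplicate']
    df = df.drop_duplicates(subset=real_cols, keep='first')

    # recompute (or simply drop) the flag afterwards
    df['is_duplicate'] = df.duplicated(subset=real_cols)
    print(df)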

  • Data School
    November 26, 2020

    How do you keep rows that contain null values in any column and remove the complete rows?
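
    A minimal sketch of one reading of this (keep only the rows that contain at least one null):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', None]})

    # rows with at least one null survive; complete rows are removed
    incomplete = df[df.isnull().any(axis=1)]
    print(incomplete)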

  • Data School
    November 26, 2020

    great video!!

  • Data School
    November 26, 2020

    💯+ like. Thank you very much sir.

  • Data School
    November 26, 2020

    I always find what I need on your channel… and more… Thank you

  • Data School
    November 26, 2020

    Very much appreciated efforts. Thanks a million for sharing your Python knowledge with us. It has been a wonderful journey with your precise explanations. Keep up the hard work! Warm regards.

  • Data School
    November 26, 2020

    import pandas as pd
    import numpy as np

    # Create a DataFrame
    d = {'Name': ['Alisa','Bobby','jodha','jack','raghu','Cathrine',
                  'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
         'Age': [26,24,23,22,23,24,26,24,22,23,24,24],
         'Score': [85,63,55,74,31,77,85,63,42,62,89,77]}

    df = pd.DataFrame(d, columns=['Name','Age','Score'])
    df
    ============================================================
    When I write the code below:

    df1 = df[df['Score']==85]
    df.drop_duplicates(['Name'])
    df

    Result:
    Alisa is still shown; it only deletes one Alisa, but I want both Alisa rows removed.

    Can you please provide code?
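
    Two things seem to be going on, judging from the code above: df.drop_duplicates(['Name']) returns a new DataFrame rather than modifying df in place, and the default keep='first' always retains one occurrence. Passing keep=False drops every row whose Name repeats. A sketch:

    import pandas as pd

    df = pd.DataFrame({'Name': ['Alisa', 'Bobby', 'Alisa', 'Alex'],
                       'Score': [85, 63, 85, 89]})

    # keep=False removes ALL occurrences of a repeated Name, not just the later ones
    no_repeats = df.drop_duplicates(subset=['Name'], keep=False)
    print(no_repeats)  # only Bobby and Alex remain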

  • Data School
    November 26, 2020

    I'm a beginner, so I can say that the video is very clear and precise, and anyone can understand the contents. Thanks for sharing.

  • Data School
    November 26, 2020

    You are amazing!

  • Data School
    November 26, 2020

    I have some missing dates in my dataset and want to add them to the dataset. I used isnull() to track these dates, but I don't know how to add those dates into my dataset. Can you please help? Thanks
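
    A sketch of one common pattern, assuming the data has a 'date' column that should cover a continuous daily range: put the dates in the index and reindex against a full date_range, which inserts the missing dates as new rows with NaN values.

    import pandas as pd

    df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-04']),
                       'value': [10, 40]})

    # reindex onto the full daily range; missing dates appear as NaN rows
    full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
    df = df.set_index('date').reindex(full_range).rename_axis('date').reset_index()
    print(df)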

  • Data School
    November 26, 2020

    1.75X speed is your friend 🙂

  • Data School
    November 26, 2020

    Thanks so much for this! You helped me combine 629 files and remove 250k duplicate rows!
    You're the man! Subscribed

  • Data School
    November 26, 2020

    Thanks for the video

  • Data School
    November 26, 2020

    You are amazing.
    Thank you ever so much <3

  • Data School
    November 26, 2020

    I couldn't find much on duplicates elsewhere. Thanks so much, sir. I can't thank you enough.

  • Data School
    November 26, 2020

    Thank you for this useful tutorial. Quick question: how do you check whether a value in column A is present in column B, not necessarily on the same row? It's like what the VLOOKUP function does in Excel. Many thanks for your feedback!
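
    A minimal sketch, assuming columns 'A' and 'B' live in the same DataFrame: Series.isin performs exactly this membership test, ignoring row position.

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 5, 1]})

    # True where the value in A appears anywhere in column B
    df['A_in_B'] = df['A'].isin(df['B'])
    print(df)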

  • Data School
    November 26, 2020

    You should have used the sort_values option: users.loc[users.duplicated(keep=False)].sort_values(by='age')

  • Data School
    November 26, 2020

    I really need help, guys.

    I have a table with a column named "Neighbourhood".

    This column has a lot of names repeated many times. To be specific, the column "Neighbourhood" has 10 names that are each repeated a lot of times.

    My question is: I need help creating a separate column specifying how many times each element in "Neighbourhood" occurs.

    If anyone can help me, please do.
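
    A sketch of one way, assuming a DataFrame df with a 'Neighbourhood' column: groupby plus transform('count') broadcasts each group's size back onto its rows.

    import pandas as pd

    df = pd.DataFrame({'Neighbourhood': ['Soho', 'Chelsea', 'Soho', 'Soho']})

    # per-row count of how often that neighbourhood appears in the column
    df['Neighbourhood_count'] = (df.groupby('Neighbourhood')['Neighbourhood']
                                   .transform('count'))
    print(df)

    An equivalent one-liner is df['Neighbourhood'].map(df['Neighbourhood'].value_counts()).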

  • Data School
    November 26, 2020

    Thank you.
