How do I find and remove duplicate rows in pandas?
During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I’ll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.
SUBSCRIBE to learn data science with Python:
https://www.youtube.com/dataschool?sub_confirmation=1
JOIN the “Data School Insiders” community and receive exclusive rewards:
https://www.patreon.com/dataschool
== RESOURCES ==
GitHub repository for the series: https://github.com/justmarkham/pandas-videos
“duplicated” documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
“drop_duplicates” documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
== LET’S CONNECT! ==
Newsletter: https://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
LinkedIn: https://www.linkedin.com/in/justmarkham/
Great video. This helped me tremendously.
How would you go about finding duplicates case-insensitively on a certain field?
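(One possible approach, as a sketch: compare a lowercased copy of the field, assuming a DataFrame df with a string column 'name':)

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'ALICE', 'Bob']})
# mark rows whose lowercased 'name' already appeared in an earlier row
mask = df['name'].str.lower().duplicated()
print(df[mask])  # shows the 'ALICE' row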
Really, your teaching method is very good; your videos give a lot of knowledge. Thanks, Data School!
Love you, brother. You are changing so many lives, thank you …. The best teacher award goes to Data School.
Wow! You were already teaching data science in 2014, when it wasn't even popular! By the way, your videos are really good; you speak slowly and clearly, which makes them easy to understand and follow. Kudos to you!
At the end are you saying that "age" + "zip code" must TOGETHER be duplicates? Or are you saying "age" duplicates and "zip code" duplicates must remove their individual duplicates from their respective columns? Thanks
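(For reference: the columns passed to subset are considered together, so a row is dropped only when the whole (age, zip_code) pair repeats. A minimal sketch:)

import pandas as pd

df = pd.DataFrame({'age': [25, 25, 30], 'zip_code': ['111', '222', '111']})
# ages repeat and zip codes repeat, but no (age, zip_code) pair repeats,
# so nothing is dropped
print(df.drop_duplicates(subset=['age', 'zip_code']))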
Thanks for the awesome videos on pandas. I was able to automate a few Excel reports at my work, but I am stuck with something very complex (it's complex for me!). Could you please help with some complex Excel calculations using Python?
For example, suppose I have data in the format below.
db_instance  Hostname  Disk_group  disk_path         disk_size  disk_used  header_status
abc_cr       host1     data01      dev/mapper/asm01  240        90         Member
abc_cr       host1     data01      dev/mapper/asm02  240        100        Member
abc_cr       host1     data01      dev/mapper/asm03  240        60         Member
abc_xy       host1     data01      dev/mapper/asm01  240        90         Member
abc_xy       host1     data01      dev/mapper/asm02  240        100        Member
abc_xy       host1     data01      dev/mapper/asm03  240        60         Member
abc_cr       host1     acfs01      dev/mapper/asm04  90         30         Member
abc_cr       host1     acfs01      dev/mapper/asm05  90         60         Member
abc_xy       host1     acfs01      dev/mapper/asm04  90         30         Member
abc_xy       host1     acfs01      dev/mapper/asm05  90         60         Member
             host1     unassigned  dev/mapper/asm06  180        0          Candidate
             host1     unassigned  dev/mapper/asm07  180        0          Former
res_du       host2     data01      dev/mapper/asm01  240        90         Member
res_du       host2     data01      dev/mapper/asm02  240        100        Member
res_du       host2     data01      dev/mapper/asm03  240        60         Member
res_hg       host2     data01      dev/mapper/asm01  240        90         Member
res_hg       host2     data01      dev/mapper/asm02  240        100        Member
res_hg       host2     data01      dev/mapper/asm03  240        60         Member
res_pq       host2     acfs01      dev/mapper/asm04  90         30         Member
res_pq       host2     acfs01      dev/mapper/asm05  90         60         Member
res_mn       host2     acfs01      dev/mapper/asm04  90         30         Member
res_mn       host2     acfs01      dev/mapper/asm05  90         60         Member
             host2     unassigned  dev/mapper/asm06  180        0          Candidate
             host2     unassigned  dev/mapper/asm07  180        0          Former
As you can see, disk_path is duplicated for each host because of multiple db_instance values. (Even though you see similar disk_paths for host1 and host2, they are actually different disks on the storage end; admins follow similar naming conventions when configuring disks on the host side, resulting in similar disk_paths across hosts.)
My questions are, how do I (see the sketch after this list):
1. remove duplicate disk_paths for each host? (Considering only the two columns Hostname and disk_path, which is how I remove duplicates in Excel; I am not worried about db_instance.)
2. once the duplicates are removed, calculate the total size of 'Member' disks, and also the total size of 'Candidate' and 'Former' disks combined?
3. add another column 'Percent used', which is the result of 'disk_used'/'disk_size'*100 for each row?
Thanks in advance!
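(A possible sketch for all three questions, assuming the table above is loaded into a DataFrame named df with the column names shown and numeric size columns:)

# 1. remove duplicate disk_paths per host, ignoring db_instance
deduped = df.drop_duplicates(subset=['Hostname', 'disk_path']).copy()

# 2. total size of 'Member' disks, and of 'Candidate'/'Former' disks combined
member_total = deduped.loc[deduped.header_status == 'Member', 'disk_size'].sum()
other_total = deduped.loc[deduped.header_status.isin(['Candidate', 'Former']), 'disk_size'].sum()

# 3. percent used for each row
deduped['Percent used'] = deduped['disk_used'] / deduped['disk_size'] * 100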
Live long and prosper!
How do I remove leading and trailing spaces in a DataFrame?
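(One way, as a sketch: apply str.strip to every string column of an assumed DataFrame df:)

for col in df.select_dtypes(include='object'):
    # strip leading and trailing whitespace from each string column
    df[col] = df[col].str.strip()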
I have watched a lot of your videos, and I must say that the way you explain things is really good. Just to inform you, I am new to programming, let alone Python.
I want to learn a new thing from you. Let me give you a brief. I am working on a dataset to predict app ratings from the Google Play Store. There is an attribute named "Rating" which has a lot of null values. I want to replace those null values using a median based on another attribute named "Reviews". But I want to categorize the attribute "Reviews" into multiple categories:
1st category would be for the reviews less than 100,000,
2nd category would be for the reviews between 100,001 and 1,000,000,
3rd category would be for the reviews between 1,000,001 and 5,000,000 and
4th category would be for the reviews anything more than 5,000,000.
Although I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the command below:
gps['Reviews Group'] = [1 if x <= 1000000 else 2 for x in gps['Reviews']]
gps is the data set.
I replaced the null values using the command below:
gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(lambda x: x.fillna(x.median()))
Please help me create the multiple categories for "Reviews" mentioned above and replace all the null values in "Rating".
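(A possible sketch using pd.cut, which builds all four categories at once; it assumes gps['Reviews'] is numeric, as in the commands above:)

gps['Reviews Group'] = pd.cut(gps['Reviews'],
                              bins=[0, 100000, 1000000, 5000000, float('inf')],
                              labels=[1, 2, 3, 4])
# the groupby/transform line above then fills the null ratings per group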
lol, just when I felt you wouldn't handle the exact subject I was looking for: there came the bonus! Thanks!
You are the greatest teacher in the world
I was able to fix the duplicate data in my CSV file ~~~ thank you.
However, I suggest you could do a bit more in this video: show the resulting data after the deletion. Such as:
>>> new_data = df.drop_duplicates(keep='first')
>>> new_data.head(24898)
If you add that, I think this video will be even more perfect ~~~
you're amazing we need more videos in your channel
Very useful videos. Can you please tell me how to find duplicates of just one specific row?
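(One possible sketch: compare every row against one chosen row, assuming a DataFrame df with no NaN values in the compared columns, since NaN never equals NaN:)

row = df.loc[0]                    # the specific row of interest
mask = (df == row).all(axis=1)     # True where an entire row matches it
print(df[mask])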
Thank you! Love it.
Yo! You are a superb teacher!
I get an error when I run users.drop_duplicates(subset=['age','zip_code']).shape — the error is "'bool' object is not callable". I get the same error even if I run users.duplicated().sum().
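(A guess at the cause, sketched below: that error usually means the method name was shadowed by a plain value assigned earlier in the session; re-creating the DataFrame, e.g. by re-running read_csv, restores the method:)

users.duplicated = True    # hypothetical accidental assignment somewhere earlier
users.duplicated()         # TypeError: 'bool' object is not callable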
Awesome videos, Kevin. Thanks a ton for the knowledge share.
Hi, I am wondering whether you could identify an issue that I am having whilst cleaning a dataset with the help of your tutorials. I will post the commands that I have used below:
df["is_duplicate"]= df.duplicated() # make a new column with a mark of if row is a duplicate or not
df.is_duplicate.value_counts()
-> False 25804
True 1591
df.drop_duplicates(keep='first', inplace=True) #attempt to drop all duplicates, other than the first instance
df.is_duplicate.value_counts()
-> False 25804
True 728
I am struggling to identify why some rows are still marked 'True' as duplicates.
Kind regards,
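(A likely explanation, as a sketch: the is_duplicate column was added before the drop, so the first occurrence in each duplicate group carries False while its repeats carry True; those rows no longer match on every column, and drop_duplicates only removes repeats within the True group. Excluding the helper column avoids this:)

# decide duplicates on the original columns only, ignoring the helper column
cols = [c for c in df.columns if c != 'is_duplicate']
df.drop_duplicates(subset=cols, keep='first', inplace=True)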
How do I keep the rows that contain null values in any column and remove the complete rows?
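(One way, as a sketch, assuming a DataFrame df:)

rows_with_nulls = df[df.isnull().any(axis=1)]   # rows with at least one null
# the complete rows are the complement: df[df.notnull().all(axis=1)]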
great video!!
💯+ like. Thank you very much sir.
I always find what I need in your channel.. and more… Thank you
Very much appreciated efforts. Thanks a million for sharing your Python knowledge with us. It has been a wonderful journey with your precise explanations. Keep up the hard work! Warm regards.
import pandas as pd
import numpy as np

# Create a DataFrame
d = {'Name': ['Alisa','Bobby','jodha','jack','raghu','Cathrine',
              'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
     'Age': [26,24,23,22,23,24,26,24,22,23,24,24],
     'Score': [85,63,55,74,31,77,85,63,42,62,89,77]}
df = pd.DataFrame(d, columns=['Name','Age','Score'])
df
============================================================
When I write the code below:
df1 = df[df['Score']==85]
df.drop_duplicates(['Name'])
df
Result:
Alisa is still shown; it deletes only one Alisa, and I want both Alisa rows to be removed.
Can you please provide code?
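(One possible answer, sketched: keep=False drops every member of every duplicate group, so all of the repeated names disappear; note that drop_duplicates returns a new DataFrame, so the result must be assigned back:)

df = df.drop_duplicates(subset=['Name'], keep=False)
# every 'Alisa' row (and every other repeated name) is now gone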
I'm a beginner, so I can say that the video is very clear and precise, and anyone can understand the content. Thanks for sharing.
You are amazing!
I have some missing dates in my dataset and want to add those missing dates to it. I used isnull() to track these dates, but I don't know how to add them into my dataset. Can you please help? Thanks.
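(A possible sketch, assuming df has a DatetimeIndex and daily data; reindexing against a full date range inserts the missing dates with NaN values:)

full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)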
1.75X speed is your friend 🙂
Thanks so much for this! You helped me combine 629 files and remove 250k duplicate rows!
You're the man! Subscribed
Thanks for the video
you are amazing.
thank you ever so much <3
I couldn't find much about duplicates elsewhere. Thanks so much, sir. I can't thank you enough.
Thank you for this useful tutorial. Quick question: how do you check whether a value in column A is present in column B or not, not necessarily on the same row? It is like the same thing the VLOOKUP function does in Excel. Many thanks for your feedback!
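(One way, as a sketch, assuming a DataFrame df with columns 'A' and 'B':)

# True wherever the value in A appears anywhere in B, regardless of row
df['A_in_B'] = df['A'].isin(df['B'])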
You should have used the sort_values option: users.loc[users.duplicated(keep=False)].sort_values(by='age')
I really need help, guys.
I have a table that has a column named "Neighbourhood".
This column has a lot of names repeated many times.
To be specific, the column "Neighbourhood" has 10 names that are each repeated a lot of times.
My question is:
I need help creating a separate column specifying how many times each element in "Neighbourhood" occurs (see the sketch below).
If anyone can help me, please do.
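(One possible sketch, assuming the table is a DataFrame named df:)

# count how often each neighbourhood occurs and attach it to every row
df['Neighbourhood Count'] = df.groupby('Neighbourhood')['Neighbourhood'].transform('count')
# equivalently: df['Neighbourhood'].map(df['Neighbourhood'].value_counts())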
Thank you.