Using Data Science Skills Now: Cleaning up text data. | by Dawn Moyer | Dec, 2020


Be a workplace Hero! Automate the annoying task of cleaning up the spelling and formatting of categorical text columns with fuzzy matching.

Image by PublicDomainPictures from Pixabay

The Problem

The Data

import pandas as pd  
import numpy as np
from platform import python_version
print('python version: ' + python_version())
path = '<path to your files>'# create dataframes
official_list_df = pd.read_excel(path + 'holiday official list.xlsx')
hand_typed_list_df = pd.read_excel(path + 'holidays.xlsx')
# create lists
official_list = official_list_df['Holiday Name'].fillna('*').to_list()
hand_typed_list = hand_typed_list_df['Holiday'].fillna('*').to_list()
validation_list = hand_typed_list_df['Validation'].fillna('*').to_list()

Possible Solutions

import fuzzymatcherhand_typed_list_join_on = [‘Holiday’]
official_list_join_on = [‘Holiday Name’]
matched = fuzzymatcher.fuzzy_left_join(hand_typed_list_df,
hand_typed_list_join_on, #left join on
official_list_join_on, #right join on
right_id_col=’Holiday Name’)
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# create a results dataset so we can review
results_df = pd.DataFrame(columns = ['Hand_Typed_Name', 'Matched_Name', 'Match_Score', 'Allow_Flag'
, 'Validation_value', 'Matched_Flag', 'Success_Code'])
# iterate over each element on the lists
for index, row in hand_typed_list_df.iterrows():
result = process.extractOne(row.Holiday, official_list)
# does the potential match value equal the validation value?
if str(result[0]) == row.Validation:
res = 'Yes'
res = 'No'
# does the match score meet the minimum threshold set?
if result[1] > 86:
allow = 'Yes'
allow = 'No'
# are my match and scoring cutoff flags consistant?
if res == allow:
final = 'Passed'
final = 'Needs Review'

# append the data to the results dataframe
new_row = { 'Hand_Typed_Name' : str(row.Holiday)
, 'Matched_Name' : str(result[0])
, 'Match_Score' : str(result[1])
, 'Allow_Flag' : str(allow)
, 'Validation_value' : str(row.Validation)
, 'Matched_Flag' : str(res)
, 'Success_Code' : str(final)
results_df = results_df.append(new_row, ignore_index=True)



Read More …


Write a comment