
Kaggle-data-cleaning(5)

Inconsistent-data-entry

Tutorial

In this lesson, we'll learn how to clean up inconsistent text entries.

Do some preliminary text pre-processing

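First, use head() to look at the first few rows of the file.

Say we're interested in cleaning up the "City" column to make sure there are no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There's a more efficient way to do this, though: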

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

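Just looking at this, I can see some problems caused by inconsistent data entry: "Lahore" and "Lahore " (the latter has a trailing space), for example, or "Lakki Marwat" and "Lakki marwat" (capitalization).

The first thing I'm going to do is make everything lower case (I can change it back at the end if I like) and remove any white space at the beginning and end of cells. Inconsistent capitalization and stray leading or trailing white space are very common in text data, and by doing this you can fix roughly 80% of your text data entry inconsistencies.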

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing white space
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
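To see why these two steps collapse so many variants, here is a minimal sketch in plain Python (no pandas). The sample values are the ones mentioned above, and the `normalize` helper is a hypothetical name used only for illustration; it mirrors what `.str.lower()` and `.str.strip()` do to each cell.

```python
# Illustrative values taken from the inconsistencies noted above.
raw_cities = ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat"]

def normalize(value):
    """Lower-case a string and drop leading/trailing white space."""
    return value.lower().strip()

# Unique values before and after normalizing.
before = sorted(set(raw_cities))
after = sorted(set(normalize(city) for city in raw_cities))

print(len(before), before)
print(len(after), after)
```

Four distinct raw strings shrink to two cleaned ones, which is the effect we are after in the "City" column.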

Use fuzzy matching to correct inconsistent data entry

Alright, let's take another look at the "City" column and see if there's any more data cleaning we need to do.

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

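It does look like there are some remaining inconsistencies: "d. i khan" and "d.i khan" should probably be the same. (I looked it up and "d.g khan" is a separate city, so I shouldn't combine those.)

I'm going to use the fuzzywuzzy package to help identify which strings are closest to each other. This dataset is small enough that we could probably correct the errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it's fun!)

Fuzzy matching: the process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change to transform one string into the other. So "apple" and "snapple" are two changes away from each other (add "s" and "n"), while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that are closest to "d.i khan".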
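The "number of changes" idea can be made concrete with the classic Levenshtein edit distance. The sketch below is a plain-Python illustration of that idea, not the scoring code fuzzywuzzy itself uses; `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(edit_distance("apple", "snapple"))
print(edit_distance("in", "on"))
```

The two calls correspond to the examples in the definition: two insertions for "apple" and "snapple", one substitution for "in" and "on".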

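# fuzzywuzzy's submodules have to be imported before they can be used
import fuzzywuzzy
from fuzzywuzzy import fuzz, process

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)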

# take a look at them
matches

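We can see that two of the items in cities are very close to "d.i khan": "d. i khan" and "d.i khan". We can also see that "d.g khan", which is a separate city, has a ratio of 88. Since we don't want to replace "d.g khan" with "d.i khan", let's replace only the rows in the "City" column whose ratio with "d.i khan" is 90 or higher.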
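To make the ratio less of a black box, here is a rough standard-library approximation of a token-sort style score and a ranked lookup. It is a sketch under assumptions, not fuzzywuzzy's implementation: `token_sort_score` and `top_matches` are hypothetical names, and the scores come from `difflib.SequenceMatcher`, so they may not match fuzzywuzzy's numbers on other inputs.

```python
import re
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Rough 0-100 similarity: normalize both strings, sort their tokens,
    then compare the results character by character."""
    def normalize(text):
        # Lower-case, turn anything other than ASCII letters or digits
        # into a space, then sort the white-space separated tokens.
        tokens = re.sub(r"[^0-9a-z]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)

def top_matches(query, choices, limit=10):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, token_sort_score(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# A few of the city names discussed above, used as sample input.
sample_cities = ["d. i khan", "d.g khan", "d.i khan", "karak", "lahore"]
for city, score in top_matches("d.i khan", sample_cities, limit=3):
    print(city, score)
```

Because punctuation and spacing are normalized away before comparing, "d. i khan" and "d.i khan" end up identical, while "d.g khan" differs by one token and should land below the 90 cutoff.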

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

With this function in place, we can replace the close matches in a column with a single consistent value.
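Before overwriting real data, it can help to preview which values a given threshold would rewrite. The sketch below is a dry run on a plain list rather than a DataFrame; `preview_replacements` is a hypothetical helper, and it scores with `difflib.SequenceMatcher` (no token sorting), so its numbers are only an approximation of what fuzzywuzzy would report.

```python
from difflib import SequenceMatcher

def preview_replacements(values, target, min_ratio=90):
    """Return {original: target} for every distinct value that scores
    at least min_ratio against target and is not already equal to it."""
    planned = {}
    for value in sorted(set(values)):
        score = round(100 * SequenceMatcher(None, value, target).ratio())
        if score >= min_ratio and value != target:
            planned[value] = target
    return planned

# Sample values from the tutorial's "City" column.
sample_cities = ["d. i khan", "d.g khan", "d.i khan", "lahore"]

# With the default cutoff only the spacing variant is scheduled for rewriting.
print(preview_replacements(sample_cities, "d.i khan"))

# A looser cutoff would also sweep up the separate city "d.g khan".
print(preview_replacements(sample_cities, "d.i khan", min_ratio=85))
```

Comparing the two previews shows why the cutoff matters: set it too low and distinct cities get merged.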

Exercises

1) Examine another column

Write code below to take a look at all the unique values in the "Province" column.

# TODO: Your code here
# convert to lower case
suicide_attacks['Province'] = suicide_attacks['Province'].str.lower()
# remove leading and trailing white space
suicide_attacks['Province'] = suicide_attacks['Province'].str.strip()

# get all the unique values in the 'Province' column
Province = suicide_attacks['Province'].unique()
print(Province)

Output:

['capital' 'sindh' 'baluchistan' 'punjab' 'fata' 'kpk' 'ajk' 'balochistan']
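The output above contains both 'baluchistan' and 'balochistan', which look like spelling variants of the same name. As a hedged sketch, the standard-library snippet below flags such near-duplicate pairs automatically; `near_duplicates` and its 0.85 cutoff are illustrative assumptions, and any flagged pair should still be checked by hand before merging.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The unique values printed above.
provinces = ["capital", "sindh", "baluchistan", "punjab",
             "fata", "kpk", "ajk", "balochistan"]

def near_duplicates(values, min_ratio=0.85):
    """Return pairs of distinct values whose similarity ratio (0.0-1.0)
    is at least min_ratio."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        if SequenceMatcher(None, a, b).ratio() >= min_ratio:
            pairs.append((a, b))
    return pairs

print(near_duplicates(provinces))
```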

2) Do some text pre-processing

Convert every entry in the "Province" column of the suicide_attacks DataFrame to lower case.

# TODO: Your code here
suicide_attacks['Province'] = suicide_attacks['Province'].str.lower()
# Check your answer
q2.check()

3) Continue working with cities

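In the tutorial, we focused on cleaning up inconsistencies in the "City" column. Run the code cell below to view the list of unique values that we ended with.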

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Output:

array(['attock', 'bajaur agency', 'bannu', 'bhakkar', 'buner', 'chakwal',
       'chaman', 'charsadda', 'd.g khan', 'd.i khan', 'dara adam khel',
       'fateh jang', 'ghallanai, mohmand agency', 'gujrat', 'hangu',
       'haripur', 'hayatabad', 'islamabad', 'jacobabad', 'karachi',
       'karak', 'khanewal', 'khuzdar', 'khyber agency', 'kohat',
       'kuram agency', 'kurram agency', 'lahore', 'lakki marwat',
       'lasbela', 'lower dir', 'malakand', 'mansehra', 'mardan',
       'mohmand agency', 'mosal kor, mohmand agency', 'multan',
       'muzaffarabad', 'north waziristan', 'nowshehra', 'orakzai agency',
       'peshawar', 'pishin', 'poonch', 'quetta', 'rawalpindi', 'sargodha',
       'sehwan town', 'shabqadar-charsadda', 'shangla', 'shikarpur',
       'sialkot', 'south waziristan', 'sudhanoti', 'sukkur', 'swabi',
       'swat', 'taftan', 'tangi, charsadda district', 'tank', 'taunsa',
       'tirah valley', 'totalai', 'upper dir', 'wagah', 'zhob'],
      dtype=object)

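Take another look at the "City" column and see if there's any more data cleaning we need to do.

It looks like "kuram agency" and "kurram agency" should be the same city. Correct the "City" column in the dataframe so that every occurrence of "kuram agency" becomes "kurram agency".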

# TODO: Your code here!

rows_with_matches = (suicide_attacks['City'] == 'kuram agency')
suicide_attacks.loc[rows_with_matches, 'City'] = 'kurram agency'
# Check your answer
q3.check()


Original article: https://www.cnblogs.com/caishunzhe/p/13440840.html