zoukankan html css js c++ java

自然语言处理之中文分词算法

中文分词算法一般分为三类：

1.基于词表的分词算法

正向最大匹配算法FMM
逆向最大匹配算法BMM
双向最大匹配算法BM

2.基于统计模型的分词算法：基于N-gram语言模型的分词算法

3.基于序列标注的分词算法

基于HMM
基于CRF
基于深度学习的端到端的分词算法

下面介绍三类基于词表的分词算法

一、正向最大匹配算法

概念：对于一般文本，从左到右，以贪心的方式切分出当前位置上长度最大的词。条件是必须基于字典，原理是单词的颗粒度越大，所能表示的含义越确切

步骤：

从一个字符串的开始位置选择一个最大长度的词长片段，如果序列不足最大词长，则选择全部序列
首先看该片段是否在字典中，如果是，则算为一个分出来的词，如果不是，则从右边开始减少一个字符，然后看短一点的这个片段是否在词典中，依次循环，直至剩下单字
此时序列变为第2步截取分词后剩下的部分序列

#使用正向最大匹配算法实现中文分词
words_dict = []#存放载入的词典

def init():
    '''
    读取词典文件
    载入词典
    :return:
    '''

    with open("dic/dic.txt","r",encoding="utf8") as dict_input:
        for word in dict_input:
            word_dict.append(word.strip())

#实现正向最大匹配算法中的切词方法
def cut_words(raw_sentence,words_dict):
    #统计字典中最长的词
    max_length = max(len(word) for word in words_dict)#找到句子中最长的词
    sentence = raw_sentence
    #统计序列长度
    word_length = len(sentence)
    #存储切分好的词语
    cut_word_list = []
    while word_length > 0:
        max_cut_length = min(max_length,max_cut_length)#选取词长和句子长中最小的一个
        subSentence = sentence[0:max_cut_length]
        while max_cut_length > 0:
            if subSentence in words_dict:#如果这个最长的词在我们的词典中，那么它就是最长的词了
                cut_word_list.append(subSentence)
                break
            elif max_cut_length == 1:#如果是单字作为一个的时候
                cut_word_list.append(subSentence)
                break
            else:#如果这个词不在字典中，并且也不是单字作为一个词的，就要把匹配长度-1
                max_cut_length = max_cut_length -1
                subSentence = subSentence[0:max_cut_length]#这时要把右边的词去掉
        sentence = sentence[max_cut_length:]#把找的最大的词去掉，剩下的继续循环
        word_length = word_length - max_cut_length
    # words = "/".join(cut_word_list)
    return cut_word_list

def main():
    """
    用于用户交互
    :return:
    """
    init()
    while True:
        print("请输入要分词的序列")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,word_dict)
        print("分词结果",result)

if __name__ == '__main__':
    main()

二、逆向最大匹配算法

BMM与FMM类似，只是分词顺序变为从右至左

但是，BMM和FMM对于歧义词的处理能力一般

#使用逆向最大匹配算法实现中文分词
words_dict = []

def init():
    """
    读取字典文件
    获取字典
    :return:
    """
    with open("dict/dic.txt","r",encoding="utf8") as dic_input:
        for word in dic_input:
            words_dict.append(word.strip())

#实现逆向最大匹配算法中的切词方法
def cut_words(raw_sentence,words_dict):
    #统计词典中词的最大长度
    max_length = max(len(word) for word in words_dict)
    sentence = raw_sentence.strip()
    #统计序列长度
    words_length = len(sentence)
    #存储切分好的词
    cut_word_list = []
    #判断是否需要继续切词
    while words_length > 0:
        max_cut_length = min(max_length, max_cut_length)  # 选取词长和句子长中最小的一个
        subSentence = sentence[-max_cut_length:]#从后往前取max_cut_length这么长
        while max_cut_length > 0:
            if subSentence in words_dict:
                cut_word_list.append(subSentence)
                break
            elif max_cut_length == 1:
                cut_word_list.append(subSentence)
                break
            else:
                max_cut_length = max_cut_length -1
                subSentence = subSentence[-max_cut_length:]
        sentence = sentence[0:-max_cut_length]
        words_length = words_length - max_cut_length
    cut_word_list.reverse()#切完之后的词是乱序的  这里为其逆序一下
    # words = "/".join(cut_word_list)
    return  cut_word_list

def main():
    """
    用于用户交互
    :return:
    """
    init()
    while True:
        print("请输入要分词的序列:")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,word_dict)
        print("分词结果:",result)

if __name__ == '__main__':
    main()

三、双向最大匹配算法

BI是将FMM和BMM得到的结果进行比较，得到正确的分词方法

启发式规则：

如果正、反向分词结果词数不同，则取分词数量较少的那个
如果分词词数相同：

分词的结果相同，则说明没有歧义，可返回任意一个
分词结果不同，则返回单字较少的那个

import BMM,FMM
#使用双向最大匹配算法实现中文分词
words_dict = []

def init():
    """
    读取字典文件
    获取字典
    :return:
    """
    with open("dict/dic.txt","r",encoding="utf8") as dic_input:
        for word in dic_input:
            words_dict.append(word.strip())

#实现双向最大匹配算法中的切词方法
def cut_words(raw_sentence,words_dict):
    bmm_word_list = BMM.cut_words(raw_sentence,words_dict)
    fmm_word_list = FMM.cut_words(raw_sentence,words_dict)
    bmm_word_list_size = len(bmm_word_list)
    fmm_word_list_size = len(fmm_word_list)
    if bmm_word_list_size != fmm_word_list_size:
        if bmm_word_list_size < fmm_word_list_size:
            return bmm_word_list
        else:
            return fmm_word_list
    else:
        FSingle = 0
        BSingle = 0
        isSame = True
        for i in range(len(fmm_word_list)):
            if fmm_word_list[i] not in bmm_word_list:#如果fmm和bmm的分词结果是不相同的
                isSame = False
            if len(fmm_word_list[i]) == 1:
                FSingle = FSingle + 1#如果fmm列表里的词长度为1，也就是说是单个词，那么就把单个词的数量+1
            if len(bmm_word_list[i]) == 1:
                BSingle = BSingle + 1
        if isSame:
            return fmm_word_list
        elif BSingle > FSingle:
            return fmm_word_list
        else:
            return bmm_word_list


def main():
    """
    用于用户交互
    :return:
    """
    init()
    while True:
        print("请输入要分词的序列:")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,words_dict)
        print("分词结果:",result)

if __name__ == '__main__':
    main()

查看全文

相关阅读:
Python中利用xpath解析HTML
常见聚类算法——K均值、凝聚层次聚类和DBSCAN比较
 格式化字符串format函数
 编程语言这个垂直方向
 CLR,GC 表示什么意思？
ASP.Net MVC开发基础学习笔记：一、走向MVC模式
 NPOI 通过excel模板写入数据并导出
 SQL 注意事项
 解决微信公众号OAuth出现40029(invalid code，不合法的oauth_code)的错误
 iis 站点中文乱码解决方案

原文地址：https://www.cnblogs.com/bep-feijin/p/9639932.html