zoukankan html css js c++ java

针对句子或文章的 -- 关于切词组化的正序排布

有时候会有种需求就是：将一段句子里面的词语进行分词，分词结果根据构造的词典进行分词

(当然了解NLP的同学，jieba里面的自定义词典可以满足需求) 这里自己写了轮子共享下：

需求：

a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."

require_list = ["want to", "so much", "one chance"]

## 结果需要
Finish Result: ['i', 'love', 'you', 'so much', 'and', 'want to', 'do', 'something', 'for', 'you,', 'can', 'you', 'give', 'me', 'one chance', 'or', 'want to', 'marry', 'you.']


采用了逆向词组判定和递归的想法：

# 关于切词组化的正序排布
def tokenize(need_article1, rule_phrase, new_list):
    need_article = need_article1
    len_need_article = len(need_article)
    for word_join in rule_phrase:

        while word_join in need_article:
        # if word_join in need_article:
            length = len(word_join)
            index_first = need_article.find(word_join)

            if index_first == -1:
                break
            try:
                if index_first + length < len_need_article:
                    if need_article[index_first + length] == " " and (index_first == 0 or need_article[index_first - 1] == " "):

                        need_article_temp = need_article[:index_first + length]
                        specfic_phrase = need_article_temp[index_first::]

                        if need_article_temp[:index_first]:

                            before_phrase = need_article_temp[:index_first]

                            conbine_list = tokenize(before_phrase, rule_phrase, new_list)

                            if conbine_list:
                                new_list.append(specfic_phrase)

                                need_article =need_article[index_first + length:]

                        else:
                            new_list.append(specfic_phrase)
                            need_article = need_article[index_first + length:]

                    else:
                        break
            except Exception as e:
                # 这里越界不做处理，直接将句子扔进递归
                need_article_temp = need_article[:index_first + length]

                specfic_phrase = need_article_temp[index_first::]
                if need_article_temp[:index_first]:
                    before_phrase = need_article_temp[:index_first]

                    conbine_list = tokenize(before_phrase, rule_phrase, new_list)

                    if conbine_list:
                        new_list.append(specfic_phrase)

                        need_article = need_article[index_first + length:]

                else:
                    new_list.append(specfic_phrase)
                    need_article = need_article[index_first + length:]
    else:
        conbine_list2 = need_article.split()
        new_list.extend(conbine_list2)
    if need_article1 == need_article:
        if not conbine_list2:
            conbine_list = need_article.split()
            new_list.extend(conbine_list)
        return need_article

    return new_list

def tokenize_foo(a, list1, new_list):

    new_list_order = tokenize(a, list1, new_list)
    if a == new_list_order:
        # 还是返回了原来的字符串, 没有任何的词汇组合
        new_list_order = new_list_order.split()

    return new_list_order

主程序：

if __name__ == '__main__':


    a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."

    list1 = ["want to", "one chance", "so much"]

    new_list = []
    new_list_order = tokenize_foo(a, list1, new_list)
    print(new_list_order)

　结果展示

日积月累，小小的力量，大大的梦想...

查看全文

相关阅读:
js中操作Map集合
 js中json字符串与map的相互转化
 安卓开发--个人软件开发-day05
javabean工具类方法lombok
hadoop的datanode没有正常启动
 安卓开发--个人软件开发-day04
安卓开发--个人软件开发-day03
bigData学习笔记-hadoop总结
 spring框架学习之mybatis
安卓开发--个人软件开发-day02

原文地址：https://www.cnblogs.com/harp-yestar/p/tokenize_pharase.html