  • Natural Language Processing with Python: 3.8 Segmentation

    WeChat official account: 数据运营人
    This series collects the author's reading notes. Please credit the source if you repost it.

    Chapter 3: Processing Raw Text

    3.8 Segmentation

    Sentence Segmentation

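    Processing text at the word level usually assumes that the text can be divided into individual sentences. Some corpora already provide sentence-level access. The following computes the average number of words per sentence in the Brown Corpus: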

    import nltk
    len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

    20.250994070456922

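    In other cases the text is available only as a stream of characters and must be segmented into sentences before it is tokenized into words. NLTK includes the Punkt sentence segmenter for this:

    import pprint

    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
    sents = sent_tokenizer.tokenize(text)
    pprint.pprint(sents[171:181])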

    ['"Nonsense!',
     '" said Gregory, who was very rational when anyone else attempted paradox.',
     '"Why do all the clerks and navvies in the railway trains look so sad and tired,…',
     'I will tell you.',
     'It is because they know that the train is going right.',
     'It is because they know that whatever place they have taken a ticket for that …',
     'It is because after they have passed Sloane Square they know that the next stat…',
     'Oh, their wild rapture!',
     'oh, their eyes like stars and their souls again in Eden, if the next station w…',
     '" "It is you who are unpoetical," replied the poet Syme.']
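
    For contrast, here is a minimal sketch of a naive rule-based splitter, which suggests why a trained segmenter is useful. The function name naive_sent_split and the sample text are made up for illustration and are not part of NLTK. The rule only splits after '.', '!' or '?' when whitespace follows, so it wrongly breaks after the abbreviation "Mr.".

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (deliberately naive)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Hypothetical sample text, chosen to contain an abbreviation
sample = 'Mr. Smith arrived late. "Nonsense!" said Gregory. I will tell you.'
for sentence in naive_sent_split(sample):
    print(repr(sentence))
# 'Mr.' ends up as a sentence of its own, which is the kind of error
# a trained segmenter such as Punkt is designed to avoid.
```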

    Word Segmentation

    In Chinese, the three-character string 爱国人 (ai4 “love” [verb], guo3 “country”, ren2 “person”) can be segmented as 爱国/人, “country-loving person”, or as 爱/国人, “love country-person”.
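
    To make the ambiguity concrete, the sketch below enumerates every possible segmentation of a short string by trying each combination of boundary bits between adjacent characters. The helper name all_segmentations is hypothetical. A string of n characters has n-1 possible boundaries and therefore 2^(n-1) segmentations.

```python
from itertools import product

def all_segmentations(text):
    """Return every segmentation of text, written with '/' between words."""
    results = []
    # One bit per gap between adjacent characters: '1' means "boundary here"
    for bits in product('01', repeat=len(text) - 1):
        words, last = [], 0
        for i, bit in enumerate(bits):
            if bit == '1':
                words.append(text[last:i + 1])
                last = i + 1
        words.append(text[last:])
        results.append('/'.join(words))
    return results

print(all_segmentations('爱国人'))
# → ['爱国人', '爱国/人', '爱/国人', '爱/国/人']
```

    Both readings from the text, 爱国/人 and 爱/国人, appear among the four candidates. Choosing between them requires information beyond the string itself.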

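    Example 1-1: Reconstructing segmented text from the string representations seg1 and seg2. seg1 and seg2 represent the initial and final segmentations of some hypothetical child-directed speech. The function segment() uses them to reproduce the segmented text.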

    def segment(text, segs):
        """Split text into words; a '1' at segs[i] marks a word boundary after text[i]."""
        words = []
        last = 0
        for i in range(len(segs)):
            if segs[i] == '1':
                words.append(text[last:i+1])
                last = i+1
        words.append(text[last:])  # the final word has no boundary marker after it
        return words

    text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
    seg1 = "0000000000000001000000000010000000000000000100000000000"
    seg2 = "0100100100100001001001000010100100010010000100010010000"
    print(segment(text, seg1))
    print(segment(text, seg2))

    ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
    ['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
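
    The boundary-string representation can also be built in the opposite direction. The sketch below uses a hypothetical helper, words_to_segs, that is not in the original notes: it turns a non-empty list of words back into a string of 0s and 1s, with one bit per gap between adjacent characters.

```python
def words_to_segs(words):
    """Encode a non-empty word list as a boundary string (inverse of segment())."""
    bits = []
    for word in words[:-1]:
        # Every character but the last is followed by '0'; the word ends with '1'
        bits.append('0' * (len(word) - 1) + '1')
    # The last word contributes no closing boundary
    bits.append('0' * (len(words[-1]) - 1))
    return ''.join(bits)

print(words_to_segs(['do', 'you', 'see']))
# → 0100100
```

    The result is always one character shorter than the concatenated text, which matches the lengths of text and seg1 in Example 1-1.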

    Example 1-2: Computing the cost of storing the lexicon and reconstructing the source text.

    def evaluate(text, segs):
        words = segment(text, segs)
        text_size = len(words)  # one unit per word token in the segmented text
        lexicon_size = len(' '.join(list(set(words))))  # characters of the distinct words, plus one separator between entries
        return text_size + lexicon_size  # lower scores are better

    text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
    seg1 = "0000000000000001000000000010000000000000000100000000000"
    seg2 = "0100100100100001001001000010100100010010000100010010000"
    seg3 = "0000100100000011001000000110000100010000001100010000001"
    print(segment(text, seg3))
    print(evaluate(text, seg3))
    print(evaluate(text, seg2))
    print(evaluate(text, seg1))

    ['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']
    46
    47
    63
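
    To see where a score such as 46 comes from, the sketch below splits the cost into its two parts. The helper name cost_breakdown is hypothetical, and segment() is repeated so the snippet runs on its own. It follows the same formula as evaluate() above.

```python
def segment(text, segs):
    """Split text into words; a '1' at segs[i] marks a boundary after text[i]."""
    words, last = [], 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def cost_breakdown(text, segs):
    """Return (text_size, lexicon_size, total) using the same formula as evaluate()."""
    words = segment(text, segs)
    text_size = len(words)                     # number of word tokens
    lexicon_size = len(' '.join(set(words)))   # distinct words joined by single spaces
    return text_size, lexicon_size, text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg3 = "0000100100000011001000000110000100010000001100010000001"
print(cost_breakdown(text, seg3))
# → (14, 32, 46)
```

    For seg3 there are 14 word tokens and 6 distinct words ('doyou', 'see', 'thekitt', 'y', 'thedogg', 'like') totalling 27 characters, plus 5 separators, which gives 14 + 32 = 46.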
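
    Example 1-3: Non-deterministic search using simulated annealing. The search starts from phrase-level segmentations only, then randomly flips 0s and 1s, with the number of flips proportional to the "temperature". Each iteration lowers the temperature, so the boundaries are perturbed less and less.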

    from random import randint

    def flip(segs, pos):
        # Toggle the boundary bit at position pos
        return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

    def flip_n(segs, n):
        # Toggle n randomly chosen boundary bits
        for i in range(n):
            segs = flip(segs, randint(0,len(segs)-1))
        return segs

    def anneal(text, segs, iterations, cooling_rate):
        temperature = float(len(segs))
        while temperature > 0.5:
            best_segs, best = segs, evaluate(text, segs)
            # Try random perturbations and keep the lowest-cost one
            for i in range(iterations):
                guess = flip_n(segs, int(round(temperature)))
                score = evaluate(text, guess)
                if score < best:
                    best, best_segs = score, guess
            score, segs = best, best_segs
            temperature = temperature / cooling_rate  # cool down: fewer flips next round
            print(evaluate(text, segs), segment(text, segs))
        return segs

    text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
    seg1 = "0000000000000001000000000010000000000000000100000000000"
    anneal(text, seg1, 5000, 1.2)

    60 ['doyouseetheki', 'tty', 'see', 'thedoggy', 'doyouliketh', 'ekittylike', 'thedoggy']
    58 ['doy', 'ouseetheki', 'ttysee', 'thedoggy', 'doy', 'o', 'ulikethekittylike', 'thedoggy']
    56 ['doyou', 'seetheki', 'ttysee', 'thedoggy', 'doyou', 'liketh', 'ekittylike', 'thedoggy']
    54 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'likethekittylike', 'thedoggy']
    53 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
    51 ['doyou', 'seethekittysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
    42 ['doyou', 'see', 'thekitty', 'see', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
    '0000100100000001001000000010000100010000000100010000000'
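
    The cooling behaviour can be inspected separately from the search. The sketch below reproduces only the temperature loop from anneal() and records how many bits would be flipped per guess in each round. The helper name cooling_schedule is hypothetical, and the values 55 (the number of boundary positions in seg1) and 1.2 (the cooling rate) are assumed inputs.

```python
def cooling_schedule(n_boundaries, cooling_rate):
    """List the number of bits flipped per guess in each cooling round."""
    temperature = float(n_boundaries)
    flips = []
    while temperature > 0.5:
        flips.append(int(round(temperature)))
        temperature = temperature / cooling_rate
    return flips

schedule = cooling_schedule(55, 1.2)
print(len(schedule))   # → 26
print(schedule[:5])    # → [55, 46, 38, 32, 27]
print(schedule[-1])    # → 1
```

    Under these assumptions the search runs for 26 rounds. Early rounds flip most of the boundary bits, while the final rounds flip a single bit at a time, so the search moves from coarse exploration to fine adjustment. Because the flips are random, the printed scores and segmentations differ from run to run.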

  • Original article: https://www.cnblogs.com/ly803744/p/10531012.html