  • NLTK: the regular expression tokenizer (nltk.regexp_tokenize)

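    Page 121 of 《Python自然语言处理》 (Natural Language Processing with Python) contains an example that uses NLTK's built-in regular expression tokenizer, nltk.regexp_tokenize. The code in the book is: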

    import nltk

    text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ?  _'
    pattern = r'''(?x)        # set flag to allow verbose regexps
          ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
        | \w+(-\w+)*          # words with optional internal hyphens
        | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
        | \.\.\.              # ellipsis
        | (?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
    '''
    print(nltk.regexp_tokenize(text, pattern))

    The '8%' and '_' at the end of the text variable are my own additions.

    The expected output is:

    ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']

    But the actual output is:

    [('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '-ed', ''), ('', '', '.40'), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

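    This happens because nltk.internals.compile_regexp_to_noncapturing() was abandoned in NLTK V3.1 (it still works in earlier versions). The capturing groups in the pattern are therefore no longer converted to non-capturing ones, and the result contains the captured groups instead of the full matches. The fix is to make the groups in the pattern non-capturing by hand (reference: https://blog.csdn.net/baimafujinji/article/details/51051505):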

    pattern = r'''(?x)        # set flag to allow verbose regexps
          (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*        # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
        | \.\.\.              # ellipsis
        | (?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
    '''
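    The effect of the groups can be reproduced with the standard re module alone. The sketch below assumes that, in NLTK 3.1 and later, nltk.regexp_tokenize ends up returning what re.findall returns for the pattern; the shortened sample text and the variable names are made up for illustration.

```python
import re

# Shortened sample text, chosen for illustration only.
sample = "U.S.A. poster-print $12.40"

# The same three alternatives, once with capturing groups (...)
# and once with non-capturing groups (?:...).
capturing = r"([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?"
non_capturing = r"(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?"

# With capturing groups, findall returns one tuple of group contents per match.
groups = re.findall(capturing, sample)
# Without capturing groups, findall returns the full text of each match.
tokens = re.findall(non_capturing, sample)

print(groups)  # [('A.', '', ''), ('', '-print', ''), ('', '', '.40')]
print(tokens)  # ['U.S.A.', 'poster-print', '$12.40']
```

    The tuples have the same shape as the unexpected output above: one slot per capturing group, empty wherever that group did not take part in the match.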

    With the modified pattern the actual output is:

    ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8', '?', '_']

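    The '8' should have come out as '8%'. It turns out that either removing the '*' from the word alternative (the third line of the pattern) or swapping the third and fourth lines, so that the currency and percentage alternative comes before the word alternative, makes '8%' come out correctly. Note that removing the '*' makes the hyphenated part mandatory, so plain words such as 'That' no longer match the word alternative; swapping the two lines is the safer fix. The modified code is: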

    pattern = r'''(?x)        # set flag to allow verbose regexps
          (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
        | \w+(?:-\w+)*        # words with optional internal hyphens
        | \.\.\.              # ellipsis
        | (?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
    '''
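    The reordered pattern can be checked without depending on a particular NLTK version. This is only a sketch: it uses re.findall as a stand-in for nltk.regexp_tokenize, which is an assumption about how the tokenizer applies the pattern and not something taken from the book.

```python
import re

text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ?  _'

# Reordered pattern: currency/percentages are tried before plain words.
pattern = r'''(?x)        # set flag to allow verbose regexps
      (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | \.\.\.              # ellipsis
    | (?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
'''

# Stand-in for nltk.regexp_tokenize(text, pattern); see the assumption above.
final_tokens = re.findall(pattern, text)
print(final_tokens)
# ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']
```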

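    The output is now correct. The cause is not the '*' as such but the order of the alternatives: Python's re module tries the branches of an alternation from left to right and takes the first one that matches at the current position. With the word alternative listed first, '8' is consumed by \w+ (digits count as word characters) before the currency and percentage alternative is ever tried. The leftover '%' matches none of the alternatives and is skipped. Listing the currency and percentage alternative first lets '8%' match as a whole.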
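    Python's re module tries the alternatives of a pattern from left to right, so the order of the two branches alone decides whether '8%' stays together. A minimal sketch with two reduced patterns; the sample string and the variable names are made up for illustration:

```python
import re

sample = "8% off"

# Word alternative first: \w+ grabs the digit before the percentage branch is tried.
words_first = r"\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?"
# Currency/percentage alternative first: the digit and the '%' match together.
currency_first = r"\$?\d+(?:\.\d+)?%?|\w+(?:-\w+)*"

split_tokens = re.findall(words_first, sample)
kept_tokens = re.findall(currency_first, sample)

print(split_tokens)  # ['8', 'off']
print(kept_tokens)   # ['8%', 'off']
```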

     

  • Original article: https://www.cnblogs.com/LCharles/p/10876017.html