zoukankan      html  css  js  c++  java
  • 自然语言处理----语料库

        本文重点介绍预料库的一般操作。

        1.  使用nltk加载自己的预料库

     1 >>> from nltk.corpus import PlaintextCorpusReader
     2 >>> corpus_root=r'D:/00001/2002/Annual_txt'
     3 >>> reader=PlaintextCorpusReader(corpus_root, '.*')
     4 >>> reader.fileids()
     5 ['2001 Business Highlights .txt', 'Back Cover.txt', 'Balance Sheet.txt', 'Cheung Kong Infrastructure Holdings Limited.txt', 'Consolidated Balance Sheet.txt', 'Consolidated Cash Flow Statement.txt', 'C
     6 onsolidated Profit & Loss Account .txt', 'Consolidated Statement of Recognised Gains and Losses.txt', 'Contents.txt', 'Corporate Information.txt', 'Cover.txt', 'Development Projects.txt', "Directors'
     7 Biographical Information.txt", 'Extracts from Hutchison Whampoa Limited Financial Statements.txt', 'Financial Highlights.txt', 'Group Financial Summary.txt', 'Group Structure.txt', 'Hongkong Electric
     8 Holdings Limited.txt', 'Hutchison Whampoa Limited.txt', 'Management Discussion and Analysis.txt', 'Notes to Financial Statements.txt', 'Notice of Annual General Meeting.txt', 'Overseas Properties.txt'
     9 , 'Rental Properties.txt', 'Report of the Auditors.txt', 'Report of the Chairman and the Managing Director.txt', 'Report of the Directors.txt', 'Schedule of Major Properties.txt']
    10 >>>
    View Code

        这里将本地'D:/00001/2002/Annual_txt'文件夹作为一个预料库,操作里面的文件。

        

        2. 预料库的一般操作

        1) fileids(): 获取预料库中的文件列表

        2) fileids([categories]): 获取分类对应的语料库中的文件

        3)categories(): 获取语料库的分类

        4) categories([fileids]): 获取文件对应的语料库中的分类

        5) raw(): 获取语料库中的原始内容

        6)raw(fileids=[f1, f2, f3]): 获取指定文件中的原始内容

        7) words(): 获取整个语料库中的词汇

        8)words([fileids=[f1, f2, f3]): 获取指定文件中的词汇

        9) words(categories=[c1, c2]): 获取指定分类中的词汇

      10) sents(): 获取整个语料库中的句子

        11)sents([fileids=[f1, f2, f3]): 获取指定文件中的句子

        12) sents(categories=[c1, c2]): 获取指定分类中的句子

        13) abspath(fileid): 获取指定文件在磁盘上的位置

        14) encoding(fileid): 获取指定文件的文件编码(如果知道的话)

        15) open(fileid): 获取打开指定语料库文件的文件流

        16)readme(): 获取语料库的README文件的内容

     1 >>> from nltk.corpus import PlaintextCorpusReader
     2 >>> corpus_root=r'D:/00001/2002/Annual_txt'
     3 >>> reader=PlaintextCorpusReader(corpus_root, '.*')
     4 >>> reader.fileids()
     5 ['2001 Business Highlights .txt', 'Back Cover.txt', 'Balance Sheet.txt', 'Cheung Kong Infrastructure Holdings Limited.txt', 'Consolidated Balance Sheet.txt', 'Consolidated Cash Flow Statement.txt', 'C
     6 onsolidated Profit & Loss Account .txt', 'Consolidated Statement of Recognised Gains and Losses.txt', 'Contents.txt', 'Corporate Information.txt', 'Cover.txt', 'Development Projects.txt', "Directors'
     7 Biographical Information.txt", 'Extracts from Hutchison Whampoa Limited Financial Statements.txt', 'Financial Highlights.txt', 'Group Financial Summary.txt', 'Group Structure.txt', 'Hongkong Electric
     8 Holdings Limited.txt', 'Hutchison Whampoa Limited.txt', 'Management Discussion and Analysis.txt', 'Notes to Financial Statements.txt', 'Notice of Annual General Meeting.txt', 'Overseas Properties.txt'
     9 , 'Rental Properties.txt', 'Report of the Auditors.txt', 'Report of the Chairman and the Managing Director.txt', 'Report of the Directors.txt', 'Schedule of Major Properties.txt']
    10 >>> reader.categories()
    11 Traceback (most recent call last):
    12   File "<stdin>", line 1, in <module>
    13 AttributeError: 'PlaintextCorpusReader' object has no attribute 'categories'
    14 >>> reader.raw()[:100]
    15 u'CHEUNG KONG (HOLDINGS) LIMITED
    
    ANNUAL REPORT 2001
    
    2001 Business Highlights
    
    4
    
    January
    
    '
    16 >>> reader.raw(fileids='Balance Sheet.txt')[:100]
    17 u'CHEUNG KONG (HOLDINGS) LIMITED
    
    ANNUAL REPORT 2001
    
    Balance Sheet
    
    As at 31st December, 2001
    '
    18 >>> reader.raw()[:100]
    19 u'CHEUNG KONG (HOLDINGS) LIMITED
    
    ANNUAL REPORT 2001
    
    2001 Business Highlights
    
    4
    
    January
    
    '
    20 >>> reader.sents()[:5]
    21 [[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED'], [u'ANNUAL', u'REPORT', u'2001'], [u'2001', u'Business', u'Highlights'], [u'4'], [u'January']]
    22 >>> reader.sents(fileids='Balance Sheet.txt')[:5]
    23 [[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED'], [u'ANNUAL', u'REPORT', u'2001'], [u'Balance', u'Sheet'], [u'As', u'at', u'31st', u'December', u',', u'2001'], [u'44']]
    24 >>> reader.words(fileids='Balance Sheet.txt')[:5]
    25 [u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')']
    26 >>> reader.words()[:20]
    27 [u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED', u'ANNUAL', u'REPORT', u'2001', u'2001', u'Business', u'Highlights', u'4', u'January', u'*', u'Successfully', u'raised', u'a', u'5', u'-']
    28 >>> reader.abspath('Balance Sheet.txt')
    29 FileSystemPathPointer('D:\00001\2002\Annual_txt\Balance Sheet.txt')
    30 >>> reader.open('Balance Sheet.txt')
    31 <nltk.data.SeekableUnicodeStreamReader object at 0x064822B0>
    View Code

    显然,自定义的预料库因为没有设置categories属性,所以涉及相关的操作无法进行。

  • 相关阅读:
    join_tab计算代价
    outer join test
    突然觉得mysql优化器蛮简单
    将数据库字段从float修改为decimal
    小米初体验
    简述安装android开发环境
    Rust语言:安全地并发
    awk里的各种坑
    ubuntu下使用C语言开发一个cgi程序
    Ubuntu下安装和配置Apache2
  • 原文地址:https://www.cnblogs.com/no-tears-girl/p/6955640.html
Copyright © 2011-2022 走看看