  • [Natural Language Processing with Python] 11.4 Working with XML / 11.5 Working with Toolbox Data

    11.4 Working with XML

    Using XML for Linguistic Structures

    (2) <entry>
    <headword>whale</headword>
    <pos>noun</pos>
    <gloss>any of the larger cetacean mammals having a streamlined
    body and breathing through a blowhole on the head</gloss>
    </entry>
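
    As a small illustration, the entry above can be read programmatically. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree (not the nltk.etree copy used in the sessions of this post); the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The lexical entry from example (2), as a string.
entry_xml = """<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined
  body and breathing through a blowhole on the head</gloss>
</entry>"""

entry = ET.fromstring(entry_xml)

# findtext() returns the text of the first matching child element.
headword = entry.findtext('headword')
pos = entry.findtext('pos')
# Collapse the line break and indentation inside the gloss.
gloss = ' '.join(entry.findtext('gloss').split())

print(headword, pos)
print(gloss)
```

    Because each piece of information sits in its own named element, the headword, part of speech and gloss can be picked out without any string parsing.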

    The Role of XML

    (For more background on XML, please consult other references.)

    The ElementTree Interface

    >>> import nltk
    >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
    >>> from nltk.etree.ElementTree import ElementTree
    >>> merchant = ElementTree().parse(merchant_file)
    >>> merchant
    <Element PLAY at 22fa800>
    >>> merchant[0]
    <Element TITLE at 22fa828>
    >>> merchant[0].text
    'The Merchant of Venice'
    >>> merchant.getchildren()
    [<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element SCNDESCR at 2300170>,
    <Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 234ec88>,
    <Element ACT at 23c87d8>, <Element ACT at 2439198>, <Element ACT at 24923c8>]

    We can use further methods to work with the XML:

    >>> for i, act in enumerate(merchant.findall('ACT')):
    ...     for j, scene in enumerate(act.findall('SCENE')):
    ...         for k, speech in enumerate(scene.findall('SPEECH')):
    ...             for line in speech.findall('LINE'):
    ...                 if 'music' in str(line.text):
    ...                     print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text)
    Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
    Act 3 Scene 2 Speech 9: Fading in music: that the comparison
    Act 3 Scene 2 Speech 9: And what is music then? Then music is
    Act 5 Scene 1 Speech 23: And bring your music forth into the air.
    Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
    Act 5 Scene 1 Speech 23: And draw her home with music.
    Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
    Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
    Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
    Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
    Act 5 Scene 1 Speech 25: The man that hath no music in himself,
    Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
    Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
    Act 5 Scene 1 Speech 32: No better a musician than the wren.
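
    The same nested search can be tried without the Shakespeare corpus. This is only a sketch: it assumes Python 3 and the standard library's xml.etree.ElementTree, and the tiny play below is invented for the example, reusing the element names ACT, SCENE, SPEECH and LINE.

```python
import xml.etree.ElementTree as ET

# A tiny made-up play with the same element names as the Shakespeare XML.
play_xml = """<PLAY>
  <TITLE>Toy Play</TITLE>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>No music here? Wait.</LINE></SPEECH>
      <SPEECH><SPEAKER>B</SPEAKER><LINE>Silence.</LINE><LINE/></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>Let the music play.</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>"""

play = ET.fromstring(play_xml)

hits = []
for i, act in enumerate(play.findall('ACT'), start=1):
    for j, scene in enumerate(act.findall('SCENE'), start=1):
        for k, speech in enumerate(scene.findall('SPEECH'), start=1):
            for line in speech.findall('LINE'):
                # line.text is None for an empty <LINE/>, so guard against it.
                if line.text and 'music' in line.text:
                    hits.append((i, j, k, line.text))

for i, j, k, text in hits:
    print("Act %d Scene %d Speech %d: %s" % (i, j, k, text))
```

    The session above wraps line.text in str() for the same reason: some LINE elements have no text of their own, and testing 'music' in None would raise a TypeError.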

    We can also look at the sequence of speakers, and use a frequency distribution to see who has the most to say:

    >>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
    >>> speaker_freq = nltk.FreqDist(speaker_seq)
    >>> top5 = speaker_freq.keys()[:5]
    >>> top5
    ['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO']

    We can also look for patterns in who follows whom in the dialogue.

    >>> mapping = nltk.defaultdict(lambda: 'OTH')
    >>> for s in top5:
    ...     mapping[s] = s[:4]
    ...
    >>> speaker_seq2 = [mapping[s] for s in speaker_seq]
    >>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
    >>> cfd.tabulate()
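
    A conditional frequency distribution over bigrams amounts to a table of counts keyed by the first item of each pair. The sketch below shows that idea using only the Python 3 standard library; the speaker sequence is made up and stands in for speaker_seq2.

```python
from collections import Counter, defaultdict

# A made-up speaker sequence; in the session above it would be speaker_seq2.
speakers = ['PORT', 'BASS', 'PORT', 'OTH', 'PORT', 'BASS', 'SHYL', 'BASS']

# follows[a][b] counts how often speaker b speaks immediately after speaker a.
follows = defaultdict(Counter)
for prev, nxt in zip(speakers, speakers[1:]):
    follows[prev][nxt] += 1

for prev in sorted(follows):
    print(prev, dict(follows[prev]))
```

    Each row of the tabulated output corresponds to one key of follows, and each column to a speaker who may come next.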

    Using ElementTree to Access Toolbox Data

    We can use toolbox.xml() to load a Toolbox file.

    >>> from nltk.corpus import toolbox
    >>> lexicon = toolbox.xml('rotokas.dic')

    We can access its contents by index:

    >>> lexicon[3][0]
    <Element lx at 77bd28>
    >>> lexicon[3][0].tag
    'lx'
    >>> lexicon[3][0].text
    'kaa'

    We can also access the XML contents using paths:

    >>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]
    ['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko', 'kaakaavo', 'kaakaoko',
    'kaakasi', 'kaakau', 'kaakauko', 'kaakito', 'kaakuupato', ..., 'kuvuto']
    >>> import sys
    >>> from nltk.etree.ElementTree import ElementTree
    >>> tree = ElementTree(lexicon[3])
    >>> tree.write(sys.stdout)
    <record>
    <lx>kaa</lx>
    <ps>N</ps>
    <pt>MASC</pt>
    <cl>isi</cl>
    <ge>cooking banana</ge>
    <tkp>banana bilong kukim</tkp>
    <pt>itoo</pt>
    <sf>FLORA</sf>
    <dt>12/Aug/2005</dt>
    <ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
    <xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
    <xe>Taeavi planted banana in order to cook it.</xe>
    </record>
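
    Path access and serialization can also be tried on a small in-memory lexicon. This is a sketch under assumptions: Python 3, the standard library's xml.etree.ElementTree, and a two-record lexicon written for the example (the root element name toolbox_data is likewise only illustrative).

```python
import xml.etree.ElementTree as ET

lexicon_xml = """<toolbox_data>
  <record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>
  <record><lx>Kakarapaia</lx><ps>N</ps><ge>village name</ge></record>
</toolbox_data>"""

lexicon = ET.fromstring(lexicon_xml)

# 'record/lx' selects every <lx> that is a child of a <record> child of the root.
lexemes = [lx.text.lower() for lx in lexicon.findall('record/lx')]
print(lexemes)

# Serialize a single record back to XML text; strip() drops the
# whitespace that followed the element in the source string.
record_text = ET.tostring(lexicon[0], encoding='unicode').strip()
print(record_text)
```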

    Formatting Entries

    We can generate output in whatever format we need, for example an HTML table:

    >>> html = "<table>\n"
    >>> for entry in lexicon[70:80]:
    ...     lx = entry.findtext('lx')
    ...     ps = entry.findtext('ps')
    ...     ge = entry.findtext('ge')
    ...     html += " <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
    >>> html += "</table>"
    >>> print html
    <table>
     <tr><td>kakae</td><td>???</td><td>small</td></tr>
     <tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
     <tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
     <tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
     <tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
     <tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
     <tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
     <tr><td>kakara</td><td>N</td><td>arm band</td></tr>
     <tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
     <tr><td>kakarau</td><td>N</td><td>frog</td></tr>
    </table>
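
    When generating HTML from lexicon fields, two details are worth handling: a field may be missing from a record, and field text may contain characters such as & or <. The sketch below is one possible way to deal with both, assuming Python 3 and the standard library's html.escape; the two records and the 'n/a' placeholder are invented for the example.

```python
import html
import xml.etree.ElementTree as ET

lexicon = ET.fromstring("""<toolbox_data>
  <record><lx>kakae</lx><ge>small</ge></record>
  <record><lx>kakapu</lx><ps>V</ps><ge>place in sling &amp; carry</ge></record>
</toolbox_data>""")

rows = []
for entry in lexicon.findall('record'):
    # default='n/a' stands in for a missing field instead of printing None.
    cells = [entry.findtext(tag, default='n/a') for tag in ('lx', 'ps', 'ge')]
    # Escape &, < and > so field text cannot break the table markup.
    tds = "".join("<td>%s</td>" % html.escape(c) for c in cells)
    rows.append(" <tr>" + tds + "</tr>")

table = "<table>\n" + "\n".join(rows) + "\n</table>"
print(table)
```

    Collecting the rows in a list and joining them once also avoids repeated string concatenation, although for ten entries the difference is negligible.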

    11.5 Working with Toolbox Data

    Adding a Field to Each Entry

    Example 11-2. Adding a new cv field to a lexical entry
    import re
    from nltk.etree.ElementTree import SubElement
    def cv(s):
        s = s.lower()
        s = re.sub(r'[^a-z]', r'_', s)    # non-letters become '_'
        s = re.sub(r'[aeiou]', r'V', s)   # vowels become 'V'
        s = re.sub(r'[^V_]', r'C', s)     # everything else is a consonant
        return (s)
    def add_cv_field(entry):
        for field in entry:
            if field.tag == 'lx':
                cv_field = SubElement(entry, 'cv')
                cv_field.text = cv(field.text)
    >>> lexicon = toolbox.xml('rotokas.dic')
    >>> add_cv_field(lexicon[53])
    >>> print nltk.to_sfm_string(lexicon[53])
    \lx kaeviro
    \ps V
    \pt A
    \ge lift off
    \ge take off
    \tkp go antap
    \sc MOTION
    \vx 1
    \nt used to describe action of plane
    \dt 03/Jun/2005
    \ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
    \xp Pita i go antap na lukim haus win i bagarapim.
    \xe Peter went to look at the house that the wind destroyed.
    \cv CVVCVCV
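    Validating a Toolbox Lexicon

    Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or order existing fields in a new way.

    For example, with the help of a FreqDist we can easily find field sequences that occur with unusual frequency: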

    >>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
    >>> fd.items()
    [('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41), ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
    ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27), ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
    ..., ('lx:alt:rt:ps:pt:ge:eng:eng:eng:tkp:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 1)]
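
    The same check can be sketched with collections.Counter on a few hand-written records. The assumptions here are Python 3, the standard library only, and a made-up three-record lexicon; treating a sequence seen exactly once as suspicious is an arbitrary threshold chosen for the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

lexicon = ET.fromstring("""<toolbox_data>
  <record><lx>a</lx><ps>N</ps><ge>x</ge></record>
  <record><lx>b</lx><ps>V</ps><ge>y</ge></record>
  <record><lx>c</lx><ge>z</ge><ps>N</ps></record>
</toolbox_data>""")

# Summarize each record as a colon-joined string of its field tags.
patterns = Counter(':'.join(field.tag for field in record)
                   for record in lexicon.findall('record'))

# Sequences seen only once are candidates for manual checking.
rare = [p for p, n in patterns.items() if n == 1]
print(patterns.most_common())
print(rare)
```

    In this toy lexicon the third record lists ge before ps, so its tag sequence stands out from the other two.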
  • Original post: https://www.cnblogs.com/createMoMo/p/3120933.html