  • Natural Language Communication System, phxnet Team, Innovation Training, Personal Blog (14)

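    Study notes on WikiExtractor:

    WikiExtractor is a Python script dedicated to extracting and cleaning Wikipedia dump data. It supports Python 2.7 or Python 3.3+, has no additional dependencies, and is very easy to install and use:

    Installation: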

    git clone https://github.com/attardi/wikiextractor.git
    cd wikiextractor/
    sudo python setup.py install

    Usage:

    WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
    ......
    INFO: 53665431  Pampapaul
    INFO: 53665433  Charles Frederick Zimpel
    INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

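    The whole run took a little over two hours (8363.5 seconds) and extracted roughly 5.37 million articles. For my machine configuration, see the post 《深度学习主机攒机小记》 (notes on building a deep learning workstation).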
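As a quick sanity check on the summary line of the log above, the sketch below parses it with a regular expression and recomputes the throughput and the elapsed time in hours. The function name `parse_summary` and the pattern are illustrative assumptions written only against the one log line shown here; other WikiExtractor versions may format this line differently.

```python
import re

# Pattern written against the single summary line shown in the log above
# (an assumption, not a format guaranteed by WikiExtractor).
SUMMARY_RE = re.compile(
    r"Finished (\d+)-process extraction of (\d+) articles in ([\d.]+)s"
)

def parse_summary(line):
    """Return (processes, articles, seconds) from the final INFO line, or None."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

line = "INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)"
processes, articles, seconds = parse_summary(line)
rate = articles / seconds   # articles per second
hours = seconds / 3600      # elapsed wall-clock time in hours
print(f"{processes} processes, {rate:.1f} articles/s, {hours:.2f} hours")
# prints: 11 processes, 642.7 articles/s, 2.32 hours
```

The recomputed rate agrees with the 642.7 art/s that the log itself reports.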

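    The extracted output is split in order and stored across multiple subdirectories.

    Each subdirectory in turn holds a number of files named wiki_num, each about 1 MB in size. This size can be controlled with the -b option: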

    -b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
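To make the `n[KMG]` notation concrete, here is a minimal sketch that converts such a size string into a byte count. It assumes the suffixes are binary multiples (K = 1024, M = 1024^2, G = 1024^3) and that a bare number means plain bytes; `parse_size` is a hypothetical helper, not a function taken from WikiExtractor.

```python
def parse_size(text):
    """Convert a size such as '500K', '1M' or '2G' into a number of bytes.

    Assumption: suffixes are powers of 1024; no suffix means plain bytes.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)

print(parse_size("1M"))    # the default size from the help text: 1048576
print(parse_size("500K"))  # 512000
```

Under that assumption, passing `-b 10M` would cap each output file at about 10 MiB, which gives fewer and larger wiki_num files.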

    Let's look at the actual contents of wiki_00:

    <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
    Anarchism

    Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
    ...
    Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.


    </doc>
    <doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
    Autism

    Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
    ...
    </doc>
    ...

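    Each wiki_num file in turn contains a number of docs, and each doc carries its own tag attributes, including id, url, and title, so the articles are easy to tell apart.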
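Because every article is wrapped in a `<doc ...>` and `</doc>` pair, the output is easy to read back. The following is a minimal sketch, assuming the format is exactly as in the excerpt above (one opening tag line carrying `id`, `url` and `title`, then the body text, then a closing `</doc>` line). `iter_docs` and `iter_wiki_files` are hypothetical helper names, the regex-based parsing is my own choice rather than anything shipped with WikiExtractor, and the sample text is an abbreviated version of the excerpt.

```python
import os
import re

# Matches the opening tag as it appears in the excerpt above.
DOC_OPEN = re.compile(r'<doc id="([^"]*)" url="([^"]*)" title="([^"]*)">')

def iter_docs(lines):
    """Yield one dict per <doc> block found in an iterable of text lines."""
    doc, body = None, []
    for line in lines:
        stripped = line.strip()
        m = DOC_OPEN.match(stripped)
        if m:
            doc = {"id": m.group(1), "url": m.group(2), "title": m.group(3)}
            body = []
        elif stripped == "</doc>" and doc is not None:
            doc["text"] = "\n".join(body).strip()
            yield doc
            doc = None
        elif doc is not None:
            body.append(line.rstrip("\n"))

def iter_wiki_files(root):
    """Yield paths of all wiki_* files below root, in a stable sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in sorted order
        for name in sorted(filenames):
            if name.startswith("wiki_"):
                yield os.path.join(dirpath, name)

# Abbreviated sample in the same shape as the wiki_00 excerpt.
sample = '''<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies.
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a neurodevelopmental disorder.
</doc>
'''
docs = list(iter_docs(sample.splitlines()))
print([(d["id"], d["title"]) for d in docs])
# prints: [('12', 'Anarchism'), ('25', 'Autism')]
```

To process the real output, each path yielded by `iter_wiki_files("enwiki")` can be opened as UTF-8 text and the file object passed straight to `iter_docs`, since it only needs an iterable of lines.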

  • Original article: https://www.cnblogs.com/qiaoyanlin/p/6891559.html