  • Wikipedia Processing

    For Chinese, https://dumps.wikimedia.org/zhwiki/latest/

    zhwiki-latest-pages-articles.xml.bz2

    For English, https://dumps.wikimedia.org/enwiki/latest/

    enwiki-latest-pages-articles.xml.bz2
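
    Both dumps follow the same naming pattern, so the full download URL can be built from the language code. The snippet below is a minimal sketch, assuming the pattern seen for zh and en also holds for other languages; dump_url is a name made up for this illustration.

```python
def dump_url(lang):
    """Build the URL of the latest pages-articles dump for a language code.

    Assumes the naming pattern seen for 'zh' and 'en' applies to `lang` too.
    """
    base = "https://dumps.wikimedia.org/%swiki/latest/" % lang
    file_name = "%swiki-latest-pages-articles.xml.bz2" % lang
    return base + file_name


if __name__ == "__main__":
    for lang in ("zh", "en"):
        print(dump_url(lang))
```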

    Chinese

    Process the dump in the following order:

    1. Extraction
    2. Convert Traditional Chinese to Simplified Chinese
    3. Keep only valid UTF-8 characters
    4. Keep only Chinese characters
    5. Segmentation

    Extraction

    The following script extracts plain text from zhwiki-20200101-pages-articles.xml.bz2.

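    # encoding:utf8
    # Extract plain text from a Wikipedia dump, one article per line

    import sys
    from gensim.corpora import WikiCorpus
    from tqdm import tqdm


    if __name__ == '__main__':
        if len(sys.argv) < 3:
            print('Usage: python3 wikipedia_extraction.py wikipedia.xml.bz2 wikipedia.txt')
            sys.exit(1)
        file_name = sys.argv[1:]

        # Passing an empty dict skips the vocabulary-building pass over the dump.
        # lemmatize=False targets gensim 3.x; gensim 4.0 no longer supports
        # the lemmatize argument, so omit it there.
        wiki = WikiCorpus(fname=file_name[0], lemmatize=False, dictionary=dict())
        with open(file_name[1], encoding='utf8', mode='w') as fo:
            for article in tqdm(wiki.get_texts()):
                # Tokens are written back to back; segmentation is a later step
                for sentence in article:
                    fo.write("%s" % sentence)
                fo.write("\n")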
    
    

    Converting

    Convert Traditional Chinese to Simplified Chinese with the following bash command.

    opencc -i wikipedia.zh.txt -o wikipedia.zhs.txt -c t2s.json
    

    t2s.json comes from https://github.com/BYVoid/OpenCC/blob/master/data/config/t2s.json ; its content is as follows.

    {
      "name": "Traditional Chinese to Simplified Chinese",
      "segmentation": {
        "type": "mmseg",
        "dict": {
          "type": "ocd",
          "file": "TSPhrases.ocd"
        }
      },
      "conversion_chain": [{
        "dict": {
          "type": "group",
          "dicts": [{
            "type": "ocd",
            "file": "TSPhrases.ocd"
          }, {
            "type": "ocd",
            "file": "TSCharacters.ocd"
          }]
        }
      }]
    }
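
    One way to read this config: text is matched against the phrase dictionary first, and anything left over falls back to the character dictionary. The sketch below is a toy illustration of that idea only, not OpenCC's actual implementation; the two small dictionaries and the convert function are hypothetical stand-ins for TSPhrases and TSCharacters.

```python
# Tiny, hypothetical stand-ins for the TSPhrases and TSCharacters dictionaries.
PHRASES = {"乾坤": "乾坤", "頭髮": "头发"}
CHARACTERS = {"乾": "干", "髮": "发", "發": "发", "頭": "头"}


def convert(text, phrases=PHRASES, characters=CHARACTERS):
    """Greedy longest-match conversion: multi-character phrases first,
    then single characters, passing unknown characters through unchanged."""
    max_len = max(len(key) for key in phrases)
    out = []
    i = 0
    while i < len(text):
        # Try the longest multi-character phrase starting at position i.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += size
                break
        else:
            # No phrase matched: convert one character, or keep it as-is.
            out.append(characters.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("乾坤乾燥"))
```

    Checking phrases before characters is what lets a word such as 乾坤 stay intact even though 乾 on its own maps to 干.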
    

    Keep UTF-8

    The following bash command drops any characters that cannot be converted to UTF-8.

    iconv -c -t UTF-8 -o wikipedia.zhs.utf8.txt wikipedia.zhs.txt
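
    If iconv is not available, a similar cleanup can be done in Python by decoding with errors="ignore". This is a minimal sketch that should behave similarly for this purpose, though it is not guaranteed to match iconv byte for byte; drop_invalid_utf8 and clean_file are names made up for this illustration.

```python
def drop_invalid_utf8(data):
    """Decode bytes as UTF-8, silently skipping byte sequences that are invalid."""
    return data.decode("utf-8", errors="ignore")


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, keeping only valid UTF-8."""
    # Reading in binary mode is safe here: the newline byte never occurs
    # inside a multi-byte UTF-8 sequence, so lines split cleanly.
    with open(src_path, "rb") as fin, open(dst_path, "w", encoding="utf8") as fout:
        for raw in fin:
            fout.write(drop_invalid_utf8(raw))
```

    For example, clean_file("wikipedia.zhs.txt", "wikipedia.zhs.utf8.txt") would play the same role as the command above.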
    

    Keep Chinese

    The following script keeps only Chinese characters (plus spaces and newlines) in the corpus.

    # encoding:utf8
    # Filter out non-Chinese characters

    import sys
    from tqdm import tqdm


    if __name__ == '__main__':
        if len(sys.argv) < 3:
            print("Usage: python3 %s wikipedia.zhs.utf8.txt wikipedia.zhs.utf8.chi.txt" % sys.argv[0])
            sys.exit(1)

        with open(sys.argv[1], encoding='utf8') as fin, \
                open(sys.argv[2], encoding='utf8', mode='w') as fout:
            for line in tqdm(fin):
                for char in line:
                    # Keep spaces and newlines so the line structure survives
                    if char == ' ' or char == '\n':
                        fout.write(char)
                    # U+4E00..U+9FA5 covers the common CJK ideographs
                    elif '\u4e00' <= char <= '\u9fa5':  # is a Chinese character
                        fout.write(char)
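
    The same filter can be written as a single regular expression, which makes it easy to try on a sample string. This is a minimal sketch using the same U+4E00..U+9FA5 range as the script; keep_chinese is a name made up for this illustration.

```python
import re

# Matches everything except U+4E00..U+9FA5, the space and the newline.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5 \n]")


def keep_chinese(text):
    """Remove every character that is not Chinese, a space or a newline."""
    return NON_CHINESE.sub("", text)


if __name__ == "__main__":
    print(keep_chinese("Python是一种语言 2020年\n"), end="")
```

    Latin letters, digits and punctuation all disappear, so English words embedded in an article leave no trace in the output.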
    
    

    Segmentation

    The following simple program segments the corpus into words with jieba.

    # encoding:utf8
    # Just a simple segmentation program

    import sys
    import jieba
    from tqdm import tqdm


    def sentences(fpath):
        """Yield each line of the file with surrounding whitespace stripped."""
        with open(fpath, encoding='utf8') as f:
            for line in f:
                yield line.strip()


    if __name__ == '__main__':
        if len(sys.argv) < 3:
            print("Usage: python3 SimSeg.py in-path out-path")
            sys.exit(1)

        jieba.initialize()
        with open(sys.argv[2], encoding='utf8', mode='w') as f:
            for sentence in tqdm(sentences(sys.argv[1])):
                # Accurate mode; drop tokens that are just a single space
                words = [w for w in jieba.cut(sentence, cut_all=False) if w != ' ']
                f.write("%s\n" % " ".join(words))
    
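
    The token clean-up step can be pulled out into a small helper and checked without loading jieba. This is a minimal sketch; join_tokens is a name made up for this illustration, and it drops every whitespace-only token, which is slightly broader than the single-space check in the script.

```python
def join_tokens(tokens):
    """Drop whitespace-only tokens and join the rest with single spaces."""
    return " ".join(tok for tok in tokens if tok.strip())


if __name__ == "__main__":
    # A hand-written token list standing in for the output of a segmenter.
    print(join_tokens(["我", "爱", " ", "北京"]))
```

    In the script it would be called as join_tokens(jieba.cut(sentence, cut_all=False)).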
    
  • Original post: https://www.cnblogs.com/fengyubo/p/12228432.html