  • Processing the Sohu news corpus

    Dataset source: http://www.sogou.com/labs/resource/cs.php

    Goal: extract all titles into one text file and all contents into another.

    Code:

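    # Python 2
    import chardet

    with open("news_sohusite_xml.dat", 'r') as h:
        x = h.readlines()
    # print(x[3])

    # Each record spans 6 lines: index 3 is the title line, index 4 the content line.
    topics = x[3::6]
    print(len(topics))
    contents = x[4::6]

    # chardet reports gb2312 here, but decoding with that codec fails.
    detected = chardet.detect(x[3])  # renamed from `type` to avoid shadowing the builtin
    print(detected)

    # a = topics[0].decode(detected["encoding"])

    # Strip "<contenttitle>" (14 bytes) and "</contenttitle>" plus newline (16 bytes).
    with open("sohusite_topics.txt", "a") as f_out:
        for i in topics:
            f_out.write(i[14:-16].decode("gb18030").encode("utf-8") + '\n')
            # f_out.write(i[14:-16].decode(detected["encoding"]).encode("utf-8") + '\n')

    # Strip "<content>" (9 bytes) and "</content>" plus newline (11 bytes).
    with open("sohusite_contents.txt", "a") as f_outt:
        for i in contents:
            f_outt.write(i[9:-11].decode("gb18030").encode("utf-8") + '\n')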
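
    For readers on Python 3, the same extraction might be sketched roughly as follows. This is not the original author's code: the helper name `extract_field`, the made-up in-memory sample record, and the use of a regular expression on the tags (instead of fixed slice offsets) are all illustrative assumptions. The sample follows the six-line record layout implied by the slices above.

```python
import re


def extract_field(lines, tag, encoding="gb18030"):
    """Yield the text inside <tag>...</tag> for every line carrying that tag.

    `lines` is an iterable of raw byte strings, one per line of the corpus.
    Hypothetical helper for illustration, not part of the original post.
    """
    name = tag.encode("ascii")
    pattern = re.compile(b"<" + name + b">(.*)</" + name + b">")
    for raw in lines:
        match = pattern.search(raw)
        if match:
            # Decode only the payload between the tags.
            yield match.group(1).decode(encoding)


# A made-up record in the six-line layout, encoded as GB18030 bytes.
sample = (
    "<doc>\n"
    "<url>http://example.com/1</url>\n"
    "<docno>0001</docno>\n"
    "<contenttitle>示例标题</contenttitle>\n"
    "<content>示例正文</content>\n"
    "</doc>\n"
).encode("gb18030").splitlines(keepends=True)

titles = list(extract_field(sample, "contenttitle"))
contents = list(extract_field(sample, "content"))
print(titles, contents)
```

    For the real file, one could pass a handle opened with mode "rb" as `lines` and write each yielded string to an output file opened with encoding="utf-8". Matching on the tag names avoids depending on exact byte offsets and line endings.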

    Decoding and re-encoding took some time to sort out: chardet.detect reports the text encoding as gb2312, but decoding with that codec raises an error:

    UnicodeDecodeError: 'gb2312' codec can't decode bytes: illegal multibyte sequence

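    Solution: decode with gb18030 instead, as in the code above. GB18030 is a superset of GB2312, so it also covers the characters that the gb2312 codec rejects.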
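
    The failure and the workaround can be reproduced in isolation. The sketch below is in Python 3 and uses a made-up sample string, not data from the corpus; it assumes the offending bytes in the corpus are characters such as "镕", which GB18030 covers but GB2312 does not.

```python
# A made-up title ending in "镕", which is outside the GB2312 character set.
text = "新闻标题镕"
raw = text.encode("gb18030")  # bytes as they might appear in the corpus

try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError as err:
    # The strict gb2312 codec rejects the byte pair it has no mapping for.
    gb2312_ok = False
    print("gb2312 failed:", err.reason)

# GB18030 is a superset of GB2312, so the same bytes decode cleanly.
decoded = raw.decode("gb18030")
utf8_bytes = decoded.encode("utf-8")
print(decoded)
```

    Since every valid GB2312 byte sequence decodes to the same characters under GB18030, switching codecs should not change text that gb2312 could already handle.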

  • Original article: https://www.cnblogs.com/helloworld0604/p/9492682.html