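  • Detecting a web page's encoding with chardet

    Environment: Win7_x64 + Python 3.4.3

    chardet must be downloaded and installed first. Download: https://pypi.python.org/packages/source/c/chardet/chardet-2.3.0.tar.gz

    Install: change into the extracted directory and run this in a command window: python setup.py install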
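
    Before running the script, it can help to confirm that Python can actually find the newly installed module. The following is only a sketch using the standard library; the helper name is_installed is made up for illustration and is not part of chardet.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    # is_installed is a hypothetical helper, not part of chardet.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True once the install step has succeeded.
    print("chardet installed:", is_installed("chardet"))
```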

    Now write a test script in Python (DetectURLCoding.py):

    # coding: utf-8
    '''python 3.x'''

    import sys
    import urllib.request
    import chardet

    # Write data (bytes) to the file fname
    def writeFile(fname, data):
        with open(fname, "wb") as f:
            f.write(data)

    def blog_detect(blogurl):
        '''Detect the encoding of the page at blogurl; return 1 on success, 0 on failure'''
        try:
            fp = urllib.request.urlopen(blogurl)
        except Exception as e:
            print(e)
            print('download exception-[%s]' % blogurl)
            return 0
        blog = fp.read()    # in Python 3.x, read() returns the HTML as bytes
        fp.close()
        #writeFile("t.html", blog)

        # chardet.detect returns a dict; 'encoding' holds the guessed encoding name
        codedetect = chardet.detect(blog)['encoding']
        print('%s <- %s' % (blogurl, codedetect))
        return 1

    if __name__ == '__main__':
        if len(sys.argv) == 1:
            print('''usage:
                python DetectURLCoding.py http://xxx.com''')
        else:
            v = blog_detect(sys.argv[1])
            print(v)
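
    The script only prints the detected name. A natural next step is to use it to decode the downloaded bytes. The sketch below is one way to do that under stated assumptions: decode_page, min_confidence and fallback are illustrative names and values, not part of chardet or of the script above. It relies only on the 'encoding' and 'confidence' keys of the dict that chardet.detect returns, and it accepts a stand-in detector so it can be tried without network access.

```python
def decode_page(raw, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a detected encoding, falling back when unsure.

    decode_page, min_confidence and fallback are hypothetical names and
    values chosen for this sketch; they are not part of chardet.
    """
    if detect is None:
        import chardet  # third-party; imported lazily so a stub can be passed in
        detect = chardet.detect

    result = detect(raw)  # a dict with 'encoding' and 'confidence' keys
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Detection can fail (encoding is None) or be only a weak guess.
    if not encoding or confidence < min_confidence:
        encoding = fallback

    try:
        text = raw.decode(encoding, errors="replace")
    except LookupError:
        # The detector named a codec this Python build does not know.
        encoding = fallback
        text = raw.decode(encoding, errors="replace")
    return text, encoding


if __name__ == "__main__":
    # A stub detector stands in for chardet.detect in this demo.
    sample = "编码检测".encode("gbk")
    stub = lambda data: {"encoding": "GB2312", "confidence": 0.9}
    print(decode_page(sample, detect=stub))
```

    In blog_detect, this would be called as decode_page(blog) after fp.read(), which uses chardet.detect by default.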

    Output (in this run the script was saved as de.py):

    D:\profile\Desktop>PYTHON de.py http://hovertree.com/
    http://hovertree.com/ <- utf-8
    1

    D:\profile\Desktop>PYTHON de.py http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml
    http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml <- utf-8
    1
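
    chardet guesses the encoding from the bytes alone, so it can be useful to cross-check the result against the charset the server declares in its Content-Type header, when there is one. The following is a rough sketch under that assumption; declared_charset and same_encoding are hypothetical helper names. The object returned by urllib.request.urlopen exposes its headers as an email.message.Message subclass, which is why a plain Message is used for the demo.

```python
import codecs
from email.message import Message


def declared_charset(headers):
    """Return the charset named in the Content-Type header, or None."""
    # headers is an email.message.Message-like object, such as the
    # .headers attribute of the response returned by urlopen.
    return headers.get_content_charset()


def same_encoding(declared, detected):
    """Compare two encoding names after resolving codec aliases."""
    if not declared or not detected:
        return False
    try:
        # codecs.lookup maps aliases such as 'UTF8' to one canonical name.
        return codecs.lookup(declared).name == codecs.lookup(detected).name
    except LookupError:
        # At least one of the names is not a codec Python knows.
        return False


if __name__ == "__main__":
    headers = Message()
    headers["Content-Type"] = "text/html; charset=UTF-8"
    declared = declared_charset(headers)
    print(declared, same_encoding(declared, "utf-8"))
```

    In blog_detect, declared_charset(fp.headers) could be compared with codedetect before the response is closed.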


  • Original article: https://www.cnblogs.com/roucheng/p/chardet.html