  • Learning Python Web Scraping

    Download Python 3.5.2
    https://www.python.org/downloads/release/python-352/

    Implementing a simple crawler in Python
    http://www.cnblogs.com/fnng/p/3576154.html

    Fix for the missing api-ms-win-crt-runtime-l1-1-0.dll error
    https://www.microsoft.com/zh-cn/download/confirmation.aspx?id=48145

    TypeError: cannot use a string pattern on a bytes-like object
    page.read() returns bytes, so decode it before matching it against a str pattern:
    imglist = re.findall(imgre,html.decode('GBK'))
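A minimal sketch of the failure and the fix. The sample bytes and the example.com URL are made up to stand in for what page.read() returns; they are not taken from a real page.

```python
import re

# Hypothetical sample standing in for the bytes returned by page.read()
raw = '<img src="http://example.com/a.jpg" pic_ext="jpeg">'.encode('gbk')
imgre = re.compile(r'src="(.+?\.jpg)" pic_ext')

error_message = None
try:
    re.findall(imgre, raw)              # str pattern applied to bytes
except TypeError as err:
    error_message = str(err)            # the message quoted above

# Decode the bytes to str first, then match
imglist = re.findall(imgre, raw.decode('gbk'))
print(error_message)
print(imglist)
```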

    TabError: inconsistent use of tabs and spaces in indentation
    Replace the tabs with spaces.
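Most editors can do this replacement directly. As a rough sketch, the same thing can be done in Python with str.expandtabs. The helper name tabs_to_spaces and the width of 4 are assumptions; the width has to match whatever the space-indented lines in the file already use.

```python
def tabs_to_spaces(source, width=4):
    """Return source with every tab expanded to the next multiple of width."""
    return source.expandtabs(width)

# Hypothetical snippet: one line indented with a tab, one with four spaces
mixed = "def f():\n\tx = 1\n    return x\n"
fixed = tabs_to_spaces(mixed)

# compile() raises if the indentation of the result is still broken
compile(fixed, "<example>", "exec")
print(repr(fixed))
```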

    UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 197: illegal multibyte sequence
    The page is not GBK-encoded; decode it as UTF-8 instead:
    html.decode('utf-8')
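When the encoding is not known in advance, one option is to try a short list of candidates in order. This is only a heuristic sketch: the helper name decode_html and the candidate order are assumptions, and some GBK byte sequences also happen to be valid UTF-8, so a wrong guess can succeed silently.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing bad bytes with U+FFFD
    return raw.decode(encodings[0], errors="replace"), encodings[0]

# Hypothetical sample: GBK bytes that are not valid UTF-8
text, used = decode_html("中文".encode("gbk"))
print(text, used)
```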

    The following code works with Python 3.5.2:

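    #coding=utf-8
    import urllib.request
    import re
    
    def getHtml(url):
        # Fetch the page and return its raw bytes
        page = urllib.request.urlopen(url)
        html = page.read()
        return html
    
    def getImg(html):
        # Escape the dot so that only a literal ".jpg" matches
        reg = r'src="(.+?\.jpg)" pic_ext'
        imgre = re.compile(reg)
        # html is bytes; decode it before applying the str pattern
        imglist = re.findall(imgre,html.decode('utf-8'))
        x = 0
        for imgurl in imglist:
            # Save the images as 0.jpg, 1.jpg, ... on drive D:
            urllib.request.urlretrieve(imgurl,'D://%s.jpg' % x)
            x+=1
        # Print the number of images downloaded
        print(x)
    
    html = getHtml("http://tieba.baidu.com/p/2460150866")
    
    getImg(html)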
    

    If the page declares the GBK character set, change the decode call accordingly:
    charset=gbk
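The declaration can also be read from the raw bytes before decoding. The sketch below searches the start of the page for a charset=... declaration; the helper name sniff_charset, the 2048-byte window, and the utf-8 default are assumptions, and the sample HTML is made up.

```python
import re

# Matches charset=gbk, charset="utf-8", etc. in raw (undecoded) HTML
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in the first 2048 bytes of raw, or default."""
    match = CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default

# Hypothetical page head declaring GBK
sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk"></head></html>'
encoding = sniff_charset(sample)
print(encoding)
```

The result can then be passed to the decode call, for example html.decode(encoding).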

    #coding=utf-8
    import urllib.request
    import re
    import datetime
    
    def getHtml(url):
        # Fetch the page and return its raw bytes
        page = urllib.request.urlopen(url)
        html = page.read()
        return html
    
    def getImg(html):
        # On this forum the image URLs are in the file="..." attribute
        reg = r'file="(.+?\.jpg)"'
        imgre = re.compile(reg)
        # The page declares charset=gbk, so decode it as GBK
        imglist = re.findall(imgre,html.decode('gbk'))
        x = 0
        for imgurl in imglist:
            # The target directory must already exist
            urllib.request.urlretrieve(imgurl,'D://06_Download//py//%s.jpg' % x)
            x+=1
        print("Total files downloaded:",x)
    
    
    # Time the whole run
    starttime= datetime.datetime.now()
    html = getHtml("http://www.cmfish.com/bbs/forum.php?mod=viewthread&tid=306167&extra=page%3D1")
    getImg(html)
    usetime= datetime.datetime.now()-starttime
    print('Time taken:',usetime)
    


  • Original article: https://www.cnblogs.com/sui84/p/6777018.html