  • Learning Python: crawler practice

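    While learning Python I wrote two simple crawlers. They do not use threads, yet the download speed on my local machine is decent. One flaw remains: some of the downloaded images cannot be displayed. The code is recorded here as a note: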

    # -*- coding:utf-8 -*-
    
    import re
    import urllib.request
    import os
    
    url = "http://www.58pic.com/yuanchuang/0/day-"
    
    def getHtml(url):
        # Fetch the page and decode the bytes as GBK; the connection is closed on exit
        with urllib.request.urlopen(url) as page:
            html = page.read().decode('gbk')
        return html
    
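    def getImg(html,num):
        # Matches every src="..." attribute, not only images, so some saved
        # files are not real pictures and will not open in an image viewer
        reg = r'src="(.*?)" '
        imgre = re.compile(reg)
        imglist = imgre.findall(html)
        # "G:collect" is relative to the current directory on drive G:
        filePath = r"G:collect/%d/" % num
        os.makedirs(filePath, exist_ok=True)   # do not fail if the folder already exists
        for x, imgurl in enumerate(imglist):
            # Download each match and save it as 0.jpg, 1.jpg, ...
            with urllib.request.urlopen(imgurl) as req:
                buf = req.read()
            with open(filePath+str(x)+".jpg",'wb') as f:
                f.write(buf)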
    
    for i in range(1,10):
        getUrl = url+"%d.html" % i
        print(getUrl)
        html = getHtml(getUrl)
        #print(html)
        getImg(html,i)   # getImg returns nothing, so there is no result to print

    The final result was shown in a screenshot here (image not preserved).
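
A likely cause of the images that cannot be displayed is that the pattern `src="(.*?)" ` matches every `src` attribute, including scripts and relative links, and each match is saved as a `.jpg`. The following is a minimal sketch of one way to filter the matches before downloading. The helper name `filter_image_urls`, the extension list, and the sample HTML (with `pic.example.com` as a placeholder host) are assumptions for illustration and are not part of the original code.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical extension list; adjust it to the files you actually want.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def filter_image_urls(html, base_url):
    """Return absolute http(s) URLs of src attributes that look like image files."""
    urls = []
    for src in re.findall(r'src="(.*?)"', html):
        absolute = urljoin(base_url, src)   # resolve relative and //host/... links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                        # skip data: URIs and similar
        if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
            urls.append(absolute)
    return urls

# Made-up sample page, used only to exercise the helper offline.
sample = (
    '<script src="/static/app.js" ></script>'
    '<img src="http://pic.example.com/a/1.jpg" alt="one">'
    '<img src="//pic.example.com/a/2.PNG" alt="two">'
    '<img src="thumb/3.gif" alt="three">'
    '<img src="data:image/png;base64,AAAA" alt="inline">'
)
image_urls = filter_image_urls(sample, "http://www.58pic.com/yuanchuang/0/day-1.html")
for u in image_urls:
    print(u)
```

In the first crawler, the list returned by such a helper could be used in place of `imglist`, so that only image links are downloaded.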

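    Building on the initial code above, here is an improved version of the crawler. When a link returns an abnormal status, the exception is caught and reported, and the program then continues with the next page. The code is as follows: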

    # -*- coding:utf-8 -*-
    
    import re
    import urllib.request
    import os
    
    url = "http://www.58pic.com/psd/"
    
    def getHtml(url):
        # Fetch the page and decode the bytes as GBK; the connection is closed on exit
        with urllib.request.urlopen(url) as page:
            html = page.read().decode('gbk')
        return html
    
    def getImg(html,num):
        # Capture the image URL and its alt text (used as the file name);
        # the dot before "jpg" is escaped so it matches a literal "."
        reg = r'src="(.+?\.jpg)" class="show-area-pic" id="show-area-pic" alt="(.*?)"'
        imgre = re.compile(reg)
        imglist = imgre.findall(html)
        print(imglist)
        filePath = r"F:Py/collect/%d/" % num
        # Download only when the folder for this page does not exist yet
        if not os.path.exists(filePath):
            os.makedirs(filePath)
            for imgUrl, title in imglist:
                with urllib.request.urlopen(imgUrl) as req:
                    buf = req.read()
                with open(filePath+title+".jpg",'wb') as f:
                    f.write(buf)
                
    
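    for i in range(22797263,22797666):
        getUrl = url+"%d.html" % i
        try:
            html = getHtml(getUrl)
            #print(html)
            getImg(html,i)
        except urllib.error.HTTPError as e:
            # HTTP errors (404, 500, ...) carry a status code;
            # urllib.error is loaded by "import urllib.request"
            print(e.code)
            print(e.reason)
        except urllib.error.URLError as e:
            # Other network failures have a reason but no status code
            print(e.reason)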
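
The catch-and-continue idea can also be isolated in a small helper. This is a minimal sketch under stated assumptions: the name `fetch_or_skip` and its `opener` parameter are hypothetical, added here so the behaviour can be tried without network access. `HTTPError` is a subclass of `URLError`, so it has to be caught first; only `HTTPError` has a `code` attribute.

```python
import io
import urllib.error
import urllib.request

def fetch_or_skip(page_url, opener=urllib.request.urlopen):
    """Return the raw page bytes, or None if the request fails.

    `opener` is a hypothetical injection point for testing; by default
    the real urllib.request.urlopen is used.
    """
    try:
        with opener(page_url) as page:
            return page.read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status such as 404.
        print("skip", page_url, "HTTP", e.code, e.reason)
    except urllib.error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print("skip", page_url, "network error:", e.reason)
    return None

# Stand-ins for urlopen that simulate the three outcomes offline.
def fake_ok(page_url):
    return io.BytesIO(b"<html>ok</html>")

def fake_404(page_url):
    raise urllib.error.HTTPError(page_url, 404, "Not Found", None, None)

def fake_down(page_url):
    raise urllib.error.URLError("connection refused")

ok_result = fetch_or_skip("http://www.58pic.com/psd/22797263.html", opener=fake_ok)
missing_result = fetch_or_skip("http://www.58pic.com/psd/22797264.html", opener=fake_404)
down_result = fetch_or_skip("http://www.58pic.com/psd/22797265.html", opener=fake_down)
```

A loop over page numbers can then test the return value for `None` and move on to the next page, which keeps the error handling out of the loop body.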
  • Original post: https://www.cnblogs.com/bieanju/p/5884781.html