  • Python crawler study notes, part 1

    Use Python to crawl the diameizi site and then download its images.

    The Python environment is 2.7.3.

    Code: https://gist.github.com/zjjott/5270366

    Author's discussion thread: http://tieba.baidu.com/p/2239765168?fr=itb_feed_jing#30880553662l

    Photo blog to scrape: http://diameizi.diandian.com/

    #coding=utf-8
    import os
    import urllib2
    from bs4 import BeautifulSoup

    # Walk the whole site tree in spider mode and keep wget's log (stderr).
    # A very simple, admittedly lazy, way to discover every page URL.
    os.system("wget -r --spider http://diameizi.diandian.com 2>|log.txt")
    filein = open('log.txt', 'r')
    fileout = open('dst', 'w+')  # stores the collected post URLs; not used afterwards
    filelist = list(filein)
    filein.close()
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}

    def getsite(url):
        req = urllib2.Request(url, None, header)
        site = urllib2.urlopen(req)
        return site.read()  # this request boilerplate works for almost any page

    if not os.path.isdir('pic'):
        os.makedirs('pic')  # the images are written into this directory
    dst = set()
    try:
        for p in filelist:
            if p.find('http://diameizi.diandian.com/post') > -1:
                p = p[p.find('http'):].strip()  # drop the trailing newline
                dst.add(p)
        i = 0
        for p in dst:
            # Resume after an interrupted run (disabled):
            #if i<191:
            #        i+=1
            #        continue
            pagesoup = BeautifulSoup(getsite(p), 'html.parser')
            pageimg = pagesoup.find_all('img', src=True)  # skip <img> tags without src
            for href in pageimg:
                print i, href['src']
                # The naming scheme is crude, but it works well enough.
                picpath = "pic/" + href['src'][-55:-13] + href['src'][-4:]
                pic = getsite(href['src'])
                picfile = open(picpath, 'wb')
                picfile.write(pic)
                i += 1
                picfile.close()
    finally:
        # Save the post URLs, one per line, even if a download failed.
        for p in dst:
            fileout.write(p + '\n')
        fileout.close()
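
The slice-based file name in the listing depends on the exact length of each image URL. One alternative, sketched here for Python 3 with only the standard library, is to take the last path component of the URL and replace unsafe characters. The helper name `pic_path_from_url` and the sample URL are hypothetical and are not part of the original script.

```python
import os
import re
from urllib.parse import urlparse


def pic_path_from_url(url, directory="pic"):
    """Build a local path for an image URL from its last path component."""
    # urlparse separates the path from any query string or fragment.
    name = os.path.basename(urlparse(url).path)
    # Replace anything that is not a letter, digit, dot, dash or underscore.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    if not name:
        name = "unnamed"  # URLs ending in "/" have no file name
    return os.path.join(directory, name)


if __name__ == "__main__":
    # Hypothetical URL, used only to show the shape of the result.
    print(pic_path_from_url("http://example.com/images/abc 123.jpg?size=big"))
```

Two different URLs can still end in the same file name, so a real run might also need a counter or a hash in the name.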

    The log.txt file produced above looks roughly like this:

    Spider mode enabled. Check if remote file exists.
    --2013-03-29 23:00:10--  http://diameizi.diandian.com/
    Resolving diameizi.diandian.com (diameizi.diandian.com)... 113.31.29.120, 113.31.29.121
    Connecting to diameizi.diandian.com (diameizi.diandian.com)|113.31.29.120|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 30502 (30K) [text/html]
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2013-03-29 23:00:11--  http://diameizi.diandian.com/
    Reusing existing connection to diameizi.diandian.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: `diameizi.diandian.com/index.html'
    
         0K .......... .......... .........                        94.6K=0.3s
    
    2013-03-29 23:00:12 (94.6 KB/s) - `diameizi.diandian.com/index.html' saved [30502]
    
    Loading robots.txt; please ignore errors.
    --2013-03-29 23:00:12--  http://diameizi.diandian.com/robots.txt
    Reusing existing connection to diameizi.diandian.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 209 [text/plain]
    Saving to: `diameizi.diandian.com/robots.txt'
    
         0K                                                       100% 20.8M=0s
    
    2013-03-29 23:00:12 (20.8 MB/s) - `diameizi.diandian.com/robots.txt' saved [209/209]
    
    Removing diameizi.diandian.com/robots.txt.
    Removing diameizi.diandian.com/index.html.
    
    Spider mode enabled. Check if remote file exists.
    --2013-03-29 23:00:12--  http://diameizi.diandian.com/rss
    Reusing existing connection to diameizi.diandian.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 0 [text/xml]
    Remote file exists but does not contain any link -- not retrieving.
    
    Removing diameizi.diandian.com/rss.
    unlink: No such file or directory
    
    Spider mode enabled. Check if remote file exists.
    --2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
    Reusing existing connection to diameizi.diandian.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 82303 (80K) [text/html]
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
    Reusing existing connection to diameizi.diandian.com:80.

    The script then searches this text file for the information it needs, namely the post URLs.
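
As a small illustration of that search, the sketch below pulls the post URLs out of wget log lines with a regular expression instead of `str.find`. It targets Python 3. The function name `extract_post_urls` and the `/post/example` sample lines are hypothetical, made up only to exercise the pattern.

```python
import re

# Matches URLs under /post on the site and stops at the first whitespace.
POST_URL = re.compile(r"http://diameizi\.diandian\.com/post\S*")


def extract_post_urls(lines):
    """Return the set of distinct post URLs found in an iterable of log lines."""
    urls = set()
    for line in lines:
        match = POST_URL.search(line)
        if match:
            urls.add(match.group(0))
    return urls


if __name__ == "__main__":
    sample = [
        "--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive\n",
        # Hypothetical post lines, in the same format as the log above.
        "--2013-03-29 23:00:13--  http://diameizi.diandian.com/post/example\n",
        "--2013-03-29 23:00:14--  http://diameizi.diandian.com/post/example\n",
    ]
    print(sorted(extract_post_urls(sample)))
```

Because `\S*` stops at whitespace, the trailing newline of each log line never ends up in the URL, and the set removes the duplicates that wget logs for each page.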

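    I have not yet managed to run the code above successfully, possibly because of my 2.7.3 setup.

    I suspected the example was written for Python 3.x, which would explain some differences. Note, however, that `urllib2` and the `print` statement exist only in Python 2, so the listing itself targets Python 2.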
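
For readers on Python 3, the request helper would need `urllib.request` in place of `urllib2`. The following is a rough sketch of such a port, not the original author's code. `build_request` is a hypothetical helper, added so the request can be inspected without touching the network.

```python
import urllib.request

HEADER = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) "
                  "Gecko/20100101 Firefox/8.0.1"
}


def build_request(url):
    """Create a request object carrying the browser-like User-Agent header."""
    return urllib.request.Request(url, None, HEADER)


def getsite(url):
    """Fetch a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as site:
        return site.read()


if __name__ == "__main__":
    # Only builds the request; nothing is sent over the network here.
    req = build_request("http://diameizi.diandian.com/")
    print(req.get_header("User-agent"))
```

In Python 3, `print` is also a function, so `print i, href['src']` in the listing would become `print(i, href['src'])`.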

  • Original post: https://www.cnblogs.com/spaceship9/p/2989821.html