  • An example of a web crawler written in Python

    It mainly relies on two libraries: urllib2 (Python 2's URL-fetching library) and BeautifulSoup.

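    The script parses each dream-interpretation page's HTML and extracts the query term (the page title) and its detailed explanation.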
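
    To illustrate the parsing step in isolation, here is a minimal Python 3 sketch that uses only the standard library's html.parser instead of BeautifulSoup. The class name DreamPageParser and the sample markup are made up for illustration; only the two selectors (h1 with class art_title, div with class dream_detail) come from the script in this post.

```python
from html.parser import HTMLParser


class DreamPageParser(HTMLParser):
    """Collects the text inside <h1 class="art_title"> and <div class="dream_detail">."""

    # (tag name, CSS class) -> name of the field the text is stored under
    TARGETS = {("h1", "art_title"): "title", ("div", "dream_detail"): "content"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "content": []}
        self._current = None  # field currently being collected, or None
        self._tag = None      # tag name that opened the current field
        self._depth = 0       # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Already inside a target element: only track nested same-named tags.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for (name, cls), field in self.TARGETS.items():
            if tag == name and cls in classes:
                self._current, self._tag, self._depth = field, tag, 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current].append(data)


def parse_content_page(html):
    """Return [title, content], or None if either element is missing."""
    parser = DreamPageParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.fields["title"]).strip()
    content = "".join(parser.fields["content"]).strip()
    if not title or not content:
        return None
    return [title, content]


# Hypothetical sample page, not copied from the real site.
sample = """
<html><body>
  <h1 class="art_title">Dreaming of rain</h1>
  <div class="dream_detail"><p>First paragraph.</p><div>Nested part.</div></div>
  <div class="footer">ignored</div>
</body></html>
"""
result = parse_content_page(sample)
print(result)
```

    Like the BeautifulSoup version, the sketch returns None when either element is missing, so the caller can simply skip that page.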

    # -*- coding: utf-8 -*-
    # Python 2 script; requires BeautifulSoup 3.
    import urllib2
    import time, random
    from BeautifulSoup import BeautifulSoup

    def fetchURL(str_url):
        # Send a browser-like User-Agent with the request.
        user_agent = ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko)')
        headers    = {'User-Agent': user_agent}

        content = None

        try:
            request = urllib2.Request(str_url, headers=headers)
            response = urllib2.urlopen(request)
            # The pages are decoded as GB2312.
            html = response.read().decode('gb2312')
            content = parse_content_page(html)
        except Exception:
            # Network or decoding failure: skip this page.
            content = None

        return content

    def parse_content_page(html):
        parsed_html = BeautifulSoup(html)
        try:
            # Query term and its explanation, located by CSS class.
            title   = parsed_html.body.find('h1', attrs={'class':'art_title'}).text
            content = parsed_html.body.find('div', attrs={'class':'dream_detail'}).text
        except AttributeError:
            # find() returned None: the page lacks one of the expected elements.
            return None

        return [title, content]


    if __name__ == '__main__':

        foutput = 'jiemeng.txt'
        with open(foutput, 'w') as fout:
            # Pages 1.htm through 9.htm
            for i in xrange(1, 10):
                request_url = 'http://tools.2345.com/zhgjm/%s.htm' % str(i)
                x = fetchURL(request_url)
                if x is not None:
                    # [3:-3] trims 3 bytes from each end of the UTF-8 encoded title
                    print >>fout, x[0].encode('utf8')[3:-3]
                    print >>fout, x[1].encode('utf8')

                # sleep for 2 to 12 seconds between two HTTP requests
                seconds = random.random()*10 + 2
                time.sleep(seconds)
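
    The script above targets Python 2, where urllib2 is available. As a rough sketch of how the fetch step might look in Python 3, the version below uses urllib.request from the standard library. The helper names build_request, decode_page, fetch_html and polite_delay are hypothetical, and replacing undecodable bytes instead of skipping the page is an assumption of this sketch, not something the original script does.

```python
import random
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) "
              "AppleWebKit/537.36 (KHTML, like Gecko)")


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def decode_page(raw, encoding="gb2312"):
    """Decode raw bytes; undecodable bytes are replaced instead of raising."""
    return raw.decode(encoding, errors="replace")


def fetch_html(url, timeout=10):
    """Download one page and return its text, or None on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return decode_page(resp.read())
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None


def polite_delay():
    """Random pause length in seconds, from 2 up to (not including) 12."""
    return random.random() * 10 + 2
```

    A crawl loop would call fetch_html(url) for each page, pass any non-None result to the parsing step, and then call time.sleep(polite_delay()) before the next request.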
  • Original article: https://www.cnblogs.com/naive/p/4306990.html