I've recently been learning Scrapy, a seriously powerful Python library, and want to write down some notes.
When I first read the official documentation I found it somewhat obscure, ambiguous in places, and short on complete intermediate-to-advanced examples; on top of that, version updates have changed some of the original methods. So, combining a blog post I found on Cnblogs with the official docs, I crawled out a Scrapy project of my own. The target: Douban Movie Top 250. Here is the source code.
First, create a new project in whatever directory you like:
scrapy startproject douban
Go into the douban folder and take a look at the directory structure; the result.txt file there is my output file.
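For reference, the layout scrapy startproject generates typically looks roughly like this (result.txt only appears once the pipeline below has run):

douban/
    scrapy.cfg                 # project configuration file
    result.txt                 # my output file (written by the pipeline below)
    douban/
        __init__.py
        items.py               # item definitions (edited next)
        pipelines.py           # item pipelines (edited later)
        settings.py            # project settings
        spiders/
            __init__.py
            douban_spider.py   # the spider we are about to write

Next, edit items.py under the douban folder to gather the returned fields into a single Item: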
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_name = scrapy.Field()
    movie_director = scrapy.Field()
    movie_editor = scrapy.Field()
    movie_roles = scrapy.Field()
    movie_style = scrapy.Field()
    movie_date = scrapy.Field()
    movie_long = scrapy.Field()
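As a quick aside, a scrapy.Item behaves much like a dict, which is why the spider and pipeline below can read and write fields with item['movie_name']. A minimal sketch (the value here is made up) you could try in a Python shell from the project root:

from douban.items import DoubanItem

item = DoubanItem()
item['movie_name'] = [u'Some movie title']  # fields hold whatever you assign; extract() returns lists
print item['movie_name']                    # read access works like a dict too
print dict(item)                            # and an Item converts to a plain dict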
Then start writing the spider itself: create douban_spider.py under the spiders folder:
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider  # newer versions import from scrapy.spiders
from scrapy.selector import HtmlXPathSelector
from douban.items import DoubanItem
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # set the default string encoding to utf-8


class DoubanSpider(BaseSpider):
    """docstring for DoubanSpider"""
    name = "douban"  # the spider's name, used by scrapy crawl
    allowed_domains = ["movie.douban.com"]  # domains the spider may crawl
    # the URLs crawling starts from: pages 1-10 of the list (yeshu is the page index)
    start_urls = ["http://movie.douban.com/top250" + "?start=" + str(yeshu * 25) + "&filter=&type=" for yeshu in range(0, 10)]

    # default callback for the start_urls responses
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        movie_link = hxs.xpath('//div[@class="hd"]/a/@href').extract()
        # (an earlier attempt: follow the "next" link instead of precomputing the URLs)
        # movie_next = hxs.xpath('//span[@class="next"]/a/@href').extract()
        # nextmo = movie_next[0]
        # if nextmo:
        #     nextmo = "http://movie.douban.com/top" + nextmo
        #     start_urls.append(nextmo)
        for link in movie_link:
            # request each movie's detail page, handled by the callback below
            yield scrapy.Request(link, callback=self.parse_item)

    # custom callback that parses each movie's detail page
    def parse_item(self, response):
        item_has = HtmlXPathSelector(response)
        movie_name = item_has.xpath('//h1/span/text()').extract()
        movie_director = item_has.xpath('//a[@rel="v:directedBy"]/text()').extract()
        movie_editor = item_has.xpath('//div[@id="info"]/span[2]/span[@class="attrs"]/a/text()').extract()
        movie_roles = item_has.xpath('//a[@rel="v:starring"]/text()').extract()
        movie_style = item_has.xpath('//span[@property="v:genre"]/text()').extract()
        movie_date = item_has.xpath('//span[@property="v:initialReleaseDate"]/text()').extract()
        movie_long = item_has.xpath('//span[@property="v:runtime"]/text()').extract()
        item = DoubanItem()
        item['movie_name'] = movie_name
        item['movie_director'] = movie_director
        item['movie_editor'] = movie_editor
        item['movie_roles'] = movie_roles
        item['movie_style'] = movie_style
        item['movie_date'] = movie_date
        item['movie_long'] = movie_long
        yield item
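A tip that saves a lot of trial and error with these XPath expressions: scrapy shell lets you test them against a live page before writing them into the spider. A sketch (in recent versions response.xpath works directly; with older ones, wrap the response in HtmlXPathSelector as above):

scrapy shell "http://movie.douban.com/top250"
>>> response.xpath('//div[@class="hd"]/a/@href').extract()  # should print the 25 detail-page links on page one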
Finally, edit pipelines.py under douban; it saves the scraped data to a file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
NUM = 1  # running counter used to number the movies in the output


class DoubanPipeline(object):

    def process_item(self, item, spider):
        movie_name = item['movie_name']
        movie_director = item['movie_director']
        # join multi-valued fields with a Chinese comma separator
        movie_editor = [line + '、' for line in item['movie_editor']]
        movie_roles = [line + '、' for line in item['movie_roles']]
        movie_style = [line + '、' for line in item['movie_style']]
        movie_date = [line + '、' for line in item['movie_date']]
        movie_long = item['movie_long']
        f = open("result.txt", "a")
        global NUM
        f.write(str(NUM))
        f.write(" 片名:")
        NUM += 1
        print "NAME:", movie_name
        f.writelines(movie_name)
        f.write(" 导演:")
        f.writelines(movie_director)
        f.write(" 编剧:")
        f.writelines(movie_editor)
        f.write(" 主角:")
        f.writelines(movie_roles)
        f.write(" 类型:")
        f.writelines(movie_style)
        f.write(" 上映时间:")
        f.writelines(movie_date)
        f.write(" 影片时长:")
        f.writelines(movie_long)
        f.write(" ")
        f.close()
        return item
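One thing worth knowing: opening and closing result.txt once per item works, but pipelines also have open_spider and close_spider hooks that run once per crawl, so the file handle and the counter can live on the pipeline instance instead of a global. A sketch of that variant (same output format, the field-writing part abbreviated):

class DoubanPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the file and reset the counter
        self.f = open("result.txt", "a")
        self.num = 1

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(str(self.num))
        self.f.write(" 片名:")
        self.f.writelines(item['movie_name'])
        # ... the remaining fields are written exactly as above ...
        self.num += 1
        return item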
Last of all, don't forget to edit ITEM_PIPELINES in settings.py under douban and point it at the pipeline we just wrote; it is commented out by default:
ITEM_PIPELINES = {
'douban.pipelines.DoubanPipeline': 300,
}
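The value 300 is the pipeline's order: when several pipelines are enabled, each item flows through them from the lowest number to the highest, and by convention the numbers are chosen in the 0-1000 range. So a second, hypothetical pipeline could be slotted in before ours like this:

ITEM_PIPELINES = {
    'douban.pipelines.CleanPipeline': 100,   # hypothetical pre-processing pipeline, runs first
    'douban.pipelines.DoubanPipeline': 300,  # our pipeline runs after it
}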
And that's it; the program is done. Now you can run it. Note that it must be run from the project root, i.e. the outer douban directory:
scrapy crawl douban
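By the way, if all you need is the raw items, you can skip the custom pipeline entirely: Scrapy's built-in feed export writes them out via the -o flag, for example:

scrapy crawl douban -o result.json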