  • Scraping dangdang.com with Scrapy

    Spring Festival is winding down, so it's time to get back into gear. This blog hasn't been updated in a while, and the crawler I wrote before the holiday is overdue to be dusted off.

    This time the target is dangdang.com: collecting information on all of the books it lists, using Scrapy to crawl and MongoDB to store the data. Let's get to it!

    Starting URL:

    start_urls = ['http://category.dangdang.com/cp01.00.00.00.00.00-shlist.html']

    Dangdang's first- and second-level book categories are displayed clearly on the page.

    OK, we have our entry point. Dangdang has no anti-scraping measures in place, so we can crawl it without tricks. If you ever need to crawl at scale, set a delay between requests so you don't put too much load on someone else's server:

    DOWNLOAD_DELAY = 5
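    Beyond a fixed DOWNLOAD_DELAY, Scrapy also ships an AutoThrottle extension that adapts the delay to the server's observed latency. A sketch of the relevant built-in settings (the values here are illustrative, not from this project):

```python
# settings.py -- politeness knobs; all of these are built-in Scrapy settings.
DOWNLOAD_DELAY = 5                     # base delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # cap parallel requests per domain

AUTOTHROTTLE_ENABLED = True            # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 5           # initial delay
AUTOTHROTTLE_MAX_DELAY = 60            # hard upper bound on the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests in flight per server
```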

    OK, straight to the code!

    items.py
    import scrapy


    class BookDangdangItem(scrapy.Item):
        # one MongoDB document per book will be built from these fields
        price = scrapy.Field()       # price
        type_tag = scrapy.Field()    # category breadcrumb
        name = scrapy.Field()        # book title
        image_url = scrapy.Field()   # cover image url from the listing page
        link = scrapy.Field()        # detail-page url
        star_level = scrapy.Field()  # rating, as a percentage
        pub_time = scrapy.Field()    # publication date
        publish = scrapy.Field()     # publisher
        brief = scrapy.Field()       # short description

        detail = scrapy.Field()      # book details (dict)
     spiders.py
    # -*- coding: utf-8 -*-
    import time

    import scrapy
    from scrapy.http.cookies import CookieJar

    from ..items import BookDangdangItem
    from ..settings import DEFAULT_REQUEST_HEADERS


    class DangdangSpider(scrapy.Spider):
        name = 'dangdang'
        allowed_domains = ['dangdang.com']
        start_urls = ['http://category.dangdang.com/cp01.00.00.00.00.00-shlist.html']
        dom = 'http://category.dangdang.com'  # used to build absolute urls
        cookie_dict = {}

        def start_requests(self):
            return [scrapy.Request(url=self.start_urls[0], callback=self.parse,
                                   headers=DEFAULT_REQUEST_HEADERS)]

        def parse(self, response):
            try:
                typestr = response.meta['type']
            except KeyError:
                typestr = ""
            # book sub-categories listed in the navigation block
            types = response.xpath('//*[@id="navigation"]/ul/li[1]/div[2]/div[1]/div/span/a')
            tyname = response.xpath('//*[@id="navigation"]/ul/li[1]/@dd_name').extract_first()
            if types and tyname == '分类':  # recurse until we run out of sub-categories
                for t in types:
                    url = self.dom + t.xpath('@href').extract_first()  # url of each sub-category
                    typestr_new = typestr + "{0}>>".format(t.xpath('text()').extract_first())  # nested category breadcrumb
                    self.logger.info("Found url: {0}, type: {1}".format(url, typestr_new))
                    yield scrapy.Request(url=url, callback=self.parse, meta={'type': typestr_new},
                                         headers=DEFAULT_REQUEST_HEADERS)
            else:
                page = int(response.xpath('//*[@id="go_sort"]/div/div[2]/span[1]/text()').extract_first())  # current page
                all_page = int(response.xpath('//*[@id="go_sort"]/div/div[2]/span[2]/text()')
                               .extract_first().lstrip('/'))  # total pages
                for x in range(page, all_page + 1):  # follow the pagination, last page included
                    yield scrapy.Request(url=self.dom + '/pg{0}-'.format(x) + response.url.split('/')[-1],
                                         callback=self.parse_page, headers=DEFAULT_REQUEST_HEADERS,
                                         meta={'type': typestr})

        def parse_page(self, response):
            """Parse the book entries on a listing page."""
            # The cookies are not required for this crawl; collected only for testing.
            cookie_jar = CookieJar()
            cookie_jar.extract_cookies(response, response.request)
            for k, v in cookie_jar._cookies.items():
                for i, j in v.items():
                    for m, n in j.items():
                        self.cookie_dict[m] = n.value

            for item in response.xpath('//*[@id="search_nature_rg"]/ul[@class="bigimg"]/li'):
                # one <li> per book
                book = BookDangdangItem()
                book['price'] = float(item.xpath('./p[@class="price"]/span[1]/text()').extract_first().lstrip('¥'))
                book['type_tag'] = response.meta['type']
                book['name'] = item.xpath('./p[@class="name"]/a/text()').extract_first().strip()
                book['image_url'] = item.xpath('./a/img/@src').extract_first()
                book['link'] = item.xpath('./p[1]/a/@href').extract_first()
                book['star_level'] = int(item.xpath('./p[@class="search_star_line"]/span/span/@style').extract_first()
                                         .split(' ')[-1].rstrip('%;'))
                try:
                    book['pub_time'] = item.xpath('.//p[@class="search_book_author"]/span[2]/text()') \
                        .extract_first().split('/')[-1]
                except Exception:
                    book['pub_time'] = time.strftime("%Y-%m-%d")  # fall back to today's date
                try:
                    book['publish'] = item.xpath(
                        './p[@class="search_book_author"]/span[3]/a/text()').extract_first().strip()
                except Exception:
                    book['publish'] = "暂无出版社信息"  # no publisher info
                try:
                    book['brief'] = item.xpath('./p[2]/text()').extract_first().strip()
                except Exception:
                    book['brief'] = "暂无书籍简述"  # no book summary
                yield scrapy.Request(callback=self.parse_book, cookies=self.cookie_dict,
                                     headers=DEFAULT_REQUEST_HEADERS, meta={'item': book}, url=book['link'])

        def parse_book(self, response):
            """Follow the book's url and parse the detail page."""
            book = response.meta['item']
            book['detail'] = {}
            info = response.xpath("//ul[@class='key clearfix']/li/text()").extract()
            for i in info:
                # each entry reads like "出版社:XXX"; split on the full-width colon
                k, _, v = i.partition(":")
                k = k.replace(" ", "")
                if v == '':
                    v = "暂无详情"  # no details available
                book['detail'][k] = v

            # The author introduction is marked up differently across dangdang's
            # sections; covering every variant is tedious, so fall back on failure.
            try:
                book['detail']['author_detail'] = response.xpath(
                    "//span[@id='authorIntroduction-show']/text()").extract_first().replace('\n', '')
            except Exception:
                book['detail']['author_detail'] = "暂无作者信息"  # no author info

            yield book
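    As a standalone illustration of the detail parsing in parse_book: each `<li>` on the detail page reads like `出版社:人民邮电出版社`, and `str.partition` on the full-width colon splits it into a key/value pair. A minimal sketch (the helper name is mine, not from the project):

```python
def parse_detail_line(line):
    """Split one detail-page entry like '出版社:人民邮电出版社' into (key, value).

    Dangdang uses the full-width colon ':' as the separator; an empty value
    falls back to the same placeholder the spider uses.
    """
    key, _, value = line.partition(":")
    key = key.replace(" ", "")          # drop spaces inside the key
    return key, (value if value else "暂无详情")

print(parse_detail_line("出版社:人民邮电出版社"))  # ('出版社', '人民邮电出版社')
print(parse_detail_line("开本:"))                  # ('开本', '暂无详情')
```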

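    Similarly, `star_level` is derived from a CSS width in the rating bar's inline style, e.g. `style="width: 90%;"`. The split/strip chain used in parse_page boils down to:

```python
def parse_star_level(style):
    """Turn an inline style like 'width: 90%;' into the integer percentage 90."""
    return int(style.split(' ')[-1].rstrip('%;'))

print(parse_star_level("width: 90%;"))   # 90
print(parse_star_level("width: 100%;"))  # 100
```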
    One note: the cookies are not needed for this crawl; I added them purely for testing. The request headers likewise need no heavy customization.
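    For reference, the standard-library `http.cookiejar.CookieJar` (which Scrapy's cookie handling builds on) is directly iterable, so a jar can be flattened into a plain dict without walking the private `_cookies` attribute as the spider does. A self-contained sketch with one hand-built cookie (the cookie name and value are made up):

```python
from http.cookiejar import Cookie, CookieJar

def cookiejar_to_dict(jar):
    """Flatten a CookieJar into a plain {name: value} dict by iterating it."""
    return {cookie.name: cookie.value for cookie in jar}

# Hand-build one cookie to demonstrate; in the spider, extract_cookies()
# fills the jar from a real response instead.
jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name="ddscreen", value="2", port=None, port_specified=False,
    domain="dangdang.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False,
))
print(cookiejar_to_dict(jar))  # {'ddscreen': '2'}
```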

    pipelines.py
    from pymongo import MongoClient
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()


    class DangDangSpiderPipeline(object):
        def __init__(self):
            # read host, port, database name and collection name from settings
            host = settings['MONGODB_HOST']
            port = settings['MONGODB_PORT']
            dbname = settings['MONGODB_DBNAME']
            col = settings['MONGODB_COL']

            # create a mongo client
            client = MongoClient(host=host, port=port)

            # select the database
            db = client[dbname]

            # select the collection
            self.col = db[col]

        def process_item(self, item, spider):
            data = dict(item)
            self.col.insert_one(data)
            return item
    settings.py
    USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0')

    ROBOTSTXT_OBEY = False

    DOWNLOAD_DELAY = 5

    COOKIES_ENABLED = False

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        "authority": "www.dangdang.com",
        "method": "GET",
        "path": "/",
        "scheme": "http",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": 'gzip, deflate, br',
        "accept-language": 'en-US,en;q=0.9',
        "referer": None,
        "upgrade-insecure-requests": 1,
        "User-Agent": ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0'),
    }    # optional; the defaults work fine

    ITEM_PIPELINES = {
        'dangdangspider.pipelines.DangDangSpiderPipeline': 300,
    }

    # MongoDB
    # loopback address of the local host
    MONGODB_HOST = '127.0.0.1'
    # port, 27017 by default
    MONGODB_PORT = 27017
    # database name
    MONGODB_DBNAME = 'dangdangs'
    # collection name
    MONGODB_COL = 'books'

    The custom request headers in settings.py can be omitted; the defaults work fine. I added them, including the second user-agent, purely for testing. Just for fun :-)

    The real purpose of this crawler was to figure out how cookies get attached in Scrapy, how to use them, and how many ways there are to do so. Scrapy's cookie handling gets little coverage in the official docs or in well-known blogs, yet it matters in practice: for simulated logins, and for sites like Taobao that only serve pages when a cookie is sent. For how to use cookies, see my other post.
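    For intuition about what passing `cookies=self.cookie_dict` to a request ultimately does: the dict gets serialized into a standard `Cookie` request header. An illustrative sketch of that serialization (mine, not Scrapy's internal code; the cookie names and values are made up):

```python
def cookie_header(cookies):
    """Render a {name: value} cookie dict in Cookie-header form: 'k1=v1; k2=v2'."""
    return "; ".join("{0}={1}".format(k, v) for k, v in cookies.items())

print(cookie_header({"ddscreen": "2", "permanent_id": "20190201abc"}))
# ddscreen=2; permanent_id=20190201abc
```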

  • Original post: https://www.cnblogs.com/pontoon/p/10360487.html