  • Scraping JD.com product information with Scrapy

    Software environment:

    gevent (1.2.2)
    greenlet (0.4.12)
    lxml (4.1.1)
    pymongo (3.6.0)
    pyOpenSSL (17.5.0)
    requests (2.18.4)
    Scrapy (1.5.0)
    SQLAlchemy (1.2.0)
    Twisted (17.9.0)
    wheel (0.30.0)

    1. Create the crawler project
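    The post doesn't show the command for this step; assuming the project is named MyScrapy (the name that appears later in ITEM_PIPELINES), it would be:

    scrapy startproject MyScrapy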

    2. Create the JD spider. Change into the project directory and run:

    scrapy genspider jd www.jd.com

    This creates a .py file named after your spider, jd.py, under the spiders directory; this file is where you write the crawler's request and response logic.
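    For Scrapy 1.5 the generated jd.py starts out roughly like this (the exact template may differ slightly; note that allowed_domains is www.jd.com, so requests to search.jd.com and item.jd.com may require adjusting that list):

    import scrapy


    class JdSpider(scrapy.Spider):
        name = 'jd'
        allowed_domains = ['www.jd.com']
        start_urls = ['http://www.jd.com/']

        def parse(self, response):
            pass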

    3. Configuring jd.py

    Analyze the URL pattern of the JD search page:
    https://search.jd.com/Search?
    Because the search keyword may be Chinese, it has to be URL-encoded.
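    A quick standalone check of what urlencode produces (the keyword here is only an illustration):

    from urllib.parse import urlencode

    # Chinese keywords must be percent-encoded before being placed in the URL
    print(urlencode({"keyword": "笔记本", "enc": "utf-8"}))
    # -> keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8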
            1. First write a start_requests method that sends the initial request and hands the result to the callback parse_index; the callback receives the response, whose type is <class 'scrapy.http.response.html.HtmlResponse'>.
    def start_requests(self):
        # requires: import scrapy; from urllib.parse import urlencode
        # Build a URL matching the search pattern, wrap it in a scrapy.Request,
        # and register parse_index as the callback; Scrapy passes the response
        # to that callback when the reply arrives
        url = 'https://search.jd.com/Search?'
        # The encoded query string is appended directly after the '?'
        url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
        yield scrapy.Request(url,
                             callback=self.parse_index,
                             )
            2. parse_index extracts every product detail-page URL from the response, then loops over those URLs sending a request for each one, with parse_detail as the callback that handles the result.
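    The post itself doesn't include the parse_index code; here is a minimal sketch of what that step describes, assuming the results page lists products as li.gl-item elements with the detail link under div.p-img (the selectors are an assumption, not from the source):

    def parse_index(self, response):
        # Collect every product detail URL on the search-results page
        # (the li.gl-item / div.p-img markup is an assumption)
        urls = response.xpath('//li[@class="gl-item"]//div[@class="p-img"]/a/@href').extract()
        for url in urls:
            # Detail links are protocol-relative (//item.jd.com/<sku>.html);
            # urljoin completes them before the request is sent
            yield scrapy.Request(response.urljoin(url), callback=self.parse_detail)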
    def parse_detail(self, response):
        """
        Callback for the requests sent by parse_index; receives the
        detail-page response and parses it.
        Requires: import json, requests; from MyScrapy.items import JdItem
        :param response:
        :return:
        """
        jd_url = response.url
        # The SKU is the filename part of the URL, e.g. .../3726834.html
        # (str.strip(".html") would be wrong here: it strips characters, not a suffix)
        sku = jd_url.split('/')[-1].split('.')[0]
        # The price is fetched via JSONP; its request URL can be found
        # among the script entries in the browser developer tools
        price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
        response_price = requests.get(price_url)
        # extraParam={"originid":"1"}  skuIds=J_3726834
        # The delivery info is also requested via JSONP, but how its parameters
        # are derived is unknown, so a fixed set is used here; if anyone knows
        # how to obtain them, advice is welcome
        express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
        response_express = requests.get(express_url)
        response_express = json.loads(response_express.text)['stock']['serviceInfo'].split('>')[1].split('<')[0]
        title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
        price = json.loads(response_price.text)[0]['p']
        delivery_method = response_express
        # Copy the data we need into an Item, ready for the storage stage
        # (the original used AmazonItem, but the class defined in items.py is JdItem)
        item = JdItem()
        item['title'] = title
        item['price'] = price
        item['delivery_method'] = delivery_method

        # Return the item; when the engine sees an Item it routes it to the pipelines
        return item
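    For reference: the price endpoint returns a small JSON array, which is why the code reads json.loads(...)[0]['p']; the payload looks roughly like this (field values here are illustrative, not real data):

    [{"id": "J_3726834", "p": "12999.00", "op": "13999.00", "m": "14999.00"}]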

    4. Configuring items.py

    import scrapy


    class JdItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        price = scrapy.Field()
        delivery_method = scrapy.Field()
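    Item instances behave like dicts, which is why parse_detail assigns with item['title'] = ...; the difference is that only declared fields are accepted:

    item = JdItem()
    item['title'] = 'MacBook Pro'   # declared field: fine
    # item['color'] = 'grey'        # undeclared field: raises KeyError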

    5. Configuring pipelines.py

    from pymongo import MongoClient


    class MongoPipeline(object):
        """
        Pipeline that saves items to MongoDB.
        """

        def __init__(self, db, collection, host, port, user, pwd):
            """
            Store the connection settings.
            :param db: database name
            :param collection: collection (table) name
            :param host: server IP address
            :param port: server port
            :param user: username for login
            :param pwd: password for login
            """
            self.db = db
            self.collection = collection
            self.host = host
            self.port = port
            self.user = user
            self.pwd = pwd

        @classmethod
        def from_crawler(cls, crawler):
            """
            Classmethod Scrapy calls to build the pipeline from the settings.
            :param crawler:
            :return:
            """
            db = crawler.settings.get('DB')
            collection = crawler.settings.get('COLLECTION')
            host = crawler.settings.get('HOST')
            port = crawler.settings.get('PORT')
            user = crawler.settings.get('USER')
            pwd = crawler.settings.get('PWD')

            return cls(db, collection, host, port, user, pwd)

        def open_spider(self, spider):
            """
            Runs once, when the spider starts.
            :param spider:
            :return:
            """
            # Connect to the database
            self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
                self.user,
                self.pwd,
                self.host,
                self.port
            ))

        def process_item(self, item, spider):
            """
            Store the item in the database.
            :param item:
            :param spider:
            :return:
            """
            # Convert the item into a plain dict
            d = dict(item)
            # Skip records that contain empty values
            if all(d.values()):
                # Save to MongoDB (insert_one; the deprecated save() did the same here)
                self.client[self.db][self.collection].insert_one(d)
            return item

            # To discard an item so no later pipeline sees it:
            # raise DropItem()  # from scrapy.exceptions import DropItem
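    One thing the pipeline above never does is close the Mongo connection; a small addition (not in the source) would be a close_spider hook:

        def close_spider(self, spider):
            # Runs once when the spider finishes; release the connection
            self.client.close()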

    6. Settings file (settings.py)

    # database server
    DB = "jd"
    COLLECTION = "goods"
    HOST = "127.0.0.1"
    PORT = 27017
    USER = "root"
    PWD = "123"

    ITEM_PIPELINES = {
        'MyScrapy.pipelines.MongoPipeline': 300,
    }
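    With everything in place, the spider can be run from the project directory. start_requests reads self.keyword, and Scrapy's standard -a option sets spider attributes, so (assuming the spider has no custom __init__):

    scrapy crawl jd -a keyword=iphone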

  • Original article: https://www.cnblogs.com/eric_yi/p/8343721.html