  • Python_Crawler_Scrapy06

    Scrapy Doc: https://doc.scrapy.org/en/latest/index.html

    How to use Scrapy Item: https://blog.michaelyin.info/scrapy-tutorial-9-how-use-scrapy-item/

    • how to define Scrapy item,
    • how to use Scrapy item,
    • how to create a custom Item Pipeline to save the data of Item into DB.

    Spiders

    Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).

    For spiders, the scraping cycle goes through something like this (a sketch of a complete spider follows the list):

    1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

      The first requests to perform are obtained by calling the start_requests() method, which (by default) generates Requests for the URLs specified in start_urls, with the parse method as the callback function for those Requests.

    2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

    3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

    4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
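
    To make the cycle concrete, here is a minimal sketch of a complete spider. It targets quotes.toscrape.com, the demo site used by the official Scrapy tutorial; the spider name and CSS selectors are illustrative assumptions, not part of this post:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'  # unique name Scrapy uses to locate the spider
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # steps 2-3: parse the page with Selectors and yield dicts of data
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }
            # step 1 again: follow the pagination link; the new Request
            # reuses parse as its callback
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)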

    scrapy.Spider (class scrapy.spiders.Spider)

    This is the simplest spider, and the one from which every other spider must inherit.  It just provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider’s method parse for each of the resulting responses.

    name: A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique.

    allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl.

    start_urls: A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

    start_requests(): This method must return an iterable with the first Requests to crawl for this spider. Scrapy calls it only once, so it is safe to implement start_requests() as a generator. For example, if you need to start by logging in using a POST request, you could do:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass
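
    Since start_requests() is called only once, the same login flow can equally be written as a generator instead of returning a list; a sketch of the equivalent method:

        def start_requests(self):
            # yield instead of building a list; Scrapy consumes the iterable either way
            yield scrapy.FormRequest("http://www.example.com/login",
                                     formdata={'user': 'john', 'pass': 'secret'},
                                     callback=self.logged_in)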

    parse(response): This is the default callback used by Scrapy to process downloaded responses, when their requests don’t specify a callback.
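
    For example, a Request created without an explicit callback is handled by parse() (the URL here is a placeholder):

    yield scrapy.Request('http://www.example.com/next')  # no callback given, so parse() processes the response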

    Scrapy Item

    Every Scrapy project has an items definition file called items.py, so in this project you need to edit tutorial/items.py:

    import scrapy

    class TutorialItem(scrapy.Item):
        # define the fields for your item here, like:
        # name = scrapy.Field()
        movie_name = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
        quote = scrapy.Field()

    Now let's try this item in the Scrapy shell; start it with the scrapy shell command:

    scrapy shell https://movie.douban.com/top250
    
    
    >>> from tutorial.items import TutorialItem
    >>> item = TutorialItem()
    >>> item["movie_name"] = "test"
    >>> item["movie_name"]
    'test'
    
    >>> item["wrong_field"] = "test"
    KeyError: 'TutorialItem does not support field: wrong_field'
    
    >>> 'movie_name' in item  # is name in the item?
    True
    
    >>> 'quote' in item  # quote is a declared field, but has it been set?
    False
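
    The third topic listed at the top, saving the data of the Item into a DB through a custom Item Pipeline, could look like the sketch below. It writes each TutorialItem into a local SQLite database; the database file name, table layout, and pipeline class name are assumptions for illustration:

    import sqlite3

    class SQLitePipeline:
        def open_spider(self, spider):
            # assumed database file; created on first run
            self.conn = sqlite3.connect('movies.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS movies '
                '(movie_name TEXT, link TEXT, "desc" TEXT, quote TEXT)'  # "desc" quoted: DESC is a SQL keyword
            )

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            self.conn.execute(
                'INSERT INTO movies VALUES (?, ?, ?, ?)',
                (item.get('movie_name'), item.get('link'),
                 item.get('desc'), item.get('quote')),
            )
            return item  # hand the item on to any later pipeline

    To activate it, the pipeline has to be registered in the project's settings.py, e.g. ITEM_PIPELINES = {'tutorial.pipelines.SQLitePipeline': 300} (the module path depends on your project name).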