  • Scrapy: handling POST request parameters and log levels

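    1. Scrapy log levels

      - When you run a spider with scrapy crawl spiderFileName, the output printed to the terminal is Scrapy's log.

      - Log levels:

            ERROR : general errors
    
            WARNING : warnings
    
            INFO : general information
    
            DEBUG : debugging information

      - Controlling log output:

        In the settings.py configuration file, add

                        LOG_LEVEL = '<level name>' to show only messages at that level or more severe.

                        LOG_FILE = 'log.txt' to write the log to the specified file.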
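
    Scrapy logs through Python's standard logging module, which also defines a CRITICAL level above ERROR. The sketch below uses only the standard library (not Scrapy itself) to illustrate how a level threshold such as LOG_LEVEL filters messages. The logger name, the ListHandler class, and the emitted_at helper are hypothetical names chosen for this illustration.

```python
import logging

# Level names in increasing order of severity, as defined by Python's logging module.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


class ListHandler(logging.Handler):
    """Collects log records in a list so we can inspect what got through."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(f"{record.levelname}: {record.getMessage()}")


def emitted_at(threshold):
    """Log one message per level and return the ones that pass the threshold."""
    logger = logging.getLogger(f"demo.{threshold}")  # hypothetical logger name
    logger.propagate = False  # keep the demo output away from the root logger
    logger.setLevel(getattr(logging, threshold))
    handler = ListHandler()
    logger.addHandler(handler)
    for name in LEVELS:
        logger.log(getattr(logging, name), "sample %s message", name.lower())
    logger.removeHandler(handler)
    return handler.lines


if __name__ == "__main__":
    # With a WARNING threshold, DEBUG and INFO messages are dropped.
    for line in emitted_at("WARNING"):
        print(line)
```

    In a Scrapy project the same effect would come from setting LOG_LEVEL = 'WARNING' in settings.py; the helper here only mimics that filtering.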

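    2. Passing data between requests

      - Sometimes the data we want to scrape is not all on one page. For example, on a movie site the title and rating are on the first-level (listing) page, while the remaining details are on each movie's second-level (detail) page. In that case we need to pass data along with the request.

    Handling parameters for a POST request:

    Create the project:

      

    Code: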

    import scrapy
    
    
    class PostSpider(scrapy.Spider):
        name = 'post'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://fanyi.baidu.com/sug']
    
        def start_requests(self):
            data = {
                'kw':'dog'
            }
            for url in self.start_urls:
                yield scrapy.FormRequest(url=url,formdata=data,callback=self.parse)
    
        def parse(self, response):
            print(response.text)
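
    FormRequest sends the formdata dict as a URL-encoded form body in a POST request. As a rough illustration of what that body looks like, the sketch below encodes the same dict with the standard library; it approximates the encoding and does not call Scrapy. The encode_form helper is a hypothetical name.

```python
from urllib.parse import parse_qs, urlencode

# Same form data as in the spider above.
data = {"kw": "dog"}


def encode_form(formdata):
    """URL-encode form fields into the bytes of a form POST body."""
    return urlencode(formdata).encode("utf-8")


body = encode_form(data)

if __name__ == "__main__":
    print(body)
    # Decoding the body gives back the original field (values come back as lists).
    print(parse_qs(body.decode("utf-8")))
```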
    

    settings.py

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    

    Inspect the returned data: 

     Example 2:

    # -*- coding: utf-8 -*-
    import scrapy
    from moviePro.items import MovieproItem
    
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567tv.tv/frim/index1.html']
        # Parse the data on the detail page
        def parse_detail(self,response):
            # response.meta is the meta dict that was attached to the request
            item = response.meta['item']
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            item['actor'] = actor
    
            yield item
    
        def parse(self, response):
            li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
            for li in li_list:
                item = MovieproItem()
                name = li.xpath('./div/a/@title').extract_first()
                detail_url = 'https://www.4567tv.tv'+li.xpath('./div/a/@href').extract_first()
                item['name'] = name
                # meta: passes data along with the request; the dict is available as response.meta in the callback
                yield scrapy.Request(url=detail_url,callback=self.parse_detail,meta={'item':item})
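
    The spider above builds detail_url by string concatenation, which raises a TypeError if extract_first() returns None and only works when the href is an absolute path. A more defensive way to resolve the link is sketched below with the standard library; Scrapy responses offer response.urljoin for the same purpose. The build_detail_url helper and the example href values are hypothetical, since the actual hrefs on the site are not shown in the original.

```python
from urllib.parse import urljoin

# The listing page from start_urls above.
base_url = "https://www.4567tv.tv/frim/index1.html"


def build_detail_url(base, href):
    """Resolve href against base; return None when the link is missing."""
    if not href:
        return None
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical href values, for illustration only.
    print(build_detail_url(base_url, "/movie/index123.html"))
    print(build_detail_url(base_url, None))
```

    In the spider, a None result would be a signal to skip that list entry instead of yielding a request for it.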
    
    settings.py
    LOG_LEVEL = "ERROR"
    LOG_FILE = './log.txt'    # write the log to this file
    

     items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        actor = scrapy.Field()
    
  • Original source: https://www.cnblogs.com/wqzn/p/10471321.html