
Web Crawling: Scrapy Log Levels and Passing Data Between Requests

Log levels

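Log levels (types):
    ERROR: errors
    WARNING: warnings
    INFO: general information
    DEBUG: debugging information (the default)
To show only log messages at or above a given level:
    settings: LOG_LEVEL = 'ERROR'
To write log messages to a file instead of showing them in the terminal:
    settings: LOG_FILE = 'log.txt'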

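Passing data between requests: needed when the values to scrape are not all on the same page.
    Requirement: scrape the movie detail data from the id97 movie site (name, genre, director, language, runtime)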

How to make the terminal show only error messages

Configure this in settings.py

# Only log messages at this level or above
LOG_LEVEL = 'ERROR'
# Write the log to a file (it then no longer appears in the terminal)
LOG_FILE = 'log.txt'
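
Scrapy logs through Python's standard logging module, so LOG_LEVEL acts as a minimum threshold: messages below it are dropped. The sketch below shows that filtering with the standard library alone. It does not run Scrapy, and the logger name and messages are made up for illustration.

```python
import io
import logging


def capture_logs(level_name):
    """Emit one message per level and return the lines that pass the threshold."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level_name}")  # hypothetical logger name
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep the demo output away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)

    # Made-up messages, one per level listed in this article.
    logger.debug("request scheduled")
    logger.info("spider opened")
    logger.warning("retrying request")
    logger.error("download failed")

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    print(capture_logs("DEBUG"))  # all four messages pass
    print(capture_logs("ERROR"))  # only the ERROR message passes
```

With the threshold at DEBUG every message is kept, while at ERROR only the error line remains, which is the effect LOG_LEVEL = 'ERROR' has on a crawl's output.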

Passing data between requests

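Passing data between requests is needed when the values to scrape are not all on the same page. The example here is the id97 movie site.

The goal is to scrape the detail data of each movie on the site (name, genre, director, language, runtime).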

Create the moviePro project

scrapy startproject moviePro

cd moviePro

scrapy genspider movie www.id97.com

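The movie name and genre are on one page (the list page).

The remaining movie details are on a different page (the detail page).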
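
Because one item is filled in across two pages, the half-filled item has to travel with the second request. The sketch below simulates that hand-off without Scrapy, as a rough illustration only. FakeRequest, FakeResponse, PAGES, run and the sample values are all hypothetical stand-ins, not Scrapy APIs or real site data.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: URL, callback, and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; meta is carried over from the request."""
    url: str
    data: dict
    meta: dict = field(default_factory=dict)


# Made-up page contents keyed by URL, so no network access is needed.
PAGES = {
    "list": {"name": "Example Movie", "kind": "Drama", "detail_url": "detail/1"},
    "detail/1": {"director": "Jane Doe", "language": "English", "longTime": "120 min"},
}


def parse_list(response):
    # First page: store what is available, then pass the item along in meta.
    item = {"name": response.data["name"], "kind": response.data["kind"]}
    yield FakeRequest(response.data["detail_url"], parse_detail, meta={"item": item})


def parse_detail(response):
    # Second page: recover the same item from meta and complete it.
    item = response.meta["item"]
    item.update(response.data)
    yield item


def run(start_url, callback):
    """Tiny scheduler: follow yielded requests and collect yielded items."""
    items = []
    queue = [FakeRequest(start_url, callback)]
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    print(run("list", parse_list))
```

The key point is that the second callback never creates its own item: it only completes the one it receives, so each yielded item holds fields from both pages.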

Spider file: movie.py

import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    # Left commented out: start_urls below points at a different domain
    # allowed_domains = ['www.id97.com']
    start_urls = ['https://www.55xia.com/movie']

    # Parses the second-level (detail) page
    def parseBySecondPage(self, response):
        # XPaths copied straight from the browser's developer tools. Browsers often
        # insert <tbody> themselves, so drop it from the path if nothing matches.
        director = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[1]/span/text()').extract_first()
        language = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[6]/td[2]/text()').extract_first()
        longTime = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[8]/td[2]/text()').extract_first()
        # Retrieve the item passed through the Request's meta argument (response.meta)
        item = response.meta['item']
        item['director'] = director
        item['language'] = language
        item['longTime'] = longTime
        # Hand the completed item to the pipeline
        print('submitting item to the pipeline')
        yield item

    def parse(self, response):
        # Goal: scrape movie details from the id97 site (name, genre, director, language, runtime)
        div_list = response.xpath('/html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            # extract_first() returns the first match, or None if there is none
            name = div.xpath(".//div[@class='meta']/h1/a/text()").extract_first()
            kind = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            # Join the list of text fragments into a single string
            kind = " ".join(kind)
            url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
            # The href is protocol-relative, e.g. //www.55xia.com/movie/638284.html
            url = 'https:' + url
            # Create the item object
            item = MovieproItem()
            item['name'] = name
            item['kind'] = kind
            item['url'] = url
            print('item created')

            # The remaining fields live on the detail page, so request that URL
            # and parse the response in parseBySecondPage.
            # To let that callback finish filling in this item, pass it along in meta.
            # meta must be a dict, so the item is wrapped in one.
            yield scrapy.Request(url=url, callback=self.parseBySecondPage, meta={'item': item})
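
The href on the list page is protocol-relative (it starts with //), so it has to become an absolute URL before it can be requested. As a small sketch of how that resolution behaves, the standard library's urljoin can be tried on its own. The second and third href values are hypothetical variants added for comparison.

```python
from urllib.parse import urljoin

# URL of the list page, taken from start_urls in the spider.
page_url = "https://www.55xia.com/movie"

# Protocol-relative href, as in the spider's comment: the scheme is inherited.
print(urljoin(page_url, "//www.55xia.com/movie/638284.html"))

# Hypothetical variants: a root-relative href and an already absolute one.
print(urljoin(page_url, "/movie/638284.html"))
print(urljoin(page_url, "https://www.55xia.com/movie/638284.html"))
```

All three calls resolve to the same absolute URL. Inside a spider, Scrapy's response.urljoin(href) offers the same resolution relative to the current page, which avoids hard-coding the scheme.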


items.py

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    kind = scrapy.Field()
    director = scrapy.Field()
    language = scrapy.Field()
    longTime = scrapy.Field()
    url = scrapy.Field()

Pipeline file: pipelines.py

class MovieproPipeline(object):
    fp = None

    def open_spider(self, spider):
        self.fp = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        print('------process_item-------')
        detail = item['name'] + ':' + item['kind'] + ':' + item['director'] + ':' + item['language'] + ':' + item[
            'longTime'] + '\n\n\n'
        self.fp.write(detail)
        return item

    def close_spider(self, spider):
        self.fp.close()
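
Scrapy only calls a pipeline that is enabled in settings.py, a step this walkthrough does not show. A minimal sketch of that entry follows. The dotted path is assumed from the project and class names used in this article, and 300 is just a conventional priority value.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed path: <project package>.pipelines.<pipeline class>
    "moviePro.pipelines.MovieproPipeline": 300,
}
```

When several pipelines are enabled, the ones with lower numbers run first.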


Original article: https://www.cnblogs.com/foremostxl/p/10093504.html