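135 Using selenium with the scrapy framework to scrape data from dynamic web pages, CrawlSpider

Main content: web scraping, day 7

1 Using scrapy + selenium to scrape data from dynamic web pages: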

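'''
    Workflow:
        1. Instantiate a browser object in the spider file.
        2. Implement the closed method on the spider class and quit the browser object in that method.
        3. In the downloader middleware's process_response:
            a: Get the browser object that was instantiated in the spider file.
            b: Perform the automated browser actions.
            c: Instantiate a new response object and load the page source obtained by the browser into it.
            d: Return this new response object.
'''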

2 CrawlSpider: well suited to crawling many pages of a website in bulk. Compared with the Spider class, CrawlSpider relies mainly on rules to extract links.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Well suited to crawling many pages of a website in bulk. Compared with the
# Spider class, CrawlSpider relies mainly on rules to extract links.
class PicCrawlSpider(CrawlSpider):
    name = 'crawl'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']
    # Pagination links: /pic/page/<number>?s...
    link = LinkExtractor(allow=r'/pic/page/\d+\?s')
    print(link)
    # Link back to the first page of the section: URLs ending in /pic/
    link1 = LinkExtractor(allow=r'/pic/$')
    print(link1)
    # Link extractor: with follow=False, it only extracts the matching links
    # from the pages that the start URLs point to.
    rules = (
        # Rule object: sends the page behind each extracted link to the callback,
        # which parses the page source as required.
        # follow=True: keep applying the link extractors to the pages reached
        # through the links that were already extracted.
        Rule(link, callback='parse_item', follow=False),
        Rule(link1, callback='parse_item', follow=True),
    )
    print(rules)

    def parse_item(self, response):
        # print(123)
        # i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        print(response)

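Execution flow of the code:

1) Running scrapy crawl spidername starts the spider. Scrapy builds a Request for each entry in start_urls and sends it, then CrawlSpider's built-in parsing logic handles the response. During this step the rules extract the matching links from the HTML (or XML) text, and each extracted link produces a new Request. The cycle repeats until the returned pages contain no more matching links or the scheduler runs out of Request objects, at which point the program stops.

2) If a rule does not specify a callback, the pages it matches are not handed to any parsing function of your own; they are only used to extract further links (follow defaults to True in that case). If a callback is specified, that custom function parses the response. Do not name the callback parse, because CrawlSpider uses the parse method itself to apply the rules.

3) If the start URLs need to be parsed differently, you can override another CrawlSpider method, parse_start_url(self, response), to parse the Response returned for the start URLs. This is optional.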

Original article: https://www.cnblogs.com/gyh412724/p/10274437.html