    1. Basic Scrapy Usage

    1.1 Environment setup:

    • Linux and macOS:
      • pip install scrapy
    • Windows:
      • pip install wheel
      • Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
      • Install Twisted: pip install Twisted‑17.1.0‑cp36‑cp36m‑win_amd64.whl (choose the wheel that matches your Python version and architecture)
      • pip install pywin32
      • pip install scrapy
        To verify: run the scrapy command in a terminal; if no error is reported, the installation succeeded.

    1.2 Scrapy workflow:

    • Create a project:

      • scrapy startproject ProName
    • Enter the project directory:

      • cd ProName
    • Create a spider file:

      • scrapy genspider spiderName www.xxx.com
    • Write the scraping and parsing code

    • Run the project:

      • scrapy crawl spiderName
    • Anatomy of a spider file:

      import scrapy

      class QiushiSpider(scrapy.Spider):
          name = 'qiubai'  # spider name: the unique identifier used by `scrapy crawl`
          # Domains the spider is allowed to crawl: URLs outside them yield no
          # data, so this is usually commented out. Note these should be bare
          # domains, not full URLs.
          # allowed_domains = ['www.qiushibaike.com']
          # URLs the crawl starts from
          start_urls = ['https://www.qiushibaike.com/']

          # Callback invoked once each start URL has been fetched; its `response`
          # parameter is the response object for that request. The return value
          # must be an iterable (of items/requests) or None.
          def parse(self, response):
              print(response.text)  # response content as a str
              print(response.body)  # response content as bytes
      
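In the parse() callback above, response.text and response.body carry the same payload as a str and as bytes, respectively. A minimal standard-library sketch of that relationship (the HTML literal is invented for illustration, not real scraped data):

```python
# response.body holds the raw bytes of the HTTP response;
# response.text is the same payload decoded to str.
# The HTML snippet here is a made-up stand-in, not real scraped data.
body = '<html><body>段子</body></html>'.encode('utf-8')  # like response.body
text = body.decode('utf-8')                              # like response.text

print(type(body).__name__)  # bytes
print(type(text).__name__)  # str
```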
    • Changes to the settings.py config file:

      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      # line 16 of settings.py: set the request User-Agent header
      USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
      # line 19 of settings.py
      # Obey robots.txt rules
      # do not obey the robots protocol
      ROBOTSTXT_OBEY = False
      
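Beyond the two changes above, a couple of other settings in the same settings.py are often adjusted. A sketch with illustrative values (LOG_LEVEL and CONCURRENT_REQUESTS are standard Scrapy settings, but these particular values are assumptions, not from the original post):

```python
# settings.py (fragment) -- illustrative values, adjust to taste
LOG_LEVEL = 'ERROR'        # show only errors instead of the full crawl log
CONCURRENT_REQUESTS = 16   # Scrapy's default; lower it to be gentler on a site
```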
    • XPath-based data extraction in Scrapy:

      import scrapy

      class QiushiSpider(scrapy.Spider):
          name = 'qiuShi'
          start_urls = ['https://www.qiushibaike.com/text']

          def parse(self, response):
              # xpath() is a method of the response object; XPath expressions
              # can be applied to it directly
              div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
              print(type(div_list))
              for div in div_list:
                  # xpath() returns a list of Selector objects; the parsed
                  # content is wrapped inside each Selector, so call extract()
                  # to pull it out
                  name = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
                  content = div.xpath('./a//span/text()').extract()
                  # print the scraped data
                  print(name, content)
      
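Running the spider above requires Scrapy itself, but the pattern it relies on, select a list of nodes, then extract text from each, can be sketched with the standard library's xml.etree.ElementTree. The HTML snippet and tag layout below are invented for illustration, and ElementTree supports only a subset of XPath (element text is read via .text rather than text()):

```python
import xml.etree.ElementTree as ET

# Invented, simplified page layout mirroring the selectors used above.
html = """
<div id="content">
  <div><div>
    <div>
      <div><a/><a><h2> author-1 </h2></a></div>
      <a><span>joke text 1</span></a>
    </div>
  </div></div>
</div>
"""

root = ET.fromstring(html)
names, contents = [], []
# analogous to response.xpath('//*[@id="content"]/div/div[2]/div')
for div in root.findall('./div/div/div'):
    # like div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
    name = div.find('./div/a[2]/h2').text.strip()
    # like div.xpath('./a/span/text()').extract()
    content = [span.text for span in div.findall('./a/span')]
    names.append(name)
    contents.append(content)

print(names, contents)
```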
  • Original post: https://www.cnblogs.com/merryblogs/p/14338870.html