A Brief Guide to Installing and Using the Scrapy Framework

    Scrapy Notes

    I. Installation:

    1. pip3 install wheel

    2. pip3 install lxml

    3. pip3 install pyopenssl

    4. pip3 install -i https://mirrors.aliyun.com/pypi/simple/ pypiwin32

    5. Download the Twisted wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
       and install it from its file path:
       pip3 install <path-to>Twisted-19.2.0-cp36-cp36m-win_amd64.whl

    6. pip3 install scrapy

    7. Run scrapy with no arguments to verify the install; it should print
       something like: Scrapy 1.6.0 - no active project ...
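    You can also confirm the install from Python itself; a quick sanity check:

        # verify that Scrapy imports cleanly and report its version
        import scrapy
        print(scrapy.__version__)   # e.g. '1.6.0'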

    II. Scrapy commands

    Global (available from any directory):
        bench         Run quick benchmark test
        fetch         Fetch a URL using the Scrapy downloader

          # creates the spider file in the current directory
        genspider     Generate new spider using pre-defined templates

        # runs a spider from the current directory, invoked as:
        #   runspider <absolute path to the spider file>
        runspider     Run a self-contained spider (without creating a project)

        settings     Get settings values
        shell         Interactive scraping console

        # create a new Scrapy project
        startproject Create new project
        version       Print Scrapy version
        view          Open URL in browser, as seen by Scrapy

    Project-only (available inside a Scrapy project):
        bench         Run quick benchmark test
        # checks spider contracts (basic syntax check)
        check         Check spider contracts
        # runs a spider by its name
        crawl         Run a spider

        # !!! this is the command you will use most !!!
        scrapy crawl <name>

        edit         Edit spider
        fetch         Fetch a URL using the Scrapy downloader
        genspider     Generate new spider using pre-defined templates
        # lists all spiders in the project
        list         List available spiders

        parse         Parse URL (using its spider) and print the results

        runspider     Run a self-contained spider (without creating a project)

        settings     Get settings values
        shell         Interactive scraping console
        startproject Create new project
        version       Print Scrapy version
        view         Open URL in browser, as seen by Scrapy
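    Of these, shell is the handiest for experimenting: "scrapy shell <url>" drops
    you into a Python prompt with the fetched response preloaded. A minimal
    session (output and selector are illustrative, not taken from these notes):

        >>> response.status                              # HTTP status of the fetched page
        200
        >>> response.css('title::text').extract_first()  # text of the <title> tag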

     

    III. Creating a Scrapy project

    For comparison, Django:
        # create a project
        django-admin startproject P1
        cd P1

        # create apps
        python3 manage.py startapp app01
        python3 manage.py startapp bbs
    Scrapy:
      # create a project
      scrapy startproject spider_project
      cd spider_project

      # create a spider (name: chouti, allowed domain: chouti.com)
      scrapy genspider chouti chouti.com

      # run the spider
      scrapy crawl chouti

      # run the spider without logging
      scrapy crawl --nolog chouti
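    For reference, startproject generates roughly this layout:

        spider_project/
            scrapy.cfg            # deploy configuration
            spider_project/
                __init__.py
                items.py          # item definitions
                middlewares.py    # downloader/spider middlewares
                pipelines.py      # item pipelines
                settings.py       # project settings
                spiders/          # spider modules live here
                    __init__.py
                    chouti.py     # created by genspider above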

    settings.py:
      # do not obey the site's robots.txt rules
      ROBOTSTXT_OBEY = False
    main.py (lets you run/debug the spider from an IDE instead of a terminal):
      from scrapy.cmdline import execute
      # execute(['scrapy', 'crawl', 'baidu'])
      execute("scrapy crawl --nolog chouti".split(' '))

     


    spiders/chouti.py

      import scrapy

      class ChoutiSpider(scrapy.Spider):
          name = 'chouti'  # the name used by "scrapy crawl chouti"

          def parse(self, response):
              ...  # parse the news-list page

          def parse_user_index(self, response):
              ...  # parse a user's home page
    items.py
      # item class for the news list
      class SpiderNewListItem(scrapy.Item):
          # define the fields for your item here like:
          # news link
          new_url = scrapy.Field()
          # news text
          new_text = scrapy.Field()
          # number of likes
          nice_num = scrapy.Field()
          # news ID
          new_id = scrapy.Field()
          # number of comments
          commit_num = scrapy.Field()
          # news detail
          new_content = scrapy.Field()
          # home page of the user who posted the news
          user_link = scrapy.Field()

      # item class for a user's news list
      class SpiderUserListItem(scrapy.Item):
          # define the fields for your item here like:
          # news link
          new_url = scrapy.Field()
          # news text
          new_text = scrapy.Field()
          # number of likes
          nice_num = scrapy.Field()
          # news ID
          new_id = scrapy.Field()
          # number of comments
          commit_num = scrapy.Field()
          # news detail
          new_content = scrapy.Field()
          # username
          user_name = scrapy.Field()
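    How the spider fills these items is not shown in the notes; a minimal
    sketch, assuming hypothetical CSS selectors (the real chouti.com markup
    would need to be inspected):

      import scrapy
      from spider_project.items import SpiderNewListItem

      class ChoutiSpider(scrapy.Spider):
          name = 'chouti'
          start_urls = ['https://dig.chouti.com/']

          def parse(self, response):
              # 'div.news-item' and the inner selectors are made up for illustration
              for div in response.css('div.news-item'):
                  item = SpiderNewListItem()
                  item['new_url'] = div.css('a.show-content::attr(href)').extract_first()
                  item['new_text'] = div.css('a.show-content::text').extract_first()
                  yield item
                  # follow the poster's home page with a second callback
                  user_link = div.css('a.user-a::attr(href)').extract_first()
                  if user_link:
                      yield scrapy.Request(user_link, callback=self.parse_user_index)

          def parse_user_index(self, response):
              ...  # would build SpiderUserListItem objects the same way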

    pipelines.py

      class SpiderNewListPipeline(object):
          def __init__(self, ip, port, mongo_db):
              self.ip = ip
              self.port = port
              self.mongo_db = mongo_db

    settings.py
      # pipelines only run if ITEM_PIPELINES is configured
      # (lower number = higher priority)
      ITEM_PIPELINES = {
          'spider_project.pipelines.SpiderNewListPipeline': 300,
          'spider_project.pipelines.SpiderUserListPipeline': 301,
      }

      # MongoDB connection settings
      IP = 'localhost'
      PORT = 27017
      DB = 'chouti'
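    The notes stop at __init__; a fuller sketch of how such a pipeline could
    read those settings and write items to MongoDB (the pymongo usage and the
    'new_list' collection name are assumptions, not from the notes):

      import pymongo

      class SpiderNewListPipeline(object):
          def __init__(self, ip, port, mongo_db):
              self.ip = ip
              self.port = port
              self.mongo_db = mongo_db

          @classmethod
          def from_crawler(cls, crawler):
              # pull the IP/PORT/DB values defined in settings.py
              return cls(
                  ip=crawler.settings.get('IP'),
                  port=crawler.settings.get('PORT'),
                  mongo_db=crawler.settings.get('DB'),
              )

          def open_spider(self, spider):
              # open one MongoDB connection for the whole crawl
              self.client = pymongo.MongoClient(self.ip, self.port)
              self.db = self.client[self.mongo_db]

          def close_spider(self, spider):
              self.client.close()

          def process_item(self, item, spider):
              # store each scraped news item as one document
              self.db['new_list'].insert_one(dict(item))
              return item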