  • scrapy

    Windows 10

    I'm using Python 3.7 here.

    Create a virtual environment

    pip install virtualenv
    pip install virtualenvwrapper-win
    # create a virtual environment named first_pro
    mkvirtualenv first_pro
    # if this fails with "'mkvirtualenv' is not recognized as an internal or external command",
    # add the directory containing mkvirtualenv.bat to the PATH environment variable
    # delete the virtual environment
    rmvirtualenv first_pro
    # leave the virtual environment
    deactivate
    # activate the virtual environment first_pro
    workon first_pro
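
    To confirm the environment is active (a quick check, not part of the original post): workon with no arguments lists the environments managed by virtualenvwrapper-win, and the Windows where command shows which interpreter is first on PATH; it should live under the first_pro environment's Scripts folder.

    # list all virtual environments
    workon
    # show which python.exe will be used
    where python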


    Install the scrapy package

    pip install scrapy
    If that fails, follow these steps:
    1. Download Twisted from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    cd into the directory containing the downloaded wheel, then: pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl
    2. Install scrapy
    pip install scrapy
    If this fails with "Consider using the `--user` option or check the permissions":
    pip install --user scrapy
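
    To check that the install succeeded (with the virtual environment still active):

    # print the installed Scrapy version
    scrapy version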

    Create a Scrapy project

    Step 1:
    scrapy startproject mySpider
    Step 2:
    cd mySpider
    Step 3:
    scrapy genspider lvbo s.hc360.com
    # genspider takes the spider name and the domain it is allowed to crawl
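
    genspider creates mySpider/spiders/lvbo.py with a skeleton roughly like the one below (the exact template varies slightly between Scrapy versions). Note that the final spider later empties allowed_domains, presumably so detail pages on other hc360 subdomains are not filtered out.

    import scrapy


    class LvboSpider(scrapy.Spider):
        name = 'lvbo'
        allowed_domains = ['s.hc360.com']
        start_urls = ['http://s.hc360.com/']

        def parse(self, response):
            pass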

    Scrapy directory structure

    mySpider
      -mySpider
        -spiders
          __init__.py
          lvbo.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
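
    In brief: spiders/ holds the spider modules (lvbo.py is edited below), items.py declares the fields a scraped item can carry, middlewares.py hooks into the request/response cycle, pipelines.py post-processes and stores the items the spiders yield, and settings.py holds project-wide configuration. startproject also creates a scrapy.cfg file in the outer mySpider directory, alongside the inner package.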

    Example: crawling hc360.com (慧聪网) for aluminum foil (铝箔) listings

    lvbo.py

    import scrapy
    from mySpider.items import MyspiderItem
    import time
    
    
    class LvboSpider(scrapy.Spider):
        name = 'lvbo'
        allowed_domains = []
        start_urls = ['https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94']
    
        def parse(self, response):
            li_list = response.xpath("//div[@class='s-layout']//div[@class='wrap-grid']//li")
            for li in li_list:
                item = MyspiderItem()
                url = li.xpath(".//div[@class='NewItem']//a/@href").extract_first()
                if url:
                    url = 'https:' + url
                    yield scrapy.Request(
                        url,
                        callback=self.parse_detail,
                        meta={"item": item}
                    )
                time.sleep(0.2)
    
            # pagination: follow the 下一页 ("next page") link
            next_url = response.xpath("//a[text()='下一页']/@href").extract_first()
            if next_url:
                next_url = 'https:' + next_url
                print(next_url)
                yield scrapy.Request(
                    next_url,
                    callback=self.parse
                )
    
        def parse_detail(self, response):
            item = response.meta["item"]
    
            company_name = response.xpath("//div[@class='word-box']/div/div[@class='p sate']/em/text()").extract_first()
            name = response.xpath("//div[@class='word-box']/div/div[@class='p name']/em/text()").extract_first()
            phone = response.xpath("//div[@class='word-box']/div/div[@class='p tel2']/em/text()").extract_first()
            if company_name and name and phone:
                item["company_name"] = company_name.lstrip("")
                item["name"] = name.replace(u'xa0', u' ').strip()
                item["phone"] = phone.lstrip("")
                print(item)
                yield item
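    Two notes on the spider above. The time.sleep(0.2) call blocks Scrapy's event loop; the DOWNLOAD_DELAY setting shown in settings.py below is the idiomatic way to throttle requests, so the sleep can simply be dropped. Also, XPath expressions like these are easiest to debug in the Scrapy shell before running the full crawl (the selectors themselves depend on the live page markup):

    # fetch the search page and drop into an interactive shell
    scrapy shell "https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94"
    # then experiment, e.g.:
    # response.xpath("//div[@class='wrap-grid']//li//a/@href").extract_first()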

    settings.py

    # log level
    LOG_LEVEL = "WARNING"
    # do not obey robots.txt
    ROBOTSTXT_OBEY = False
    # wait 3 seconds between downloads
    DOWNLOAD_DELAY = 3
    # enable the item pipeline
    ITEM_PIPELINES = {
       'mySpider.pipelines.MyspiderPipeline': 300,
    }
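
    If the site starts returning empty or blocked pages, it may also help to send a browser-like User-Agent. USER_AGENT is a standard Scrapy setting; the value below is only an illustrative example, not part of the original project:

    # pretend to be a desktop browser (example value)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0 Safari/537.36'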

    pipelines.py

    import json
    
    class MyspiderPipeline(object):
        # open the output file when the spider starts
        def open_spider(self, spider):
            self.file = open('lvbo.txt', 'w', encoding='utf-8')

        # close the file when the spider finishes
        def close_spider(self, spider):
            self.file.close()

        # write each item to the file as one JSON line
        def process_item(self, item, spider):
            # ensure_ascii=False keeps the Chinese text readable in the output
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item
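
    As an aside, for a quick test Scrapy's built-in feed export can dump the yielded items without writing any pipeline:

    # write all items to a JSON file (use .csv or .jl for other formats)
    scrapy crawl lvbo -o lvbo.json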

    items.py

    import scrapy
    
    
    class MyspiderItem(scrapy.Item):
        # define the fields for your item here like:
        company_name = scrapy.Field()
        name = scrapy.Field()
        position = scrapy.Field()
        phone = scrapy.Field()
        site = scrapy.Field()
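
    Note that position and site are declared here but never filled in by lvbo.py, which is harmless. The reverse is not: a Scrapy Item only accepts keys declared as Field objects, so if the spider is extended to store an extra value (say, a hypothetical item['address']), the corresponding Field must be added to this class first.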

    Run

    In a command prompt, change into the mySpider directory: cd mySpider

    ...mySpider> scrapy crawl lvbo
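
    Because the pipeline opens lvbo.txt with a relative path, the output file is created in whatever directory the command is run from, in this case the outer mySpider directory.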

    To debug with PyCharm, create a start.py file at the same level as settings.py

    start.py

    # -*- coding:utf-8 -*-
    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl lvbo".split())

    In PyCharm's Run/Debug Configuration, set Script path to C:\code\mySpider\mySpider\start.py (adjust to wherever the project lives).
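
    As an alternative to the cmdline trick, the spider can also be started entirely from Python. This is a sketch using Scrapy's CrawlerProcess API, not something from the original post:

    # start.py, alternative version
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from mySpider.spiders.lvbo import LvboSpider

    # load settings.py so the pipeline and DOWNLOAD_DELAY still apply
    process = CrawlerProcess(get_project_settings())
    process.crawl(LvboSpider)
    process.start()  # blocks until the crawl finishes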
