  • Python Scrapy Crawler Framework, Example 1

    Goal: scrape Tencent's job postings. For each posting we want the position name, the detail-page link, the position category, the number of openings, the work location, and the publish date.

    1. Create the Scrapy project

    scrapy startproject Tencent

    The command creates a Tencent folder containing Scrapy's standard project skeleton.
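    The original post showed the layout as a screenshot that did not survive; the skeleton Scrapy generates looks roughly like this (exact files vary slightly between Scrapy versions):

    ```
    Tencent/
    ├── scrapy.cfg          # deployment configuration
    └── Tencent/
        ├── __init__.py
        ├── items.py        # item field definitions
        ├── pipelines.py    # item processing / persistence
        ├── settings.py     # project settings
        └── spiders/
            └── __init__.py
    ```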

    2. Write the item file, defining one field per piece of content to scrape

    # -*- coding: utf-8 -*-

    import scrapy

    class TencentItem(scrapy.Item):
        # position name
        positionname = scrapy.Field()
        # detail link
        positionlink = scrapy.Field()
        # position category
        positionType = scrapy.Field()
        # number of openings
        peopleNum = scrapy.Field()
        # work location
        workLocation = scrapy.Field()
        # publish date
        publishTime = scrapy.Field()
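    Each scrapy.Field() declaration registers a key; the resulting item is then read and written with plain dict-style access. Because that is just dict semantics, the idea can be sketched without Scrapy installed, using a plain dict as a stand-in (the values below are hypothetical):

    ```python
    # Stand-in for item = TencentItem(): a scrapy.Item exposes exactly
    # the declared fields through dict-style access.
    item = {}
    item["positionname"] = "Backend Engineer"   # hypothetical value
    item["workLocation"] = "Shenzhen"           # hypothetical value

    # dict(item) is what the pipeline later serialises to JSON.
    print(dict(item))
    ```
    
    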

    3. Write the spider file

    Go into the Tencent directory and generate a basic spider:

    #  tencentPosition is the spider name; tencent.com limits its crawl scope
    scrapy genspider tencentPosition "tencent.com"

    The command creates a tencentPosition.py file under the spiders folder; now fill it in:

    # -*- coding: utf-8 -*-
    import scrapy
    from Tencent.items import TencentItem


    class TencentpositionSpider(scrapy.Spider):
        """
        Purpose: scrape Tencent job postings
        """
        # spider name
        name = "tencentPosition"
        # domains the spider is allowed to crawl
        allowed_domains = ["tencent.com"]

        url = "http://hr.tencent.com/position.php?&start="
        offset = 0
        # starting URL
        start_urls = [url + str(offset)]

        def parse(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                # instantiate an item for this table row
                item = TencentItem()
                # position name
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
                # detail link
                item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
                # position category
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
                # number of openings
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
                # work location
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
                # publish date
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

                yield item

            # after each page is processed, bump the offset by 10, append it to
            # the base URL, and re-issue the request with self.parse as callback
            if self.offset < 1680:
                self.offset += 10
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
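    The pagination logic above simply appends an increasing offset to a fixed base URL. The sequence of pages the spider requests can be reproduced in plain Python (base URL, step, and cutoff taken from the spider):

    ```python
    # Reproduce the spider's pagination scheme: start at offset 0 and
    # step by 10 (one page of postings) up to the 1680 cutoff.
    base_url = "http://hr.tencent.com/position.php?&start="

    urls = [base_url + str(offset) for offset in range(0, 1681, 10)]

    print(urls[0])    # first page requested
    print(urls[-1])   # last page requested
    print(len(urls))  # total number of pages
    ```
    
    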

    4. Write the pipelines file

    # -*- coding: utf-8 -*-
    import json


    class TencentPipeline(object):
        """
        Purpose: save item data
        """

        def __init__(self):
            # binary mode so the utf-8 encoded bytes below can be written
            self.filename = open("tencent.json", "wb")

        def process_item(self, item, spider):
            text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.filename.write(text.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.filename.close()
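    The key detail in process_item is json.dumps(..., ensure_ascii=False), which keeps the Chinese field values readable in the output file instead of escaping them as \uXXXX sequences. A quick sketch with a hypothetical item dict:

    ```python
    import json

    # A hypothetical scraped item, as dict(item) would produce it.
    item = {"positionname": u"后台开发工程师", "workLocation": u"深圳"}

    escaped = json.dumps(item)                       # default: non-ASCII escaped as \uXXXX
    readable = json.dumps(item, ensure_ascii=False)  # keeps the original characters

    print(readable)
    ```
    
    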

    5. Settings file (the key options)

    # Set default request headers
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)",
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    }

    # Enable the item pipeline
    ITEM_PIPELINES = {
        'Tencent.pipelines.TencentPipeline': 300,
    }

    Run the spider:

    # tencentPosition is the spider name
    scrapy crawl tencentPosition

    Rewriting with the CrawlSpider class

    # create the project
    scrapy startproject TencentSpider

    # enter the project directory, then create the spider file
    scrapy genspider -t crawl tencent tencent.com

    The item and other files stay the same; only the spider file changes:

    # -*- coding:utf-8 -*-

    import scrapy
    # import the CrawlSpider class and Rule
    from scrapy.spiders import CrawlSpider, Rule
    # import the link-extractor class used to pull out matching links
    from scrapy.linkextractors import LinkExtractor
    from TencentSpider.items import TencentItem

    class TencentSpider(CrawlSpider):
        name = "tencent"
        allowed_domains = ["hr.tencent.com"]
        start_urls = ["http://hr.tencent.com/position.php?&start=0#a"]

        # extraction rule for links in the response: returns every link
        # on the page whose URL matches the pattern
        pagelink = LinkExtractor(allow=(r"start=\d+"))

        rules = [
            # request each extracted link, keep following links from the new
            # pages, and handle every response with the named callback
            Rule(pagelink, callback="parseTencent", follow=True)
        ]

        # the callback named in the rule
        def parseTencent(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                item = TencentItem()
                # position name
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
                # detail link
                item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
                # position category
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
                # number of openings
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
                # work location
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
                # publish date
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

                yield item
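    The LinkExtractor above hinges on the regex start=\d+ (note the backslash; the unescaped "start=d+" that appears in some copies of this post matches nothing). How the pattern singles out pagination links can be checked directly with Python's re module, using a few example URLs of the kind seen on the listing page:

    ```python
    import re

    # The pattern passed to the LinkExtractor rule above.
    pattern = re.compile(r"start=\d+")

    links = [
        "http://hr.tencent.com/position.php?&start=10#a",     # pagination link
        "http://hr.tencent.com/position_detail.php?id=42",    # detail page
        "http://hr.tencent.com/position.php?&start=1680#a",   # pagination link
    ]

    # Only the pagination links match; the detail page is left out.
    matched = [link for link in links if pattern.search(link)]
    print(matched)
    ```
    
    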
  • Original article: https://www.cnblogs.com/wq-mr-almost/p/10208569.html