Python Crawler Framework Scrapy: Example (1)

Goal: crawl Tencent's social recruitment postings. Fields to extract: position name, detail-page link, position category, number of openings, work location, and publish time.

1. Create the Scrapy project

    scrapy startproject Tencent

After the command runs, a Tencent folder is created with the standard Scrapy project layout.
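For reference, a freshly generated project typically looks roughly like this (the exact files vary slightly by Scrapy version):

Tencent/
├── scrapy.cfg            # deployment configuration
└── Tencent/
    ├── __init__.py
    ├── items.py          # item field definitions
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── __init__.py   # spider modules live in this package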

2. Write the items file, defining a field for each piece of data to crawl

# -*- coding: utf-8 -*-
import scrapy


class TencentItem(scrapy.Item):
    # position name
    positionname = scrapy.Field()
    # detail-page link
    positionlink = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish time
    publishTime = scrapy.Field()


3. Write the spider file

Enter the Tencent directory and generate a basic spider with:

# tencentPosition is the spider name, tencent.com is the crawl scope
scrapy genspider tencentPosition "tencent.com"

After the command runs, a tencentPosition.py file appears in the spiders folder.
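The generated file is only a skeleton; it typically looks something like this (the exact template depends on the Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass

Now flesh it out into the full spider: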

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    """
    Crawl Tencent's social recruitment listings
    """
    # spider name
    name = "tencentPosition"
    # crawl scope
    allowed_domains = ["tencent.com"]

    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    # start URL
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialize an item
            item = TencentItem()
            # position name
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail-page link
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # position category
            item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            # publish time
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

            yield item

        if self.offset < 1680:
            self.offset += 10

        # After finishing one page, send the request for the next one:
        # self.offset has been incremented by 10, is appended to the base url,
        # and the response is handled by the same self.parse callback
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
    
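Before wiring up the pipeline, the XPath expressions can be sanity-checked interactively with scrapy shell. A sketch, assuming the listing page is still reachable and keeps this table layout:

scrapy shell "http://hr.tencent.com/position.php?&start=0"
>>> rows = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
>>> rows[0].xpath("./td[1]/a/text()").extract()[0]   # first position name on the page
>>> rows[0].xpath("./td[1]/a/@href").extract()[0]    # its detail link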

4. Write the pipelines file

# -*- coding: utf-8 -*-
import json


class TencentPipeline(object):
    """
    Save the item data to a JSON file
    """

    def __init__(self):
        self.filename = open("tencent.json", "w")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
    
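Each item is written as one JSON object per line in tencent.json (the trailing comma and newline come from process_item). With placeholder values, the file looks roughly like:

{"positionname": "...", "positionlink": "...", "positionType": "...", "peopleNum": "...", "workLocation": "...", "publishTime": "..."},
{"positionname": "...", "positionlink": "...", "positionType": "...", "peopleNum": "...", "workLocation": "...", "publishTime": "..."},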

5. Settings file (the main settings)

# set the default request headers
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}


# enable the item pipeline
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}
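Depending on the Scrapy version, the crawl may also need the following two settings; treat these as an assumption about the environment, not something every setup requires:

# newer Scrapy releases obey robots.txt by default, which can block this crawl
ROBOTSTXT_OBEY = False
# be gentle with the server: wait a second between requests
DOWNLOAD_DELAY = 1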


Run the spider:

# tencentPosition is the spider name
scrapy crawl tencentPosition

Rewriting it with the CrawlSpider class

# create the project
scrapy startproject TencentSpider


# inside the project directory, generate a CrawlSpider-based spider
scrapy genspider -t crawl tencent tencent.com

The items file and the other files stay the same; the main change is the spider itself.

# -*- coding:utf-8 -*-
import scrapy
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the link extractor, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
from TencentSpider.items import TencentItem


class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["hr.tencent.com"]
    start_urls = ["http://hr.tencent.com/position.php?&start=0#a"]

    # link-extraction rule applied to each response:
    # returns the links in the page that match the pattern
    pagelink = LinkExtractor(allow=("start=\d+"))

    rules = [
        # request every extracted link, keep following links on the new pages,
        # and handle each response with the named callback
        Rule(pagelink, callback="parseTencent", follow=True)
    ]

    # the callback named in the rule
    def parseTencent(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            # position name
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail-page link
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # position category
            item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            # publish time
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

            yield item
    
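To see which pages the rule will actually follow, the same extractor can be run by hand in scrapy shell. A sketch, assuming the listing page is reachable:

scrapy shell "http://hr.tencent.com/position.php?&start=0#a"
>>> from scrapy.linkextractors import LinkExtractor
>>> links = LinkExtractor(allow=("start=\d+")).extract_links(response)
>>> [link.url for link in links][:3]   # the pagination links the Rule would request

Running the CrawlSpider version works the same way as before:

scrapy crawl tencent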