  • Scrapy basics: CrawlSpiders (crawling Tencent campus recruitment):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from tencent.items import TencentItem

    class TencentSpider(CrawlSpider):
        name = "Tencent"
        allowed_domains = ["tencent.com"]
        # url="http://hr.tencent.com/position.php?&start="
        # offset=0
        start_urls = [ "http://hr.tencent.com/position.php?&start=0#a"]

        # Match pagination links such as "start=10"; \d+ means one or more digits
        page_link=LinkExtractor(allow=r"start=\d+")

        rules=[
                Rule(page_link,callback = "parseContent",follow=True)
        ]

        def parseContent(self, response):
            # Each job posting is a table row with class "even" or "odd"
            rows=response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
            for infos in rows:
                item=TencentItem()
                item['positionname']=infos.xpath("./td[1]/a/text()").extract()[0]
                item['positionlink']=infos.xpath("./td[1]/a/@href").extract()[0]
                # extract_first() returns None instead of raising if this cell has no text
                item['positionType']=infos.xpath("./td[2]/text()").extract_first()
                item['positionNum']=infos.xpath("./td[3]/text()").extract()[0]
                item['positionLocation']=infos.xpath("./td[4]/text()").extract()[0]
                item['publishTime']=infos.xpath("./td[5]/text()").extract()[0]

                yield item
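
    The allow argument of LinkExtractor is treated as a regular expression and searched against each candidate URL, so the pagination pattern needs \d+ (one or more digits). The sketch below uses only the standard re module to compare the pattern with and without the backslash. The two extra URLs (start=10 and start=20) are assumed examples of later pages, not taken from the site.

```python
import re

# Pattern with the backslash lost: "start=" followed by literal letter(s) "d".
broken = re.compile(r"start=d+")
# Intended pattern: "start=" followed by one or more digits.
fixed = re.compile(r"start=\d+")

# The first URL is the spider's start URL; the other two are assumed later pages.
urls = [
    "http://hr.tencent.com/position.php?&start=0#a",
    "http://hr.tencent.com/position.php?&start=10#a",
    "http://hr.tencent.com/position.php?&start=20#a",
]

# re.search looks for the pattern anywhere in the URL.
broken_hits = [u for u in urls if broken.search(u)]
fixed_hits = [u for u in urls if fixed.search(u)]

print(len(broken_hits), len(fixed_hits))  # prints: 0 3
```

    With the backslash missing, no pagination link matches, so the rule never fires and the callback is never called.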


    Run: scrapy crawl Tencent
    # Note: never name the callback parse. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will fail to run.
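
    Why overriding parse breaks things can be illustrated with a toy model. MiniCrawlSpider below is a hypothetical, heavily simplified stand-in and not Scrapy's actual implementation: its parse method is the dispatcher that matches links against rules and schedules each rule's callback. All class names, the page dictionary, and the rule tuples are assumptions made for this sketch.

```python
class MiniCrawlSpider:
    """Hypothetical stand-in: parse() is the rule dispatcher."""

    rules = []  # list of (link_substring, callback_name) pairs

    def parse(self, page):
        # Schedule every link that matches a rule, paired with that rule's callback.
        scheduled = []
        for link in page["links"]:
            for substring, callback_name in self.rules:
                if substring in link:
                    scheduled.append((link, callback_name))
        return scheduled


class GoodSpider(MiniCrawlSpider):
    rules = [("start=", "parseContent")]

    def parseContent(self, page):
        return ["item from " + page["url"]]


class BadSpider(MiniCrawlSpider):
    rules = [("start=", "parse")]

    def parse(self, page):
        # This replaces the dispatcher above, so no link is ever scheduled.
        return ["item from " + page["url"]]


page = {"url": "position.php?start=0", "links": ["position.php?start=10"]}
good_result = GoodSpider().parse(page)
bad_result = BadSpider().parse(page)
print(good_result)  # prints: [('position.php?start=10', 'parseContent')]
print(bad_result)   # prints: ['item from position.php?start=0']
```

    In this model GoodSpider keeps the dispatcher and follows the next page, while BadSpider only ever processes the first page, because its parse has replaced the rule handling.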
  • Original article: https://www.cnblogs.com/huwei934/p/6971251.html