  • Scraping Tencent recruitment postings

    scrapy startproject insist                      # create the project
    scrapy genspider teng careers.tencent.com       # create the spider (spider name + allowed domain)
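    These two commands generate the standard Scrapy skeleton; the files edited below sit in it roughly like this (default template, minor files such as __init__.py and middlewares.py omitted):

    insist/
        scrapy.cfg
        insist/
            items.py        # item definitions
            pipelines.py    # item pipelines
            settings.py     # project settings
            spiders/
                teng.py     # the spider created by genspider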
    
    items.py
    # Fields to be scraped
    import scrapy

    class InsistItem(scrapy.Item):
        # define the fields for your item here like:
        positionname = scrapy.Field()   # post title
        type = scrapy.Field()           # business group (BG)
        place = scrapy.Field()          # work location
        mian = scrapy.Field()           # job category
        time = scrapy.Field()           # last update time
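    A Scrapy Item behaves like a dict, which is what the pipeline below relies on when it calls dict(item). A minimal illustration (the field value here is made up):

    item = InsistItem()
    item['positionname'] = 'Backend Engineer'
    print(dict(item))   # {'positionname': 'Backend Engineer'}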
    
    
    pipelines.py
    # Save the scraped items to a JSON file (or a database)
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    import json

    class InsistPipeline(object):
        def __init__(self):
            self.f = open('teng.json', 'w', encoding='utf-8')   # open the output file with UTF-8 encoding

        def process_item(self, item, spider):
            # item is the Item object being scraped; this method must be implemented
            content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.f.write(content)
            return item

        def close_spider(self, spider):
            # close the output file when the spider finishes
            self.f.close()
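    As the comment above notes, the pipeline only runs once it is enabled in settings.py. A minimal sketch, assuming the default project name insist created earlier:

    # settings.py
    ITEM_PIPELINES = {
        'insist.pipelines.InsistPipeline': 300,   # priority 0-1000; lower numbers run first
    }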
    
    teng.py
    import scrapy
    import json
    from insist.items import InsistItem

    class TengSpider(scrapy.Spider):
        name = 'teng'
        allowed_domains = ['careers.tencent.com']
        # The recruitment site serves postings from a JSON API, so we request it directly, page by page
        baseURL = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageSize=10&pageIndex='
        offset = 1
        start_urls = [baseURL + str(offset)]

        def parse(self, response):
            contents = json.loads(response.text)
            jobs = contents['Data']['Posts']
            for job in jobs:
                item = InsistItem()
                item['positionname'] = job['RecruitPostName']
                item['type'] = job['BGName']
                item['place'] = job['LocationName']
                item['mian'] = job['CategoryName']
                item['time'] = job['LastUpdateTime']
                yield item
            # Request the next page until the offset passes 10
            if self.offset <= 10:
                self.offset += 1
                yield scrapy.Request(self.baseURL + str(self.offset), callback=self.parse)
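    With everything in place, run the spider from the project root; results accumulate in teng.json as one JSON object per line (the filename comes from the pipeline above):

    scrapy crawl teng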
  • Original post: https://www.cnblogs.com/persistence-ok/p/11553716.html