  • Using pymongo, with a case study crawling Tencent recruitment postings

    I. Working with MongoDB in Python 3

      1. Connection prerequisites

    • Install the pymongo library
    • Start the MongoDB server (if it is started in the foreground, keep that window open; closing the window also shuts down the server)

      2. Usage

    import pymongo
    # To connect to MongoDB, use MongoClient; normally passing the MongoDB host and port is enough.
    # The first argument is host, the second is port (default 27017).
    client=pymongo.MongoClient(host='127.0.0.1',port=27017)
    # This gives us a client object.
    # Alternatively, the first argument of MongoClient can be a MongoDB connection string starting with mongodb://,
    # e.g. client = MongoClient('mongodb://localhost:27017/') achieves the same connection.
    # print(client)
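    # Optional sanity check (a sketch added here for illustration, not part of the original example):
    # the 'ping' admin command raises an exception if the server cannot be reached.
    # try:
    #     client.admin.command('ping')
    #     print('MongoDB server is reachable')
    # except pymongo.errors.ConnectionFailure as e:
    #     print('cannot reach MongoDB server:', e)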

    ################### Select a database
    db=client.test
    # This also works:
    # db=client['test']


    ################## Select a collection
    collections=db.student
    # This also works:
    # collections=db['student']

    ################### Insert data
    # student={
    # 'id':'1111',
    # 'name':'xiaowang',
    # 'age':20,
    # 'sex':'boy',
    # }
    #
    # res=collections.insert(student)
    # print(res)
    # In MongoDB, every document is uniquely identified by an _id field.
    # If _id is not specified explicitly, MongoDB automatically generates an _id of type ObjectId.
    # The return value of insert() is that _id value, e.g. 5c7fb5ae35573f14b85101c0


    # You can also insert multiple documents at once
    # student1={
    # 'name':'xx',
    # 'age':20,
    # 'sex':'boy'
    # }
    #
    # student2={
    # 'name':'ww',
    # 'age':21,
    # 'sex':'girl'
    # }
    # student3={
    # 'name':'xxx',
    # 'age':22,
    # 'sex':'boy'
    # }
    #
    # result=collections.insert_many([student1,student2,student3])
    # print(result)
    # Here the return value is not the _id values but an InsertManyResult object;
    # we can read its inserted_ids attribute to get the _id values.

    # There are two recommended insert methods:
    # insert_one() for a single document, and insert_many() for multiple documents passed in as a list.
    # The legacy insert() also works: pass a single document directly, or several documents as a list
    # (note that insert() is deprecated in newer pymongo releases and removed in pymongo 4.x).
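    # A minimal insert_one sketch (added for illustration; the field values are made up):
    # one_result = collections.insert_one({'name': 'yy', 'age': 23, 'sex': 'girl'})
    # print(one_result.inserted_id)   # insert_one returns an InsertOneResult with an inserted_id attribute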


    ################### Query a single document
    # re=collections.find_one({'name':'xx'})
    # print(re)
    # print(type(re))
    #{'_id': ObjectId('5c7fb8d535573f13f85a6933'), 'name': 'xx', 'age': 20, 'sex': 'boy'}
    # <class 'dict'>


    ##################### Query multiple documents
    # re=collections.find({'name':'xx'})
    # print(re)
    # print(type(re))
    # for r in re:
    # print(r)
    # The result is a Cursor object; we can iterate over it to get each document.
    # <pymongo.cursor.Cursor object at 0x000000000A98E630>
    # <class 'pymongo.cursor.Cursor'>


    # re=collections.find({'age':{'$gt':20}})
    # print(re)
    # print(type(re))
    # for r in re:
    # print(r)
    # Here the query condition value is no longer a plain number but a dict whose key is the comparison
    # operator $gt (greater than) and whose value is 20, so this returns all documents with age greater than 20.

    # The comparison operators are summarized in the table below:
    """
    Symbol   Meaning                    Example
    $lt      less than                  {'age': {'$lt': 20}}
    $gt      greater than               {'age': {'$gt': 20}}
    $lte     less than or equal to      {'age': {'$lte': 20}}
    $gte     greater than or equal to   {'age': {'$gte': 20}}
    $ne      not equal to               {'age': {'$ne': 20}}
    $in      in the given list          {'age': {'$in': [20, 23]}}
    $nin     not in the given list      {'age': {'$nin': [20, 23]}}
    """

    # Query with a regular expression
    # re = collections.find({'name': {'$regex': '^x.*'}})
    # print(re)
    # print(type(re))
    # for r in re:
    # print(r)

    # Some additional query operators are summarized below:
    """
    Symbol   Meaning                  Example                                             Example meaning
    $regex   regular expression       {'name': {'$regex': '^M.*'}}                        name starts with M
    $exists  field exists             {'name': {'$exists': True}}                         the name field exists
    $type    type check               {'age': {'$type': 'int'}}                           age is of type int
    $mod     modulo                   {'age': {'$mod': [5, 0]}}                           age mod 5 equals 0
    $text    text search              {'$text': {'$search': 'Mike'}}                      a text-indexed field contains the string Mike
    $where   advanced JS condition    {'$where': 'obj.fans_count == obj.follows_count'}   the document's fans_count equals its follows_count
    """

    ################ Count
    # count=collections.find({'age':{'$gt':20}}).count()
    # print(count)
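    # Note: Cursor.count() is deprecated in newer pymongo (and removed in 4.x); a sketch of the
    # recommended replacement, count_documents(), which takes the filter directly:
    # count = collections.count_documents({'age': {'$gt': 20}})
    # print(count)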


    ################# Sort
    # result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING)
    # print([re['name'] for re in result])


    ########### Offset: you may want only some of the results. skip() offsets the cursor by a number of positions; e.g. skip(2) ignores the first two results and returns the third one onwards.
    # result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING).skip(1)
    # print([re['name'] for re in result])


    ################## You can also use limit() to specify how many results to fetch, for example:
    # results = collections.find().sort('age', pymongo.ASCENDING).skip(1).limit(2)
    # print([result['name'] for result in results])

    # Note that when the collection is very large (tens or hundreds of millions of documents), avoid querying with
    # a large offset, which can easily cause excessive memory use. Instead, record the last _id you fetched and
    # query with something like find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}}).
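    # A minimal sketch of that _id-based paging idea (the batch size of 100 is arbitrary):
    # last_id = None
    # while True:
    #     query = {} if last_id is None else {'_id': {'$gt': last_id}}
    #     batch = list(collections.find(query).sort('_id', pymongo.ASCENDING).limit(100))
    #     if not batch:
    #         break
    #     last_id = batch[-1]['_id']   # remember where this page ended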


    ################################ Update data
    # Use the update method to modify data
    # condition={'name':'xx'}
    # student=collections.find_one(condition)
    # student['age']=100
    # result=collections.update(condition,student)
    # print(result)

    # Here we update the age of the document whose name is xx: first specify the query condition,
    # fetch the document, change its age, then call update() with the original condition and the modified
    # document to complete the update.
    # {'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}
    # The return value is a dict: ok means the call succeeded, nModified is the number of documents modified.

    # update() is also officially discouraged; it is split into update_one() and update_many(), which are
    # stricter: the second argument must be a dict keyed by a $ update operator. Let's look at an example.

    # condition={'name':'xx'}
    # student=collections.find_one(condition)
    # print(student)
    # student['age']=112
    # result=collections.update_one(condition,{'$set':student})
    # print(result)
    # print(result.matched_count,result.modified_count)

    # Another example
    # condition={'age':{'$gt':20}}
    # result=collections.update_one(condition,{'$inc':{'age':1}})
    # print(result)
    # print(result.matched_count,result.modified_count)
    # Here the query condition is age greater than 20 and the update document is {'$inc': {'age': 1}};
    # after execution, the age of the first matching document is incremented by 1.
    # <pymongo.results.UpdateResult object at 0x000000000A99AB48>
    # 1 1

    # If you call update_many() instead, all matching documents are updated, for example:

    condition = {'age': {'$gt': 20}}
    result = collections.update_many(condition, {'$inc': {'age': 1}})
    print(result)
    print(result.matched_count, result.modified_count)
    # This time the matched count is no longer 1; the output looks like this:

    # <pymongo.results.UpdateResult object at 0x10c6384c8>
    # 3 3
    # As you can see, all matching documents are updated.


    # ############### Delete
    # Deleting is straightforward: call remove() with a condition and all matching documents are removed, for example:

    # result = collections.remove({'name': 'Kevin'})
    # print(result)
    # Output:

    # {'ok': 1, 'n': 1}
    # As with update, there are two newer recommended methods, delete_one() and delete_many(), for example:

    # result = collections.delete_one({'name': 'Kevin'})
    # print(result)
    # print(result.deleted_count)
    # result = collections.delete_many({'age': {'$lt': 25}})
    # print(result.deleted_count)
    # # Output:

    # <pymongo.results.DeleteResult object at 0x10e6ba4c8>
    # 1
    # 4
    # delete_one() removes the first matching document and delete_many() removes all matching documents; both
    # return a DeleteResult object whose deleted_count attribute gives the number of documents deleted.


    # More
    # PyMongo also provides combined methods such as find_one_and_delete(), find_one_and_replace() and
    # find_one_and_update(), i.e. find-then-delete/replace/update; their usage is basically the same as the methods above.
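    # A minimal find_one_and_update sketch (query and update values are illustrative):
    # updated = collections.find_one_and_update({'name': 'xx'}, {'$inc': {'age': 1}})
    # print(updated)   # by default this returns the document as it was before the update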

    II. Crawling Tencent recruitment postings

      Spider file

    # -*- coding: utf-8 -*-
    import scrapy
    from Tencent.items import TencentItem
    
    
    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        # allowed_domains = ['www.xxx.com']
        # base url, used to build the page urls below
        base_url = 'http://hr.tencent.com/position.php?&start='
        page_num = 0
        start_urls = [base_url + str(page_num)]
    
        def parse(self, response):
            tr_list = response.xpath("//tr[@class='even' ] | //tr[@class='odd']")
            # first grab the list of table rows holding the job postings, then loop over them
            for tr in tr_list:
                name = tr.xpath('./td[1]/a/text()').extract_first()
                url = tr.xpath('./td[1]/a/@href').extract_first()
                # the job category column is sometimes empty, which makes extract()[0] raise; in that case fall back to a placeholder value like this
                # if len(tr.xpath("./td[2]/text()")):
                #    worktype = tr.xpath("./td[2]/text()").extract()[0].encode("utf-8")
                # else:
                #     worktype = "NULL"
                # if it does not raise, just use this
                worktype = tr.xpath('./td[2]/text()').extract_first()
                num = tr.xpath('./td[3]/text()').extract_first()
                location = tr.xpath('./td[4]/text()').extract_first()
                publish_time = tr.xpath('./td[5]/text()').extract_first()
    
                item = TencentItem()
                item['name'] = name
                item['worktype'] = worktype
                item['url'] = url
                item['num'] = num
                item['location'] = location
                item['publish_time'] = publish_time
                print('----', name)
                print('----', url)
                print('----', worktype)
                print('----', location)
                print('----', num)
                print('----', publish_time)
    
                yield item
    
            # Pagination, method one:
            # the first approach, used when you already know how many pages there are,
            # i.e. when there is no "next page" link to follow and you can only build the url by concatenation.
            # if self.page_num<3060:
            #     self.page_num+=10
            #     url=self.base_url+str(self.page_num)
            #     # yield  scrapy.Request(url=url,callback=self.parse)
            #     yield  scrapy.Request(url, callback=self.parse)
    
            # Method two:
            # extract the "next page" link directly.
            # If the selector below matches nothing, the "next" button is still active, so this is not the last page;
            # grab the next page's relative url and join it onto the site root.
            if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
                next_url = response.xpath('//a[@id="next"]/@href').extract_first()
                url = 'https://hr.tencent.com/' + next_url
                yield scrapy.Request(url=url, callback=self.parse)

      pipeline

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymysql
    import json
    from redis import Redis
    import pymongo
    # store to a local file
    class TencentPipeline(object):
        f=None
        def open_spider(self,spider):
            self.f=open('./tencent2.txt','w',encoding='utf-8')
        def process_item(self, item, spider):
            self.f.write(item['name']+':'+item['url']+':'+item['num']+':'+item['worktype']+':'+item['location']+':'+item['publish_time']+'\n')
            return item
        def close_spider(self,spider):
            self.f.close()
    # store to MySQL
    class TencentPipelineMysql(object):
    
        conn=None
        cursor=None
        def open_spider(self,spider):
            self.conn=pymysql.connect(host='127.0.0.1',port=3306,user='root',password='123',db='tencent')
        def process_item(self,item,spider):
            print('mysql pipeline: process_item reached')
            self.cursor = self.conn.cursor()
            try:
                # parameterized query: let pymysql handle quoting/escaping of the values
                self.cursor.execute('insert into tencent values(%s,%s,%s,%s,%s,%s)', (item['name'],item['worktype'],item['url'],item['num'],item['publish_time'],item['location']))
                self.conn.commit()
            except Exception as e:
                print('insert error:', e)
                self.conn.rollback()
            return item
    
        def close_spider(self,spider):
            self.cursor.close()
            self.conn.close()
    
    
    # store to Redis
    class TencentPipelineRedis(object):
        conn=None
        def open_spider(self,spider):
            self.conn=Redis(host='127.0.0.1',port=6379)
    
        def process_item(self,item,spider):
            item_dic=dict(item)
            item_json=json.dumps(item_dic)
            self.conn.lpush('tencent',item_json)
            return item
    
    # store to MongoDB
    class TencentPipelineMongo(object):
        client=None
        def open_spider(self,spider):
            self.client=pymongo.MongoClient(host='127.0.0.1',port=27017)
            self.db=self.client['test']
    
        def process_item(self,item,spider):
            collection = self.db['tencent']
            item_dic=dict(item)
            # insert_one is the non-deprecated equivalent for a single document (insert() is removed in pymongo 4.x)
            collection.insert_one(item_dic)
    
            return item
    
        def close_spider(self,spider):
            self.client.close()

      settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for Tencent project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'Tencent'
    
    SPIDER_MODULES = ['Tencent.spiders']
    NEWSPIDER_MODULE = 'Tencent.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Tencent.middlewares.TencentSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'Tencent.pipelines.TencentPipeline': 300,
        'Tencent.pipelines.TencentPipelineMysql': 301,
        'Tencent.pipelines.TencentPipelineRedis': 302,
        'Tencent.pipelines.TencentPipelineMongo': 303,
    
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

      item

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class TencentItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name=scrapy.Field()
        url=scrapy.Field()
        worktype=scrapy.Field()
        location=scrapy.Field()
        num=scrapy.Field()
        publish_time=scrapy.Field()

     

  • Original post: https://www.cnblogs.com/tjp40922/p/10486317.html