  • How to write scraped data into ES

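    Earlier chapters covered the basics of ES; this one shows how to write the scraped data into it. The first thing to know is that the Python interface used here is elasticsearch-dsl.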

    Link: https://github.com/elastic/elasticsearch-dsl-py

    What is elasticsearch-dsl:

    Elasticsearch DSL is a high-level library whose purpose is to help write and run queries against Elasticsearch.
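
    To make "helps write queries" concrete, the sketch below hand-builds the kind of JSON body Elasticsearch expects for a full-text search, using only plain dicts. The helper name build_match_query and the search text are made up for illustration. With elasticsearch-dsl an equivalent body comes from Search().query("match", title=...), so the nesting does not have to be assembled by hand.

```python
import json


def build_match_query(field, text):
    """Return a raw Elasticsearch search body for a full-text match query."""
    # elasticsearch-dsl builds an equivalent dict from
    # Search().query("match", title="python").to_dict()
    return {"query": {"match": {field: text}}}


body = build_match_query("title", "python")
print(json.dumps(body, indent=2))
```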

    Installation:

    pip install elasticsearch-dsl

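    First, create a package named models in the project, then create a file named es_types.py inside it. This file defines the ES document type, that is, the fields and mapping for the data.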

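    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # Note: DocType and the Meta-based index/doc_type settings belong to older
    # elasticsearch-dsl releases; newer releases renamed DocType to Document.
    from elasticsearch_dsl import Date, DocType, Integer, Keyword, Text
    from elasticsearch_dsl.connections import connections

    # Register the default connection used by every DocType subclass.
    connections.create_connection(hosts=["localhost"])


    class ArticleType(DocType):
        """Document type for Jobbole (伯乐在线) articles."""
        # Text fields are analyzed with the IK Chinese analyzer for full-text search.
        title = Text(analyzer="ik_max_word")
        create_date = Date()
        # Keyword fields are stored as-is and matched exactly.
        url = Keyword()
        url_object_id = Keyword()
        front_image_url = Keyword()
        front_image_path = Keyword()
        praise_nums = Integer()
        comment_nums = Integer()
        fav_nums = Integer()
        tags = Text(analyzer="ik_max_word")
        content = Text(analyzer="ik_max_word")

        class Meta:
            index = "jobbile"
            doc_type = "article"


    if __name__ == "__main__":
        # Create the index and its mapping in Elasticsearch.
        ArticleType.init()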
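
    For orientation, here is roughly the mapping that the class above describes, written out as a plain dict. This is a hand-written approximation, not output captured from init(), and the exact envelope around it depends on the Elasticsearch and elasticsearch-dsl versions in use. It highlights the distinction that matters: text fields are analyzed (here with ik_max_word, which assumes the IK analysis plugin is installed), while keyword fields are kept as exact values.

```python
# Hand-written approximation of the "article" mapping; not generated output.
TEXT_IK = {"type": "text", "analyzer": "ik_max_word"}
KEYWORD = {"type": "keyword"}
INTEGER = {"type": "integer"}

article_properties = {
    "title": TEXT_IK,
    "create_date": {"type": "date"},
    "url": KEYWORD,
    "url_object_id": KEYWORD,
    "front_image_url": KEYWORD,
    "front_image_path": KEYWORD,
    "praise_nums": INTEGER,
    "comment_nums": INTEGER,
    "fav_nums": INTEGER,
    "tags": TEXT_IK,
    "content": TEXT_IK,
}

# Fields that are tokenized and therefore searchable by individual words.
analyzed = sorted(
    name for name, spec in article_properties.items() if spec["type"] == "text"
)
print(analyzed)
```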

    Then, in items.py, add a save_to_es method to the item class:

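    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from w3lib.html import remove_tags

    from models.es_types import ArticleType

    # Defined as a method on the article Item class in items.py.
    def save_to_es(self):
        """Copy this item's fields into an ArticleType document and index it."""
        article = ArticleType()
        article.title = self['title']
        article.create_date = self['create_date']
        # Strip HTML markup so only plain text is indexed.
        article.content = remove_tags(self['content'])
        article.front_image_url = self['front_image']
        # front_image_path exists only if the image was downloaded.
        if "front_image_path" in self:
            article.front_image_path = self['front_image_path']
        article.praise_nums = self['praise_nums']
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.url = self['url']
        article.tags = self['tags']
        # Use url_object_id as the document id.
        article.meta.id = self['url_object_id']

        article.save()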
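
    save_to_es uses url_object_id as the document id, so saving the same article twice overwrites the existing document instead of creating a duplicate. This section does not show how that id is produced. One common approach, assumed here, is to hash the URL; get_md5 is a hypothetical helper name and the URL is a placeholder.

```python
import hashlib


def get_md5(url):
    """Return a fixed-length hex digest usable as a stable document id."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


# The same URL always maps to the same id, whatever its length.
doc_id = get_md5("http://example.com/some-article/")
print(doc_id, len(doc_id))
```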

    Then add the following to pipelines.py:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-

    class ElasticsearchPipeline(object):
        """Write items to Elasticsearch."""

        def process_item(self, item, spider):
            # Each item converts itself into an ES document and saves it.
            item.save_to_es()
            return item
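
    The pipeline stays generic because each item knows how to save itself. The sketch below reproduces that pattern without Scrapy or Elasticsearch so it can be run anywhere. FakeArticleItem and the in-memory STORE are stand-ins invented for illustration and are not part of the project.

```python
STORE = {}  # stand-in for the Elasticsearch index


class FakeArticleItem(dict):
    """Dict-like item that knows how to persist itself."""

    def save_to_es(self):
        # Keyed by url_object_id, so a repeated save overwrites the old entry.
        STORE[self["url_object_id"]] = {"title": self["title"]}


class ElasticsearchPipeline:
    def process_item(self, item, spider):
        item.save_to_es()
        return item  # hand the item on to the next pipeline


pipeline = ElasticsearchPipeline()
item = FakeArticleItem(url_object_id="abc", title="first")
returned = pipeline.process_item(item, spider=None)
pipeline.process_item(
    FakeArticleItem(url_object_id="abc", title="second"), spider=None
)
print(STORE)
```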

    Finally, register the pipeline in settings.py:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # Enable the pipeline; the number is its priority.
    ITEM_PIPELINES = {
       'ArticleSpider.pipelines.ElasticsearchPipeline': 300,
    }
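
    The number is a priority: Scrapy runs pipelines in ascending order of this value. That matters here because save_to_es only copies front_image_path if an earlier pipeline has already set it. The sketch below sorts a settings dict the same way to show the resulting order. The ArticleImagePipeline entry is hypothetical and stands for whatever pipeline downloads the cover image.

```python
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ElasticsearchPipeline": 300,
    # Hypothetical entry: an image pipeline that fills in front_image_path.
    "ArticleSpider.pipelines.ArticleImagePipeline": 1,
}

# Lower values run first, so the image pipeline precedes the ES pipeline.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```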
  • Original article: https://www.cnblogs.com/lmx123/p/10025611.html