  • elasticsearch

    1. Install elasticsearch-rtf (a Chinese distribution of Elasticsearch that bundles the common Chinese-analysis plugins, convenient for beginners to learn and test with).

      Search for elasticsearch-rtf on https://github.com/ and download the latest release, then run elasticsearch.bat in the bin folder from a cmd window.

    2. Open 127.0.0.1:9200 in a browser; if output like the following appears, the installation succeeded:

    ------------------------------------

    {
      "name" : "ewadZmQ",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "-BfaRD5ETwuGxlEEPqJNqQ",
      "version" : {
        "number" : "5.1.1",
        "build_hash" : "5395e21",
        "build_date" : "2016-12-06T12:36:15.409Z",
        "build_snapshot" : false,
        "lucene_version" : "6.3.0"
      },
      "tagline" : "You Know, for Search"
    }
    ---------------------------------------
    3. Install the head plugin

     1) Search for elasticsearch-head on GitHub and download the first result.

     2) Install Node.js (http://nodejs.cn/download/). After installation, run node -v; a version number such as v6.10.3 means Node.js is installed. Then run npm -v; a version number such as 3.10.10 means npm is installed (Node.js ships with npm).

     3) Install cnpm (http://npm.taobao.org/); in cmd run: npm install -g cnpm --registry=https://registry.npm.taobao.org

     4) In cmd, cd into the elasticsearch-head directory, run cnpm install, and when it finishes run cnpm run start.

     5) Open http://localhost:9100 in a browser.

    The head page reports that it cannot connect to http://127.0.0.1:9200/. Why? By default Elasticsearch does not allow cross-origin (CORS) requests from third-party front-ends, so the head UI cannot reach it.

     Fix: append the following settings to the end of elasticsearch.yml in the config folder of elasticsearch-rtf:

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
    http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

    Restart Elasticsearch; the head page should now connect successfully.
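
    To double-check that CORS is now active, here is a minimal sketch using the Python requests package (an assumption; any HTTP client works). It sends a request with an Origin header, the same way the head UI does, and prints the CORS header Elasticsearch should echo back:

    # CORS sanity check (sketch; assumes `pip install requests`)
    import requests

    resp = requests.get(
        "http://127.0.0.1:9200/",
        headers={"Origin": "http://localhost:9100"},  # pretend to be the head UI
    )
    print(resp.status_code)
    # With http.cors.enabled: true and a matching allow-origin,
    # this should print "*"; if it prints None, head will still fail to connect.
    print(resp.headers.get("Access-Control-Allow-Origin"))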

    6) Download Kibana 5.1.1 (matching the Elasticsearch version, 5.1.1) from https://www.elastic.co/downloads/past-releases, run kibana.bat in its bin folder from cmd, then open http://127.0.0.1:5601/ to confirm it is working.

    7) Write the Scrapy-crawled data into Elasticsearch:
      a. In cmd, inside the project's virtual environment, install elasticsearch-dsl (a high-level Python interface to Elasticsearch, used here from Scrapy):
    pip install elasticsearch-dsl
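
    Optionally, a quick sketch to confirm the package can reach the local node (the hosts value matches the default used throughout this post):

    # connectivity check with elasticsearch-dsl (sketch)
    from elasticsearch_dsl.connections import connections

    # register a default connection to the local node started in step 1
    client = connections.create_connection(hosts=["localhost"])
    # info() returns the same JSON shown in step 2 (name, cluster_name, version, ...)
    print(client.info())
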
      b. Create a models folder and an es_types.py file inside it; define the field types and run the file once to create the index:
    from datetime import datetime
    from elasticsearch_dsl import DocType, Date, Nested, Boolean, analyzer, InnerObjectWrapper, Completion, Keyword, Text, Integer
    from elasticsearch_dsl.connections import connections
    connections.create_connection(hosts=['localhost'])
    class ArticleType(DocType):
        # article document type
        title = Text(analyzer="ik_max_word")
        create_date = Date()
        praise_nums = Integer()
        fav_nums = Integer()
        comment_nums = Integer()
        tags = Text(analyzer="ik_max_word")
        front_image_url = Keyword()
        url_object_id = Keyword()
        front_image_path = Keyword()
        url = Keyword()
        content = Text(analyzer="ik_max_word")
    
        class Meta:
            index = 'jobbole'
            doc_type = 'article'
    if __name__ == '__main__':
        ArticleType.init()
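
    After running es_types.py once, the jobbole index should exist. A small sketch to confirm it, using the official elasticsearch Python client that elasticsearch-dsl depends on:

    # verify that ArticleType.init() created the index and mapping (sketch)
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost"])
    print(es.indices.exists(index="jobbole"))       # True if the index was created
    print(es.indices.get_mapping(index="jobbole"))  # field types defined in ArticleType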

      c. Define a pipeline class in pipelines.py:
    from w3lib.html import remove_tags
    from models.es_types import ArticleType

    class ElasticsearchPipeline(object):
        # write items into Elasticsearch
        def process_item(self, item, spider):
            # convert the item into an Elasticsearch document
            article = ArticleType()
            article.title = item['title']
            article.create_date = item['create_date']
            article.content = remove_tags(item['content'])  # remove_tags() strips HTML tags
            article.front_image_url = item['front_image_url']
            article.front_image_path = item['front_image_path']
            article.praise_nums = item['praise_nums']
            article.fav_nums = item['fav_nums']
            article.comment_nums = item['comment_nums']
            article.url = item['url']
            article.tags = item['tags']
            article.meta.id = item['url_object_id']
    
            article.save()  # save the document
            return item
    
    

      d. Register the ElasticsearchPipeline class from pipelines.py in settings.py:

    ITEM_PIPELINES = {'spider.pipelines.ElasticsearchPipeline': 1}
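
    If the project has other pipelines (for example an image-download or MySQL pipeline), they can coexist in the same dict; lower numbers run first. The extra class name below is only a placeholder, not part of this post:

    # settings.py -- pipelines run in ascending order of their priority number
    ITEM_PIPELINES = {
        'spider.pipelines.SomeImagePipeline': 1,      # hypothetical, runs first
        'spider.pipelines.ElasticsearchPipeline': 2,  # then write to Elasticsearch
    }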

      e. Run the Scrapy spider; if the documents show up in the Data Browser at http://127.0.0.1:9100/, the setup works.
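
    Besides the head UI, the data can also be checked from Python through the ArticleType document defined in es_types.py (a minimal sketch; the search term is just an example):

    # query the jobbole index through the DocType defined earlier (sketch)
    from models.es_types import ArticleType

    s = ArticleType.search().query("match", title="python")  # example search term
    response = s.execute()
    print(response.hits.total)   # number of matching articles
    for hit in response:
        print(hit.title)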

     Optimization: so that different spiders can reuse the same pipeline class, move the Elasticsearch-saving logic from the pipeline into the corresponding item class in items.py:

    import scrapy
    from scrapy.loader.processors import MapCompose, Join
    from w3lib.html import remove_tags
    from models.es_types import ArticleType
    # date_convert, number_convert, remove_comment_tags, returnValue and get_md5
    # are helper functions defined elsewhere in the project (not shown here)

    class JobboleArticleItem(scrapy.Item):
        title = scrapy.Field()
        create_date = scrapy.Field(input_processor=MapCompose(date_convert))
        praise_nums = scrapy.Field(input_processor=MapCompose(number_convert))
        fav_nums = scrapy.Field(input_processor=MapCompose(number_convert))
        comment_nums = scrapy.Field(input_processor=MapCompose(number_convert))
        tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags), output_processor=Join(','))
        front_image_url = scrapy.Field(output_processor=MapCompose(returnValue))
        url_object_id = scrapy.Field(input_processor=MapCompose(get_md5))
        front_image_path = scrapy.Field()
        url = scrapy.Field()
        content = scrapy.Field()
    
        def get_insert_mysql(self):
            # build the MySQL insert statement and its parameters
            insert_sql = """
                        insert into jobbole(front_image_url,front_image_path,title,url,create_date,url_object_id,fav_nums,comment_nums,praise_nums,tags,content)
                        values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                        ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums),comment_nums=VALUES(comment_nums),praise_nums=VALUES(praise_nums)
                        """
            params = (self['front_image_url'][0], self['front_image_path'], self['title'], self['url'], self['create_date'],
                      self['url_object_id'], self['fav_nums'], self['comment_nums'], self['praise_nums'], self['tags'],
                      self['content'])
            return insert_sql, params
    
        def save_to_elasticsearch(self):
            # write this item into Elasticsearch
            article = ArticleType()
            article.title = self['title']
            article.create_date = self['create_date']
            article.content = remove_tags(self['content'])  # remove_tags() strips HTML tags
            article.front_image_url = self['front_image_url']
            if 'front_image_path' in self:
                article.front_image_path = self['front_image_path']
            article.praise_nums = self['praise_nums']
            article.fav_nums = self['fav_nums']
            article.comment_nums = self['comment_nums']
            article.url = self['url']
            article.tags = self['tags']
            article.meta.id = self['url_object_id']
    
            article.save()  # save the document
            return        

    Then, in pipelines.py, have the pipeline class simply call save_to_elasticsearch():

    class ElasticsearchPipeline(object):
        # write items into Elasticsearch
        def process_item(self, item, spider):
            # the item converts and saves itself
            item.save_to_elasticsearch()
            return item
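
    For completeness: the get_insert_mysql() method defined on the item above would normally be consumed by a separate, asynchronous MySQL pipeline. A rough sketch using Twisted's adbapi follows; the class name and connection settings are placeholders, and it assumes the MySQLdb (mysqlclient) driver is installed:

    # sketch of an asynchronous MySQL pipeline that consumes item.get_insert_mysql()
    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi

    class MysqlTwistedPipeline(object):
        def __init__(self):
            # placeholder connection settings -- adjust to your own database
            self.dbpool = adbapi.ConnectionPool(
                "MySQLdb",
                host="127.0.0.1", db="article_spider", user="root", passwd="root",
                charset="utf8", cursorclass=MySQLdb.cursors.DictCursor, use_unicode=True,
            )

        def process_item(self, item, spider):
            # run the insert on a thread pool so crawling is not blocked
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error)
            return item

        def do_insert(self, cursor, item):
            insert_sql, params = item.get_insert_mysql()
            cursor.execute(insert_sql, params)

        def handle_error(self, failure):
            print(failure)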
     
  • Original post: https://www.cnblogs.com/jp-mao/p/6933480.html