  • Scraping Elasticsearch content with Python

    Using the documents added to Elasticsearch in the previous post as an example, this post walks through querying them and extracting the useful fields.

    First, let's look at what the index currently contains:

    {
      "took": 88,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 1,
            "_source": {
              "first_name": "Jane",
              "last_name": "Smith",
              "age": 32,
              "about": "I like to collect rock albums",
              "interests": [
                "music"
              ]
            }
          },
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 1,
            "_source": {
              "first_name": "John",
              "last_name": "Smith",
              "age": 25,
              "about": "I love to go rock climbing",
              "interests": [
                "sports",
                "music"
              ]
            }
          },
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "3",
            "_score": 1,
            "_source": {
              "first_name": "Douglas",
              "last_name": "Fir",
              "age": 35,
              "about": "I like to build cabinets",
              "interests": [
                "forestry"
              ]
            }
          }
        ]
      }
    }

    1. In Python we need the urllib package for the HTTP request, plus json to parse the response:

    import urllib.request as request
    import json

    2. Next, build a request for the search endpoint and open it with urlopen:

    if __name__ == '__main__':
        req = request.Request("http://localhost:9200/megacorp/employee/_search")
        resp = request.urlopen(req)
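If the cluster is not running (or the URL has a typo), urlopen raises URLError; catching it gives a readable message instead of an unhandled traceback. A hedged sketch, using a deliberately dead port so it fails fast:

```python
import urllib.request as request
import urllib.error

# Port 9 is a deliberately dead port here, standing in for an
# unreachable Elasticsearch node; timeout keeps the failure quick.
failed = False
try:
    request.urlopen("http://localhost:9/", timeout=1)
except urllib.error.URLError as err:
    failed = True
    print("request failed:", err.reason)
```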

    3. The body of resp arrives as bytes, so decode it line by line into a string, then parse that string as JSON:

    jsonstr = ""
    for line in resp:
        jsonstr += line.decode()
    data = json.loads(jsonstr)
    print(data)
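As an aside, resp is a file-like object, so the line-by-line loop can be collapsed into a single json.load() call. A minimal sketch, using io.BytesIO to stand in for the real response so it runs without a live cluster:

```python
import io
import json

# io.BytesIO simulates the file-like object returned by urlopen().
fake_resp = io.BytesIO(b'{"took": 88, "timed_out": false}')

# json.load() reads and parses a file-like object directly.
data = json.load(fake_resp)
print(data)
```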

    4. The parsed response mixes metadata with the documents themselves. To get just the document contents, drill down through the nested keys level by level:

    employees = data['hits']['hits']

    for e in employees:
        _source = e['_source']
        full_name = _source['first_name'] + "." + _source['last_name']
        age = _source["age"]
        about = _source["about"]
        interests = _source["interests"]
        print(full_name, 'is', age)
        print(full_name, "info is", about)
        print(full_name, 'likes', interests)

    The output is:

    Jane.Smith is 32
    Jane.Smith info is I like to collect rock albums
    Jane.Smith likes ['music']
    
    John.Smith is 25
    John.Smith info is I love to go rock climbing
    John.Smith likes ['sports', 'music']
    
    Douglas.Fir is 35
    Douglas.Fir info is I like to build cabinets
    Douglas.Fir likes ['forestry']
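The drill-down logic can be exercised offline against a hand-built dict shaped like the response, which is handy when no cluster is running. A sketch with one trimmed hit:

```python
# A trimmed copy of one hit from the sample response above.
sample = {
    "hits": {
        "hits": [
            {"_source": {"first_name": "Jane", "last_name": "Smith",
                         "age": 32, "interests": ["music"]}},
        ]
    }
}

# Same drill-down as in the post: hits -> hits -> _source.
full_names = []
for e in sample["hits"]["hits"]:
    _source = e["_source"]
    full_names.append(_source["first_name"] + "." + _source["last_name"])
print(full_names)
```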

    To run an aggregation query, we can fetch the results as follows:

    1. Define the request URL:

    url="http://localhost:9200/megacorp/employee/_search"

    2. Write the aggregation query body: interests bucketed by term, with the average age computed per bucket.

    data = '''
    {
        "aggs": {
            "all_interests": {
                "terms": { "field": "interests" },
                "aggs": {
                    "avg_age": {
                        "avg": { "field": "age" }
                    }
                }
            }
        }
    }
    '''
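Since the body is a hand-written string, it can be worth round-tripping it through json.loads locally before sending; a stray comma would otherwise only surface as a 400 parse error from the server. A small sketch:

```python
import json

# The same aggregation body as above; json.loads raises
# json.JSONDecodeError on any syntax error, so this acts as a
# quick local lint before the request is sent.
query = '''
{
    "aggs": {
        "all_interests": {
            "terms": { "field": "interests" },
            "aggs": {
                "avg_age": { "avg": { "field": "age" } }
            }
        }
    }
}
'''
parsed = json.loads(query)
print(sorted(parsed["aggs"]["all_interests"]))
```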

    3. Set the Content-Type header, since we are sending a JSON body:

    headers={"Content-Type":"application/json"}

    4. As before, send the request, read the response, and parse it as JSON:

    # Elasticsearch accepts a _search request body on both GET and POST
    req = request.Request(url=url, data=data.encode(), headers=headers, method="GET")
    resp = request.urlopen(req)
    jsonstr = ""
    for line in resp:
        jsonstr += line.decode()
    rsdata = json.loads(jsonstr)

    5. The buckets inside the aggregation result are again an array, so iterate over them as well:

    agg = rsdata['aggregations']
    buckets = agg['all_interests']['buckets']

    for b in buckets:
        key = b['key']
        doc_count = b['doc_count']
        avg_age = b['avg_age']['value']
        print('interest:', key, '| count:', doc_count, '| avg age:', avg_age)

    The final output:

    interest: music | count: 2 | avg age: 28.5
    
    interest: forestry | count: 1 | avg age: 35.0
    
    interest: sports | count: 1 | avg age: 25.0
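To sanity-check those numbers, the same terms/avg aggregation can be recomputed in plain Python from the three sample employees shown at the top. A sketch with the relevant fields inlined:

```python
from collections import defaultdict

# Ages of the three sample employees, keyed later by interest.
employees = [
    {"age": 32, "interests": ["music"]},
    {"age": 25, "interests": ["sports", "music"]},
    {"age": 35, "interests": ["forestry"]},
]

# Group ages by interest (the "terms" step)...
ages_by_interest = defaultdict(list)
for emp in employees:
    for interest in emp["interests"]:
        ages_by_interest[interest].append(emp["age"])

# ...then average each bucket (the "avg" sub-aggregation).
for key, ages in ages_by_interest.items():
    print(key, len(ages), sum(ages) / len(ages))
```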
  • Original post: https://www.cnblogs.com/qianshuixianyu/p/9287556.html