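Using the documents we added to Elasticsearch in the previous post as an example, this post walks through fetching that content with Python and extracting the useful information from it.
First, let's look at what is stored in Elasticsearch: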
{
  "took": 88,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 1,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 1,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [ "sports", "music" ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "3",
        "_score": 1,
        "_source": {
          "first_name": "Douglas",
          "last_name": "Fir",
          "age": 35,
          "about": "I like to build cabinets",
          "interests": [ "forestry" ]
        }
      }
    ]
  }
}
1. In Python, we need the urllib package to send the request and the json module to parse the response, which comes back as JSON.
import urllib.request as request
import json
2. Next, build a request for the search URL and open it with urlopen:
if __name__ == '__main__':
    req = request.Request("http://localhost:9200/megacorp/employee/_search")
    resp = request.urlopen(req)
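3. The response resp is then read line by line and parsed as JSON. (Note that the parser is given a string here, so each line is decoded from bytes first.)
    jsonstr = ""
    for line in resp:
        jsonstr += line.decode()
    data = json.loads(jsonstr)
    print(data)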
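The line-by-line loop can also be collapsed into a single read. The following is a minimal sketch, assuming the response body is UTF-8 encoded JSON. `parse_json_response` is a hypothetical helper name, and `io.BytesIO` stands in for the object returned by `urlopen` so that the example runs without an Elasticsearch server.

```python
import io
import json


def parse_json_response(resp):
    """Read a binary file-like response and parse its body as JSON."""
    # resp.read() returns bytes; decode to str before parsing.
    return json.loads(resp.read().decode("utf-8"))


# io.BytesIO is a stand-in for the HTTP response (an assumption for
# illustration); the object returned by urlopen also exposes read().
fake_resp = io.BytesIO(b'{"hits": {"total": 3, "hits": []}}')
data = parse_json_response(fake_resp)
print(data["hits"]["total"])
```

Reading the whole body at once avoids building the string piece by piece, which is fine for small search responses like this one.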
4. The result, however, contains metadata as well as the documents themselves. Since we only want the document content, we need to drill down through each level of the structure:
    employees = data['hits']['hits']
    for e in employees:
        _source = e['_source']
        full_name = _source['first_name'] + "." + _source['last_name']
        age = _source["age"]
        about = _source["about"]
        interests = _source["interests"]
        print(full_name, 'is', age, ",")
        print(full_name, "info is", about)
        print(full_name, 'likes', interests)
The output is:
Jane.Smith is 32 ,
Jane.Smith info is I like to collect rock albums
Jane.Smith likes ['music']
John.Smith is 25 ,
John.Smith info is I love to go rock climbing
John.Smith likes ['sports', 'music']
Douglas.Fir is 35 ,
Douglas.Fir info is I like to build cabinets
Douglas.Fir likes ['forestry']
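The same extraction can be tried without a running server by working on a copy of the response. The sketch below is illustrative only: `sample` is a hand-trimmed copy of the search result shown at the top of this post, and `summarize_employees` is a hypothetical helper name. It uses `dict.get` for the optional fields, on the assumption that some documents might lack `about` or `interests`.

```python
# Hand-trimmed copy of the search response shown earlier (metadata removed).
sample = {
    "hits": {
        "total": 3,
        "hits": [
            {"_id": "2", "_source": {"first_name": "Jane", "last_name": "Smith", "age": 32,
                                     "about": "I like to collect rock albums",
                                     "interests": ["music"]}},
            {"_id": "1", "_source": {"first_name": "John", "last_name": "Smith", "age": 25,
                                     "about": "I love to go rock climbing",
                                     "interests": ["sports", "music"]}},
            {"_id": "3", "_source": {"first_name": "Douglas", "last_name": "Fir", "age": 35,
                                     "about": "I like to build cabinets",
                                     "interests": ["forestry"]}},
        ],
    }
}


def summarize_employees(data):
    """Return (full_name, age, about, interests) for every hit."""
    rows = []
    for hit in data["hits"]["hits"]:
        src = hit["_source"]
        full_name = src["first_name"] + "." + src["last_name"]
        # .get() tolerates documents where an optional field is missing.
        rows.append((full_name, src.get("age"), src.get("about", ""),
                     src.get("interests", [])))
    return rows


rows = summarize_employees(sample)
for full_name, age, about, interests in rows:
    print(full_name, "is", age)
```

Returning tuples instead of printing inside the loop keeps the parsing step separate from the presentation, so the rows can be reused or checked later.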
For content that needs to be aggregated, we can fetch it as follows:
1. Set the URL
url="http://localhost:9200/megacorp/employee/_search"
2. Write the aggregation query
data = '''
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" },
      "aggs": {
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}
'''
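As an alternative to a hand-written JSON string, the query can be built as a Python dictionary and serialized with `json.dumps`. This is a minimal sketch of that approach, not a change to the query itself; the names `query` and `body` are chosen here for illustration.

```python
import json

# The same aggregation as above, expressed as nested dictionaries:
# group by "interests", then compute the average "age" per group.
query = {
    "aggs": {
        "all_interests": {
            "terms": {"field": "interests"},
            "aggs": {
                "avg_age": {"avg": {"field": "age"}},
            },
        }
    }
}

# urllib expects the request body as bytes, so serialize and encode.
body = json.dumps(query).encode("utf-8")
print(body.decode("utf-8"))
```

Building the query this way lets Python catch unbalanced braces or quoting mistakes before the request is ever sent.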
3. Specify the request headers
headers={"Content-Type":"application/json"}
4. As before, send the request, read the response, and parse it as JSON
req = request.Request(url=url, data=data.encode(), headers=headers, method="GET")
resp = request.urlopen(req)
jsonstr = ""
for line in resp:
    jsonstr += line.decode()
rsdata = json.loads(jsonstr)
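The request construction can be wrapped in a small function so that it is easy to inspect before anything is sent. The sketch below is an illustration under assumptions: `build_search_request` is a hypothetical helper, and the empty `{"aggs": {}}` query is a placeholder. Creating a `Request` object does not contact the server; only `urlopen` does.

```python
import json
import urllib.request as request


def build_search_request(url, query, method="GET"):
    """Build (but do not send) a JSON search request."""
    body = json.dumps(query).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return request.Request(url=url, data=body, headers=headers, method=method)


req = build_search_request(
    "http://localhost:9200/megacorp/employee/_search",
    {"aggs": {}},  # placeholder query for illustration
)
print(req.get_method(), req.full_url)
```

The `method` parameter defaults to GET to match the code above. Elasticsearch also accepts POST for `_search`, which can be the safer choice when something between the client and the server does not pass along a body on GET requests.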
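5. The useful aggregation results are again stored as an array, so we iterate over it once more:
agg = rsdata['aggregations']
buckets = agg['all_interests']['buckets']
for b in buckets:
    key = b['key']
    doc_count = b['doc_count']
    avg_age = b['avg_age']['value']
    # The labels are pinyin: "interest <key>, <doc_count> people in total, their average age is <avg_age>"
    print('aihao', key, 'gongyou', doc_count, 'ren,tamenpingjuageshi', avg_age)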
The final output is:
aihao music gongyou 2 ren,tamenpingjuageshi 28.5
aihao forestry gongyou 1 ren,tamenpingjuageshi 35.0
aihao sports gongyou 1 ren,tamenpingjuageshi 25.0
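The bucket parsing can also be exercised offline. In the sketch below, `sample_response` is a hand-built dictionary reconstructed from the printed output above; it contains only the keys the loop reads, and a real response carries additional fields. `summarize_buckets` is a hypothetical helper name, and sorting by document count is an addition made here for illustration.

```python
# Hand-built sample containing only the keys that are actually read.
sample_response = {
    "aggregations": {
        "all_interests": {
            "buckets": [
                {"key": "music", "doc_count": 2, "avg_age": {"value": 28.5}},
                {"key": "forestry", "doc_count": 1, "avg_age": {"value": 35.0}},
                {"key": "sports", "doc_count": 1, "avg_age": {"value": 25.0}},
            ]
        }
    }
}


def summarize_buckets(rsdata, agg_name="all_interests"):
    """Return (interest, doc_count, avg_age) tuples, most common first."""
    buckets = rsdata["aggregations"][agg_name]["buckets"]
    rows = [(b["key"], b["doc_count"], b["avg_age"]["value"]) for b in buckets]
    # sorted() is stable, so buckets with equal counts keep their order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for interest, count, avg_age in summarize_buckets(sample_response):
    print(interest, count, avg_age)
```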