zoukankan      html  css  js  c++  java
  • 基于term vector深入探查数据

    1、term vector介绍
    
    获取document中的某个field内的各个term的统计信息
    
    term information: term frequency in the field, term positions, start and end offsets, term payloads
    
    term statistics: 设置term_statistics=true; total term frequency, 一个term在所有document中出现的频率; document frequency,有多少document包含这个term
    
    field statistics: document count,有多少document包含这个field; sum of document frequency,一个field中所有term的df之和; sum of total term frequency,一个field中的所有term的tf之和
    
    GET /twitter/tweet/1/_termvectors
    GET /twitter/tweet/1/_termvectors?fields=text
    
    term statistics和field statistics并不精准,不会被考虑有的doc可能被删除了
    
    我告诉大家,其实很少用,用的时候,一般来说,就是你需要对一些数据做探查的时候。比如说,你想要看到某个term,某个词条,大话西游,这个词条,在多少个document中出现了。或者说某个field,film_desc,电影的说明信息,有多少个doc包含了这个说明信息。
    
    2、index-iime term vector实验
    
    term vector,涉及了很多的term和field相关的统计信息,有两种方式可以采集到这个统计信息
    
    (1)index-time,你在mapping里配置一下,然后建立索引的时候,就直接给你生成这些term和field的统计信息了
    (2)query-time,你之前没有生成过任何的Term vector信息,然后在查看term vector的时候,直接就可以看到了,会on the fly,现场计算出各种统计信息,然后返回给你
    
    这一讲,不会手敲任何命令,直接copy我做好的命令,因为这一讲的重点,不是掌握什么搜索或者聚合的语法,而是说,掌握,如何采集term vector信息,然后如何看懂term vector信息,你能掌握利用term vector进行数据探查
    
    PUT /my_index
    {
    "mappings": {
    "my_type": {
    "properties": {
    "text": {
    "type": "text",
    "term_vector": "with_positions_offsets_payloads",
    "store" : true,
    "analyzer" : "fulltext_analyzer"
    },
    "fullname": {
    "type": "text",
    "analyzer" : "fulltext_analyzer"
    }
    }
    }
    },
    "settings" : {
    "index" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
    },
    "analysis": {
    "analyzer": {
    "fulltext_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
    "lowercase",
    "type_as_payload"
    ]
    }
    }
    }
    }
    }
    
    
    PUT /my_index/my_type/1
    {
    "fullname" : "Leo Li",
    "text" : "hello test test test "
    }
    
    PUT /my_index/my_type/2
    {
    "fullname" : "Leo Li",
    "text" : "other hello test ..."
    }
    
    GET /my_index/my_type/1/_termvectors
    {
    "fields" : ["text"],
    "offsets" : true,
    "payloads" : true,
    "positions" : true,
    "term_statistics" : true,
    "field_statistics" : true
    }
    
    {
    "_index": "my_index",
    "_type": "my_type",
    "_id": "1",
    "_version": 1,
    "found": true,
    "took": 10,
    "term_vectors": {
    "text": {
    "field_statistics": {
    "sum_doc_freq": 6,
    "doc_count": 2,
    "sum_ttf": 8
    },
    "terms": {
    "hello": {
    "doc_freq": 2,
    "ttf": 2,
    "term_freq": 1,
    "tokens": [
    {
    "position": 0,
    "start_offset": 0,
    "end_offset": 5,
    "payload": "d29yZA=="
    }
    ]
    },
    "test": {
    "doc_freq": 2,
    "ttf": 4,
    "term_freq": 3,
    "tokens": [
    {
    "position": 1,
    "start_offset": 6,
    "end_offset": 10,
    "payload": "d29yZA=="
    },
    {
    "position": 2,
    "start_offset": 11,
    "end_offset": 15,
    "payload": "d29yZA=="
    },
    {
    "position": 3,
    "start_offset": 16,
    "end_offset": 20,
    "payload": "d29yZA=="
    }
    ]
    }
    }
    }
    }
    }
    
    3、query-time term vector实验
    
    GET /my_index/my_type/1/_termvectors
    {
    "fields" : ["fullname"],
    "offsets" : true,
    "positions" : true,
    "term_statistics" : true,
    "field_statistics" : true
    }
    
    一般来说,如果条件允许,你就用query time的term vector就可以了,你要探查什么数据,现场去探查一下就好了
    
    4、手动指定doc的term vector
    
    GET /my_index/my_type/_termvectors
    {
    "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
    },
    "fields" : ["text"],
    "offsets" : true,
    "payloads" : true,
    "positions" : true,
    "term_statistics" : true,
    "field_statistics" : true
    }
    
    手动指定一个doc,实际上不是要指定doc,而是要指定你想要安插的词条,hello test,那么就可以放在一个field中
    
    将这些term分词,然后对每个term,都去计算它在现有的所有doc中的一些统计信息
    
    这个挺有用的,可以让你手动指定要探查的term的数据情况,你就可以指定探查“大话西游”这个词条的统计信息
    
    5、手动指定analyzer来生成term vector
    
    GET /my_index/my_type/_termvectors
    {
    "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
    },
    "fields" : ["text"],
    "offsets" : true,
    "payloads" : true,
    "positions" : true,
    "term_statistics" : true,
    "field_statistics" : true,
    "per_field_analyzer" : {
    "text": "standard"
    }
    }
    
    6、terms filter
    
    GET /my_index/my_type/_termvectors
    {
    "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
    },
    "fields" : ["text"],
    "offsets" : true,
    "payloads" : true,
    "positions" : true,
    "term_statistics" : true,
    "field_statistics" : true,
    "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
    }
    }
    
    这个就是说,根据term统计信息,过滤出你想要看到的term vector统计结果
    也挺有用的,比如你探查数据把,可以过滤掉一些出现频率过低的term,就不考虑了
    
    7、multi term vector
    
    GET _mtermvectors
    {
    "docs": [
    {
    "_index": "my_index",
    "_type": "my_type",
    "_id": "2",
    "term_statistics": true
    },
    {
    "_index": "my_index",
    "_type": "my_type",
    "_id": "1",
    "fields": [
    "text"
    ]
    }
    ]
    }
    
    GET /my_index/_mtermvectors
    {
    "docs": [
    {
    "_type": "test",
    "_id": "2",
    "fields": [
    "text"
    ],
    "term_statistics": true
    },
    {
    "_type": "test",
    "_id": "1"
    }
    ]
    }
    
    GET /my_index/my_type/_mtermvectors
    {
    "docs": [
    {
    "_id": "2",
    "fields": [
    "text"
    ],
    "term_statistics": true
    },
    {
    "_id": "1"
    }
    ]
    }
    
    GET /_mtermvectors
    {
    "docs": [
    {
    "_index": "my_index",
    "_type": "my_type",
    "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
    }
    },
    {
    "_index": "my_index",
    "_type": "my_type",
    "doc" : {
    "fullname" : "Leo Li",
    "text" : "other hello test ..."
    }
    }
    ]
    }
    
     
  • 相关阅读:
    python 读写XLS
    python去噪算法
    编译freetype 的dll
    python生成测试图片
    python 将pdf分页后插入至word中
    ie height
    Formview单文档或对话框项目接受不到按键消息的解决办法
    SQL SERVER配置[转]
    PyQt5 布局
    PyQt5 各种菜单实现
  • 原文地址:https://www.cnblogs.com/wangchuanfu/p/10990708.html
Copyright © 2011-2022 走看看