
    Notes from Exploring ElasticSearch

    Elasticsearch is a server for indexing and searching text, and its installation is very simple.

    Elasticsearch is a standalone Java app and can be easily started from the command line. A copy can be obtained from the elasticsearch download page.

    Microsoft Windows:

    Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double click elasticsearch.bat to run.

    If the server starts successfully, you'll see output in the terminal like this:

    [2015-02-04 20:43:12,747][INFO ][node              ] [Joe Fixit] started

    P.S.: There's a problem you may run into. If the terminal prints messages like these:

    [2014-12-17 09:31:03,820][WARN ][cluster.routing.allocation.decider]
    [logstash test] high disk watermark [10%] exceeded on
    [7drCr113QgSM8wcjNss_Mg][Blur] free: 632.3mb[8.4%], shards will be
    relocated away from this node

    [2014-12-17 09:31:03,820][INFO ][cluster.routing.allocation.decider]
    [logstash test] high disk watermark exceeded on one or more nodes,
    rerouting shards

    It just means there isn't enough free space on your current disk, so you only need to delete some files to free up space.
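
    Alternatively, if freeing disk space isn't convenient, the high watermark threshold can be raised at runtime through the cluster settings API. A minimal sketch for elasticsearch 1.x (the 95% value is only an example):

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
    {"transient" : {"cluster.routing.allocation.disk.watermark.high" : "95%"}}'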

    After you've started your server, you can make sure it's running properly by opening http://localhost:9200 in your browser. You should see a page like this:

    {
      "status" : 200,
      "name" : "Joe Fixit",
      "cluster_name" : "elasticsearch",
      "version" : {
        "number" : "1.4.2",
        "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
        "build_timestamp" : "2014-12-16T14:11:12Z",
        "build_snapshot" : false,
        "lucene_version" : "4.10.2"
      },
      "tagline" : "You Know, for Search"
    }

    Since you're free to use any tool you wish to query elasticsearch, you can install curl (on Windows, via Cygwin) and query it from the command line.
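
    For instance, the same status page shown above can be fetched from the command line (assuming the server runs on the default port):

    curl -XGET 'http://localhost:9200/?pretty'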

    But if you're reading the book Exploring ElasticSearch, you'd better install the tool made by the author: elastic-hammer. You can find detailed information on GitHub: https://github.com/andrewvc/elastic-hammer. It's very easy to install as a site plugin:
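
    # run from the elasticsearch installation folder
    # (the exact flag may vary by version; check the project's README)
    bin/plugin --install andrewvc/elastic-hammer

    Once installed, the UI should be reachable at http://localhost:9200/_plugin/elastic-hammer/.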

    Modeling Data

    field:       the smallest individual unit of data.

    document:    a collection of fields; documents comprise the base unit of storage in elasticsearch.

    The primary data-format elasticsearch uses is JSON. A sample document:

    {
    	"_id" : 1,
    	"handle" : "ron",
    	"hobbies" : ["hacking", "the great outdoors"],
    	"computer" : {"cpu" : "pentium pro", "mhz" : 200}
    }
    

    The user-defined type is analogous to a database schema. Types are defined with the Mapping API:

    {
    	"user" : {
    		"properties" : {
    			"handle" : {"type" : "string"},
    			"age" : {"type" : "integer"},
    			"hobbies" : {"type" : "string"},
    			"computer" : {
    				"properties" : {
    					"cpu" : {"type" : string},
    					"speed" : {"type" : "integer"}
    				}
    			}
    		}
    	}
    }
    

    Basic CRUD

    The full CRUD lifecycle in elasticsearch is Create, Read, Update, Delete. We'll create an index, then a type, and finally a document within that index using that type. The URL scheme is consistent across these operations: most URLs have the form /index/type/docid, and special operations on a given namespace are prefixed with an underscore.

    // create an index named 'planet'
    PUT /planet
    
    
    // create a type called 'hacker'
    PUT /planet/hacker/_mapping
    {
    	"hacker" : {
    		"properties" : {
    			"handle" : {"type" : "string"},
    			"age" : {"type" : "long"}
    		}
    	}
    }
    
    
    // create a document
    PUT /planet/hacker/1
    {"handle" : "jean-michea", "age" : 18}
    
    
    // retrieve the document
    GET /planet/hacker/1
    
    
    // update the document's age field
    POST /planet/hacker/1/_update
    {"doc" : {"age" : 19}}
    
    
    // delete the document
    DELETE /planet/hacker/1
    
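
    If you prefer curl over elastic-hammer, the same lifecycle maps directly onto HTTP requests; a sketch against a local server on the default port:

    # create the index, then the document
    curl -XPUT 'http://localhost:9200/planet'
    curl -XPUT 'http://localhost:9200/planet/hacker/1' -d '{"handle" : "jean-michea", "age" : 18}'

    # read, update, delete
    curl -XGET 'http://localhost:9200/planet/hacker/1'
    curl -XPOST 'http://localhost:9200/planet/hacker/1/_update' -d '{"doc" : {"age" : 19}}'
    curl -XDELETE 'http://localhost:9200/planet/hacker/1'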

    Search Data

    First, create our schema:

    // Delete the document
    DELETE /planet/hacker/1
    
    
    // Delete any existing indexes named planet
    DELETE /planet
    
    // Create our index
    PUT /planet/
    {
    	"mappings" : {
    		"hacker" : {
    			"properties" : {
    				"handle" : {"type" : "string"},
    				"hobbies" : {"type" : "string", "analyzer" : "snowball"}
    			}
    		}
    	}
    }
    

    Then, seed some data using the hacker_planet.eloader dataset.

    The data repository can be found at http://github.com/andrewvc/ee-datasets. After cloning the repository, you can load examples into your server by executing the included elastic-loader.jar program, providing the address of your elasticsearch server and the path to the data file. For example, to load the hacker_planet dataset, open a command prompt in the ee-datasets folder and run:

    java -jar elastic-loader.jar http://localhost:9200 datasets/hacker_planet.eloader
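
    To verify the documents were loaded, you can ask the _count API how many hackers are in the index; it should report the three users from the dataset:

    curl -XGET 'http://localhost:9200/planet/hacker/_count'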

    Finally, we can perform our search:

    // Do the search
    POST /planet/hacker/_search
    {
    	"query" : {
    		"match" : {
    			"hobbies" : "rollerblading"
    		}
    	}
    }
    

    The above code performs a search for those who like rollerblading among the 3 users we've created in the database.
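
    Because the hobbies field is analyzed with snowball, stemmed variants of the query text match as well; for example, this variant (an illustration, not from the book) should find the same rollerblading fan, since both "rollerblading" and "rollerblade" reduce to the stem "rollerblad":

    POST /planet/hacker/_search
    {
    	"query" : {
    		"match" : {
    			"hobbies" : "rollerblade"
    		}
    	}
    }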

    Searches in elasticsearch are handled by the aptly named search API. The search API is provided by the _search endpoint.

    • index search:            /myidx/_search
    • document type search:    /myidx/mytype/_search

    For example:

    // index search
    POST /planet/_search
    ...
    
    // document type search
    POST /planet/hacker/_search
    ...
    

    The skeleton of a complex search

    // Load Dataset: hacker_planet.eloader
    POST /planet/_search
    {
    	"from" : 0,
    	"size" : 15,
    	"query" : {"match_all" : {}},
    	"sort" : {"handle" : "desc"},
    	"filter" : {"term" : {"_all" : "coding"}},
    	"facet" : {
    		"hobbies" : {
    			"term" : {
    				"field" : "hobbies"
    			}
    		}
    	}
    }
    

       

    All elasticsearch queries boil down to the tasks of

    1. restricting the result set
    2. scoring (the default scoring algorithm is the TF/IDF implementation in Lucene's TFIDFSimilarity class)
    3. sorting

    Text Analysis

    Elasticsearch has a toolbox with which we can slice and dice words so that they can be searched efficiently. Utilizing these tools we can narrow our search space and find common ground between linguistically similar terms.

    The Snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root.

    The process by which documents are analyzed is as follows:

    1. A document update or create is received via a PUT or POST.
    2. The field values in the document are each run through an analyzer which converts each value to zero, one, or more indexable tokens.
    3. The tokenized values are stored in an index, pointing back to the full version of the document.

    The easiest way to see analysis in action is with the Analyze API:

    GET /_analyze?analyzer=snowball&text=candles%20candle&pretty=true
    
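
    Both forms reduce to the same stem, so the response should contain two identical tokens, roughly like this (offsets elided; exact output can vary by version):

    {
      "tokens" : [
        {"token" : "candl", "position" : 1, ...},
        {"token" : "candl", "position" : 2, ...}
      ]
    }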

    An analyzer is really a three-stage pipeline comprised of the following execution steps:

    1. Character Filtering    Turns the input string into a different string
    2. Tokenization           Turns the char-filtered string into an array of tokens
    3. Token Filtering        Post-processes the filtered tokens into a mutated token array

    Let's dive in by building a custom analyzer for tokenizing CSV data. Custom analyzers can be stored at the index level, either during or after index creation. Let's:

    1. create a "recipes" index
    2. close it
    3. update the analysis settings
    4. reopen it (in order to experiment with a custom analyzer)

    // Create the index
    PUT /recipes
    
    // Close the index for settings update
    POST /recipes/_close
    
    // Create the analyzer
    PUT /recipes/_settings
    {
    	"index" : {
    		"analysis" : {
    			"tokenizer" : {
    				"comma" : {"type" : "pattern", "pattern" : ","}
    			},
    			"analyzer" : {
    				"recipe_csv" : {
    					"type" : "custom",
    					"tokenizer" : "comma",
    					"filter" : ["trim", "lowercase"]
    				}
    			}
    		}
    	}
    }
    
    // Reopen the index
    POST /recipes/_open
    
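
    With the index reopened, the per-index Analyze API can be used to confirm the tokenizer behaves as expected; the sample text here is made up:

    // Test the analyzer: should yield the tokens "fried eggs", "bacon" and "toast"
    GET /recipes/_analyze?analyzer=recipe_csv&text=Fried%20Eggs,Bacon,%20Toast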

    Faceting

    Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. We'll create a database of movies and return facets based on the movies' genres alongside standard query results. As usual, we need to load the movie_db.eloader dataset into the elasticsearch server.

    Simple movie mapping:

    // Load Dataset: movie_db.eloader
    GET /movie_db/movie/_mapping?pretty=true
    {
    	"movie" : {
    		"properties" : {
    			"actors" : {"type" : "string", "analyzer" : "standard", "position_offset_gap" : 100},
    			"genre" : {"type" : "string", "index" : "not_analyzed"},
    			"release_year" : {"type" : "integer", "index" : "not_analyzed"},
    			"title" : {"type" : "string", "analyzer" : "snowball"},
    			"description" : {"type" : "string", "analyzer" : "snowball"} 
    		}
    	}
    }
    

    Simple terms faceting:

    // Load Dataset: movie_db.eloader
    POST /movie_db/_search
    {
    	"query" : {"match" : {"description" : "hacking"}},
    	"facets" : {
    		"genre" : {
    			"terms" : {"field" : "genre"},
    			"size" : 10
    		}
    	}
    }
    

    This query searches for movies whose description contains "hacking". Alongside the regular hits, the response will include facets showing which genres the matching movies belong to and how many matching films fall into each genre.
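
    The facet portion of the response comes back next to the hits and looks roughly like this (the genre names and counts here are only illustrative):

    {
    	"facets" : {
    		"genre" : {
    			"_type" : "terms",
    			"missing" : 0,
    			"total" : 2,
    			"other" : 0,
    			"terms" : [
    				{"term" : "sci-fi", "count" : 1},
    				{"term" : "thriller", "count" : 1}
    			]
    		}
    	}
    }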
