Search Engine: Zend_Lucene

Zend Lucene

1. General

Zend_Search_Lucene is a general-purpose text search engine written entirely in PHP 5. It stores its index on the filesystem and does not require a database server.

2. How to install Zend Lucene

- Download site: http://www.zend.com/community/downloads

- Zend Framework version: Zend Framework 1.9 minimal

Download Zend Framework 1.9 minimal from the download site.

Remove everything from the Zend folder except the following files and directories:

- Exception.php

- Loader/

- Loader.php

- Search/
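
The examples in this article load the library with require_once 'Zend/Search/Lucene.php', which is resolved against PHP's include path. A minimal sketch of putting the trimmed Zend folder on that path, assuming a hypothetical layout in which it lives in a library/ directory next to the script:

```php
<?php
// Hypothetical layout (an assumption, not from the article):
//   library/Zend/Search/...
//   createindex.php
$libraryDir = dirname(__FILE__) . '/library';

// Prepend the directory so that 'Zend/...' paths are looked up there first.
set_include_path($libraryDir . PATH_SEPARATOR . get_include_path());

echo get_include_path();
```

If the Zend folder already sits in a directory that is on the include path, this step is not needed.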

3. How to create an index

An example of creating an index is shown below:

<?php
// File name: createindex.php
require_once 'Zend/Search/Lucene.php';

// Sample product data; 'lan' holds the language code of each record.
$productsData = array(
    0=>array("PID"=>1,"url"=>"http://www.cybozu.jp","productName"=>"garoon","Description"=>"garoon Description","lan"=>"en"),
    1=>array("PID"=>2,"url"=>"http://www.cybozu.jp","productName"=>"share360","Description"=>"share360 Description","lan"=>"en"),
    2=>array("PID"=>3,"url"=>"http://www.cybozu.jp a","productName"=>"日本語の製品名前","Description"=>"日本語の製品","lan"=>"jp"),
    3=>array("PID"=>4,"url"=>"http://www.cybozu.jp a","productName"=>"中文产品名","Description"=>"中文产品描述","lan"=>"zh")
);

// Create a new index in the 'index' directory (true = create it).
$index = new Zend_Search_Lucene('index', true);

foreach ($productsData as $productData)
{
    // Build a fresh document for each product so that fields from
    // earlier products do not pile up on the same document.
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productData['PID'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $productData['url'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::Text('productName', $productData['productName'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::Text('Description', $productData['Description'], 'UTF-8'));
    // Stored but not indexed: returned with hits, not searchable.
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productData['lan'], 'UTF-8'));
    $index->addDocument($doc);
}

// Commit and optimize once, after all documents have been added.
$index->commit();
$index->optimize();

echo 'index has been created!';

In the KB project the index data comes from a database. Using the method above, we can index all of the text stored in the database.

4. Searching the index

After creating an index, we can search it as shown below:

<?php
// File name: search.php
require_once('Zend/Search/Lucene.php');

// Open the existing index.
$index = new Zend_Search_Lucene('index');
$keywords = 'garoon';

echo "Index contains {$index->count()} documents.\n";

// Parse the keywords into a query and run it.
$query = Zend_Search_Lucene_Search_QueryParser::parse($keywords, 'utf-8');
$hits = $index->find($query);

// Print the stored fields of every hit.
foreach ($hits as $hit)
{
    echo 'PID: '.$hit->PID.' ';
    echo 'Score: '.$hit->score.' ';
    echo 'url: '.$hit->url.' ';
    echo 'productName: '.$hit->productName.' ';
    echo 'lan: '.$hit->lan.' ';
}

To search text in multiple languages, we can read the lan value of each hit and display the results separately for each language.
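
Grouping by lan could look like the sketch below. The helper name groupHitsByLanguage is hypothetical, and the stand-in objects exist only so the code runs without an index; in search.php the array returned by $index->find($query) would be passed instead.

```php
<?php
// Group hits by their stored 'lan' value.
// $hits is any list of objects that expose a 'lan' property.
function groupHitsByLanguage($hits)
{
    $grouped = array();
    foreach ($hits as $hit) {
        $grouped[$hit->lan][] = $hit;
    }
    return $grouped;
}

// Stand-in data (hypothetical), shaped like the fields stored in the index.
$hits = array(
    (object) array('PID' => 1, 'productName' => 'garoon', 'lan' => 'en'),
    (object) array('PID' => 3, 'productName' => '日本語の製品名前', 'lan' => 'jp'),
    (object) array('PID' => 2, 'productName' => 'share360', 'lan' => 'en'),
);

// Print one line per language with the number of hits found for it.
foreach (groupHitsByLanguage($hits) as $lan => $hitsForLan) {
    echo $lan . ': ' . count($hitsForLan) . " hit(s)\n";
}
```

Within each group the hits keep the order in which they were returned, so the ranking by score is preserved per language.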

5. Deleting and updating the index

To update the index, we must first find the document by its keyword and delete it. After the old document is deleted, we can add a new one. The example below deletes the product with PID 1 and adds it again with an updated description.

<?php
require_once('Zend/Search/Lucene.php');
require_once('Zend/Search/Lucene/Index/Term.php');
require_once('Zend/Search/Lucene/Search/Query/Term.php');

$index = new Zend_Search_Lucene('index');

// New product data to update
$productNewData = array("PID"=>1,"url"=>"http://www.cybozu.jp","productName"=>"garoon","Description"=>"update garoon Description","lan"=>"en");

// Find the old document by its PID keyword. A term query matches the keyword
// exactly and is not passed through the analyzer, which may drop the digits
// from a parsed query string such as "PID:1".
$term = new Zend_Search_Lucene_Index_Term((string) $productNewData['PID'], 'PID');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits = $index->find($query);

// Delete PID:1
foreach ($hits as $hit)
{
    echo 'PID: '.$hit->PID.' has been deleted ';
    $index->delete($hit->id);
}
$index->commit();

// Add the new product data to the index
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productNewData['PID'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::Text('url', $productNewData['url'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::Text('productName', $productNewData['productName'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::Text('Description', $productNewData['Description'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productNewData['lan'], 'UTF-8'));
$index->addDocument($doc);

$index->commit();
$index->optimize();

6. How to search Japanese or Chinese text with Lucene

By default, Lucene can only search English text. In this project, however, we must search English, Japanese and Chinese text, so we have to replace Lucene's default analyzer.

The following class extends Lucene's common analyzer:

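<?php
// File name: chinese.php
require_once 'Zend/Search/Lucene/Analysis/Analyzer.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common.php';

/**
 * Analyzer that returns runs of ASCII letters and digits as whole tokens and
 * multi-byte (Chinese/Japanese) text as overlapping two-character tokens.
 * The input is assumed to be UTF-8.
 */
class CN_Lucene_Analyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common
{
    // Current byte offset into $this->_input
    private $_position;

    // Words that are replaced by a space before tokenizing
    private $_cnStopWords = array( );

    public function setCnStopWords( $cnStopWords )
    {
        $this->_cnStopWords = $cnStopWords;
    }

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;

        $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "'", "<", ">", "$", "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", "：", "）", "（", "．", "。", "，", "！", "；", "“", "”", "‘", "’", "［", "］", "、", "—", "　", "《", "》", "－", "…", "【", "】", "？", "￥" );

        // Punctuation is removed outright, so the text on either side of a
        // punctuation mark is joined together.
        $this->_input = str_replace( $search, '', $this->_input );

        // Stop words are replaced by a space.
        $this->_input = str_replace( $this->_cnStopWords, ' ', $this->_input );
    }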

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null)
        {
            return null;
        }

        $len = strlen($this->_input);

        while ($this->_position < $len)
        {
            // Skip spaces at the beginning
            while ($this->_position < $len && $this->_input[$this->_position] == ' ')
            {
                $this->_position++;
            }

            // Only spaces were left: end of stream
            if ($this->_position >= $len)
            {
                break;
            }

            $termStartPosition = $this->_position;
            $temp_char = $this->_input[$this->_position];
            $isCnWord = false;

            if (ord($temp_char) > 127)
            {
                // Multi-byte text: take two characters as one token.
                // Every character is assumed to be 3 bytes long, which holds
                // for most Chinese and Japanese characters in UTF-8.
                $i = 0;
                while ($this->_position < $len && ord($this->_input[$this->_position]) > 127)
                {
                    $this->_position = $this->_position + 3;
                    $i++;
                    if ($i == 2)
                    {
                        $isCnWord = true;
                        break;
                    }
                }

                // A single multi-byte character on its own is not a token
                if ($i == 1) continue;
            }
            else
            {
                // ASCII text: consume a run of letters and digits
                while ($this->_position < $len && ctype_alnum($this->_input[$this->_position]))
                {
                    $this->_position++;
                }
            }

            // Nothing was consumed: skip this byte
            if ($this->_position == $termStartPosition)
            {
                $this->_position++;
                continue;
            }

            $tmp_str = substr($this->_input, $termStartPosition, $this->_position - $termStartPosition);
            $token = new Zend_Search_Lucene_Analysis_Token($tmp_str, $termStartPosition, $this->_position);
            $token = $this->normalize($token);

            // Step back one character so that the next token overlaps this one
            if ($isCnWord)
            {
                $this->_position = $this->_position - 3;
            }

            if ($token !== null)
            {
                return $token;
            }
        }

        return null;
    }
}

With the help of chinese.php we can search Japanese and Chinese text in the KB. We must also add the code below before creating an index and before searching.

require_once 'chinese.php';

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new CN_Lucene_Analyzer());
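
To see what the analyzer does with Chinese or Japanese text, the sketch below reproduces only its two-character splitting, without Zend. The function name cjkBigrams is hypothetical, and unlike chinese.php it works on whole UTF-8 characters instead of assuming 3 bytes per character; ASCII words, punctuation and stop words are left out.

```php
<?php
// Split UTF-8 text into overlapping two-character tokens for every run of
// consecutive multi-byte characters. ASCII characters only end a run.
function cjkBigrams($text)
{
    // Split into whole UTF-8 characters, not bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // input was not valid UTF-8
    }

    $tokens = array();
    $run = array();   // current run of multi-byte characters
    $chars[] = ' ';   // sentinel so that the last run is flushed

    foreach ($chars as $char) {
        if (strlen($char) > 1) {
            $run[] = $char;
            continue;
        }
        // A single-byte character ends the run: emit its overlapping pairs.
        for ($i = 0; $i + 1 < count($run); $i++) {
            $tokens[] = $run[$i] . $run[$i + 1];
        }
        $run = array();
    }

    return $tokens;
}

print_r(cjkBigrams('中文产品名'));
```

Because the index and the query are split the same way, a query for 产品 matches any document whose text contains those two characters side by side. A single Chinese or Japanese character on its own produces no token and therefore cannot be searched.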

7. Does Zend Lucene need downtime?

With Zend Lucene we don't need any downtime. When a new article is added, we can add it to the index at the same time. If an article is edited, we need to delete the old document and add the new one to the index.

Original article: https://www.cnblogs.com/likwo/p/1591319.html