zoukankan html css js c++ java

使用Lucene索引和检索POI数据

1、简介

关于空间数据搜索，以前写过《使用Solr进行空间搜索》这篇文章，是基于Solr的GIS数据的索引和检索。

Solr和ElasticSearch这两者都是基于Lucene实现的，两者都可以进行空间搜索（Spatial Search），在有些场景，我们需要把Lucene嵌入到已有的系统提供数据索引和检索的功能，这篇文章介绍下用Lucene如何索引带有经纬度的POI信息并进行检索。

2、环境数据

Lucene版本：5.3.1

POI数据库：Base_Station测试数据，每条数据主要是ID，经纬度和地址。

3、实现

基本变量定义，这里对“地址”信息进行了分词，分词使用了Lucene自带的smartcnSmartChineseAnalyzer。

    private String indexPath = "D:/IndexPoiData";
    private IndexWriter indexWriter = null;
    private SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(true);

    private IndexSearcher indexSearcher = null;

    // Field Name
    private static final String IDFieldName = "id";
    private static final String AddressFieldName = "address";
    private static final String LatFieldName = "lat";
    private static final String LngFieldName = "lng";
    private static final String GeoFieldName = "geoField";
    
    // Spatial index and search
    private SpatialContext ctx;
    private SpatialStrategy strategy;

    public PoiIndexService() throws IOException {
        init();
    }

    public PoiIndexService(String indexPath) throws IOException {
        this.indexPath = indexPath;
        init();
    }
    
    protected void init() throws IOException {
        Directory directory = new SimpleFSDirectory(Paths.get(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        indexWriter = new IndexWriter(directory, config);

        DirectoryReader ireader = DirectoryReader.open(directory);
        indexSearcher = new IndexSearcher(ireader);

        // Typical geospatial context
        // These can also be constructed from SpatialContextFactory
        ctx = SpatialContext.GEO;

        int maxLevels = 11; // results in sub-meter precision for geohash
        // This can also be constructed from SpatialPrefixTreeFactory
        SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);

        strategy = new RecursivePrefixTreeStrategy(grid, GeoFieldName);
    }

索引数据

    public boolean indexPoiDataList(List<PoiData> dataList) {
        try {
            if (dataList != null && dataList.size() > 0) {
                List<Document> docs = new ArrayList<>();
                for (PoiData data : dataList) {
                    Document doc = new Document();
                    doc.add(new LongField(IDFieldName, data.getId(), Field.Store.YES));
                    doc.add(new DoubleField(LatFieldName, data.getLat(), Field.Store.YES));
                    doc.add(new DoubleField(LngFieldName, data.getLng(), Field.Store.YES));
                    doc.add(new TextField(AddressFieldName, data.getAddress(), Field.Store.YES));
                    Point point = ctx.makePoint(data.getLng(),data.getLat());
                    for (Field f : strategy.createIndexableFields(point)) {
                        doc.add(f);
                    }
                    docs.add(doc);
                }
                indexWriter.addDocuments(docs);
                indexWriter.commit();
                return true;
            }
            return false;
        } catch (Exception e) {
            log.error(e.toString());
            return false;
        }
    }

这里的PoiData是个普通的POJO。

检索圆形范围内的数据，按距离从近到远排序：

    public List<PoiData> searchPoiInCircle(double lng, double lat, double radius){
        List<PoiData> results= new ArrayList<>();
        Shape circle = ctx.makeCircle(lng, lat, DistanceUtils.dist2Degrees(radius, DistanceUtils.EARTH_MEAN_RADIUS_KM));
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, circle);
        Query query = strategy.makeQuery(args);
        Point pt = ctx.makePoint(lng, lat);
        ValueSource valueSource = strategy.makeDistanceValueSource(pt, DistanceUtils.DEG_TO_KM);//the distance (in km)
        Sort distSort = null;
        TopDocs docs = null;
        try {
            //false = asc dist
            distSort = new Sort(valueSource.getSortField(false)).rewrite(indexSearcher);
            docs = indexSearcher.search(query, 10, distSort);
        } catch (IOException e) {
            log.error(e.toString());
        }
        
        if(docs!=null){
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs);
            results = getPoiDatasFromDoc(scoreDocs);
        }
        
        return results;
    }

    private List<PoiData> getPoiDatasFromDoc(ScoreDoc[] scoreDocs){
        List<PoiData> datas = new ArrayList<>();
        if (scoreDocs != null) {
            //System.out.println("总数：" + scoreDocs.length);
            for (int i = 0; i < scoreDocs.length; i++) {
                try {
                    Document hitDoc = indexSearcher.doc(scoreDocs[i].doc);
                    PoiData data = new PoiData();
                    data.setId(Long.parseLong((hitDoc.get(IDFieldName))));
                    data.setLng(Double.parseDouble(hitDoc.get(LngFieldName)));
                    data.setLat(Double.parseDouble(hitDoc.get(LatFieldName)));
                    data.setAddress(hitDoc.get(AddressFieldName));
                    datas.add(data);
                } catch (IOException e) {
                    log.error(e.toString());
                }
            }
        }
        
        return datas;
    }

搜索矩形范围内的数据：

    public List<PoiData> searchPoiInRectangle(double minLng, double minLat, double maxLng, double maxLat) {
        List<PoiData> results= new ArrayList<>();
        Point lowerLeftPoint = ctx.makePoint(minLng, minLat);
        Point upperRightPoint = ctx.makePoint(maxLng, maxLat);
        Shape rect = ctx.makeRectangle(lowerLeftPoint, upperRightPoint);
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, rect);
        Query query = strategy.makeQuery(args);
        TopDocs docs = null;
        try {
            docs = indexSearcher.search(query, 10);
        } catch (IOException e) {
            log.error(e.toString());
        }
        
        if(docs!=null){
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs);
            results = getPoiDatasFromDoc(scoreDocs);
        }
        
        return results;
    }

搜索某个范围内并根据地址关键字信息来检索POI：

public List<PoiData>searchPoByRangeAndAddress(doublelng, doublelat, double range, String address){
        List<PoiData> results= newArrayList<>();
        SpatialArgsargs = newSpatialArgs(SpatialOperation.Intersects,
        ctx.makeCircle(lng, lat, DistanceUtils.dist2Degrees(range, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
        Query geoQuery = strategy.makeQuery(args);
        
        QueryBuilder builder = newQueryBuilder(analyzer);
        Query addQuery = builder.createPhraseQuery(AddressFieldName, address);
        
        BooleanQuery.BuilderboolBuilder = newBooleanQuery.Builder();
        boolBuilder.add(addQuery, Occur.SHOULD);
        boolBuilder.add(geoQuery,Occur.MUST);
        
        Query query = boolBuilder.build();
        
        TopDocs docs = null;
        try {
            docs = indexSearcher.search(query, 10);
        } catch (IOException e) {
            log.error(e.toString());
        }
        
        if(docs!=null){
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs);
            results = getPoiDatasFromDoc(scoreDocs);
        }
        
        return results;
    }

4、关于分词

POI的地址属性和描述属性都需要做分词才能更好的进行检索和搜索。

简单对比了几种分词效果：

原文：

这是一个lucene中文分词的例子，你可以直接运行它！Chinese Analyer can analysis english text too.中国农业银行（农行）和建设银行(建行)，江苏南京江宁上元大街12号。东南大学是一所985高校。

分词结果：

smartcn SmartChineseAnalyzer

这是一个lucen中文分词的例子你可以直接运行它chinesanalycananalysienglish	ext	oo中国农业银行农行和建设银行建行江苏南京江宁上元大街12号东南大学是一所985高校

MMSegAnalyzer ComplexAnalyzer

这是一个lucene中文分词的例子你可以直接运行它chineseanalyercananalysisenglish	ext	oo中国农业银行农行和建设银行建行江苏南京江宁上元大街12号东南大学是一所985高校

IKAnalyzer

这是一个lucene中文分词的例子你可以直接运行它chineseanalyercananalysisenglish	ext	oo.中国农业银行农行和建设银行建行江苏南京江宁上元大街12号东南大学是一所985高校

分词效果对比：

1）Smartcn不能正确的分出有些英文单词，有些中文单词也被分成单个字。

2）MMSegAnalyzer能正确的分出英文和中文，但对于类似“江宁”这样的地名和“建行”等信息不是很准确。MMSegAnalyzer支持自定义词库，词库可以大大提高分词的准确性。

3）IKAnalyzer能正确的分出英文和中文，中文分词比较不错，但也有些小问题，比如单词too和最后的点号分在了一起。IKAnalyzer也支持自定义词库，但是要扩展一些源码。

总结：使用Lucene强大的数据索引和检索能力可以为一些带有经纬度和需要分词检索的数据提供搜索功能。

代码托管在GitHub上：https://github.com/luxiaoxun/Code4Java

查看全文

相关阅读:
基于sshpass批量实现主机间的key验证脚本
 一键安装mysql5.7.30脚本
 centos8网卡名称修改
 mysql分库备份脚本
 centos一键二进制编译安装mariadb-10.2.31脚本
 chrony时间同步服务简介及配置
 linux基于key验证
 expect 脚本语言中交互处理常用命令
 shell中数值测试和算术表达式比较
 JAVA Math的简单运用

原文地址：https://www.cnblogs.com/luxiaoxun/p/5020247.html

最新文章
static静态关键字
 String类
 ArrayList类
 Random类
 匿名对象
 Scanner类
 标准类 (Java Bean)
构造方法
 成员变量与局部变量的差异
 String.format 用法