  • [Neo4j] Pitfall roundup: Chinese full-text indexing in Neo4j

    I am running the current latest version of Neo4j, 3.1.0, and have hit all sorts of pitfalls along the way. This post explains how to set up a Chinese full-text index in Neo4j 3.1.0, using IKAnalyzer as the tokenizer.


    1. Start with this reference article:

    https://segmentfault.com/a/1190000005665612

    It gives a rough outline of how to build an index with IKAnalyzer, but it is not very clear. In fact, the article assumes embedded Neo4j, i.e. Neo4j must run embedded inside your Java application (https://neo4j.com/docs/java-reference/current/#tutorials-java-embedded); keep that in mind, because a custom Analyzer cannot be used otherwise. Also, the approach in that article no longer works as written: Neo4j 3.1.0 ships with Lucene 5.5, so the official IKAnalyzer is no longer compatible.
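    For context, here is a minimal sketch of starting an embedded Neo4j 3.x database from a Java application; the database path is illustrative:

    import java.io.File;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class EmbeddedNeo4j {
        public static void main(String[] args) {
            // Open (or create) an embedded database at the given path.
            GraphDatabaseService graphDb = new GraphDatabaseFactory()
                    .newEmbeddedDatabase(new File("data/graph.db"));
            // Make sure the database shuts down cleanly with the JVM.
            Runtime.getRuntime().addShutdownHook(new Thread(graphDb::shutdown));
        }
    }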


    2. The fix

     Switch to IKAnalyzer2012FF_u1.jar, which can be downloaded from Google Code (https://code.google.com/archive/p/ik-analyzer/downloads). It is a community-patched build of IKAnalyzer that fixes the incompatibility with Lucene 3.5 and later. Even with this jar, though, there is still a problem; the error is:

    Caused by: java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;

    In other words, IKAnalyzer's Analyzer class is still partly incompatible with the current Lucene version: Lucene 5.x replaced the abstract method createComponents(String fieldName, Reader reader) with createComponents(String fieldName), so an Analyzer compiled against the old signature fails at runtime with an AbstractMethodError.

    The solution: add the following two classes.

    package com.uc.wa.function;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;

    public class IKAnalyzer5x extends Analyzer {

        private boolean useSmart;

        public boolean useSmart() {
            return useSmart;
        }

        public void setUseSmart(boolean useSmart) {
            this.useSmart = useSmart;
        }

        public IKAnalyzer5x() {
            this(false);
        }

        public IKAnalyzer5x(boolean useSmart) {
            super();
            this.useSmart = useSmart;
        }

        /* The pre-Lucene-5 variant, kept for reference:
        protected TokenStreamComponents createComponents(String fieldName, final Reader in) {
            Tokenizer _IKTokenizer = new IKTokenizer(in, this.useSmart());
            return new TokenStreamComponents(_IKTokenizer);
        }
        */

        /**
         * Override the new Lucene 5.x createComponents (no Reader parameter)
         * to build the tokenizer chain.
         */
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer _IKTokenizer = new IKTokenizer5x(this.useSmart());
            return new TokenStreamComponents(_IKTokenizer);
        }
    }

    package com.uc.wa.function;

    import java.io.IOException;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    public class IKTokenizer5x extends Tokenizer {

        // The IK segmenter implementation
        private IKSegmenter _IKImplement;

        // Term text attribute
        private final CharTermAttribute termAtt;
        // Term offset attribute
        private final OffsetAttribute offsetAtt;
        // Term type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
        private final TypeAttribute typeAtt;
        // End position of the last lexeme
        private int endPosition;

        /* The pre-Lucene-5 constructor, kept for reference:
        public IKTokenizer(Reader in, boolean useSmart) {
            super(in);
            offsetAtt = addAttribute(OffsetAttribute.class);
            termAtt = addAttribute(CharTermAttribute.class);
            typeAtt = addAttribute(TypeAttribute.class);
            _IKImplement = new IKSegmenter(input, useSmart);
        }
        */

        /**
         * Lucene 5.x Tokenizer constructor: no Reader parameter;
         * the inherited input Reader is used instead.
         * @param useSmart whether to use IK's smart segmentation mode
         */
        public IKTokenizer5x(boolean useSmart) {
            super();
            offsetAtt = addAttribute(OffsetAttribute.class);
            termAtt = addAttribute(CharTermAttribute.class);
            typeAtt = addAttribute(TypeAttribute.class);
            _IKImplement = new IKSegmenter(input, useSmart);
        }

        /* (non-Javadoc)
         * @see org.apache.lucene.analysis.TokenStream#incrementToken()
         */
        @Override
        public boolean incrementToken() throws IOException {
            // Clear the attributes left over from the previous token
            clearAttributes();
            Lexeme nextLexeme = _IKImplement.next();
            if (nextLexeme != null) {
                // Copy the Lexeme into the Lucene attributes:
                // term text
                termAtt.append(nextLexeme.getLexemeText());
                // term length
                termAtt.setLength(nextLexeme.getLength());
                // term offsets
                offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
                // remember the end position of the last lexeme
                endPosition = nextLexeme.getEndPosition();
                // term type
                typeAtt.setType(nextLexeme.getLexemeTypeString());
                // return true to signal that another token may follow
                return true;
            }
            // return false to signal that the token stream is exhausted
            return false;
        }

        /* (non-Javadoc)
         * @see org.apache.lucene.analysis.Tokenizer#reset(java.io.Reader)
         */
        @Override
        public void reset() throws IOException {
            super.reset();
            _IKImplement.reset(input);
        }

        @Override
        public final void end() {
            // Set the final offset
            int finalOffset = correctOffset(this.endPosition);
            offsetAtt.setOffset(finalOffset, finalOffset);
        }
    }

    This resolves the incompatibility between IKAnalyzer2012FF_u1.jar and Lucene 5. To use it, just replace IKAnalyzer with IKAnalyzer5x.
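    A quick way to verify the new analyzer works against Lucene 5.5 is to tokenize a short Chinese string. A minimal sketch; the field name "content" and the sample text are arbitrary:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class IKAnalyzer5xDemo {
        public static void main(String[] args) throws Exception {
            try (IKAnalyzer5x analyzer = new IKAnalyzer5x(true);
                 TokenStream ts = analyzer.tokenStream("content", "南昌市的中文分词测试")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                // Print each token produced by the IK segmenter
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }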


    3. Finally

    Examples of building and querying a Chinese index in Neo4j:

    /**
     * Create a full-text index for a single node.
     *
     * @param id       the node id
     * @param propKeys the property keys to index
     */
    public static void createFullTextIndex(long id, List<String> propKeys) {
        log.info("method[createFullTextIndex] begin.propKeys<" + propKeys + ">");
        Index<Node> entityIndex = null;
        try (Transaction tx = Neo4j.graphDb.beginTx()) {
            // Look up (or create) the full-text index, configured to use IKAnalyzer5x
            entityIndex = Neo4j.graphDb.index().forNodes("NodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));
            Node node = Neo4j.graphDb.getNodeById(id);
            log.info("method[createFullTextIndex] get node id<" + node.getId() + "> name<"
                    + node.getProperty("knowledge_name") + ">");
            // Fetch the node's properties and add each one to the index
            Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0]))
                    .entrySet();
            for (Map.Entry<String, Object> property : properties) {
                log.info("method[createFullTextIndex] index prop<" + property.getKey() + ":" + property.getValue() + ">");
                entityIndex.add(node, property.getKey(), property.getValue());
            }
            tx.success();
        }
    }
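    Calling it is straightforward; the node id and property name below are illustrative:

    // Index the "knowledge_name" property of node 42 (both values are hypothetical).
    createFullTextIndex(42L, java.util.Arrays.asList("knowledge_name"));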

    /**
     * Query using the full-text index.
     *
     * @param fields the fields to search
     * @param query  the query string
     * @return matching nodes with their properties, plus a "score" entry
     * @throws IOException
     */
    public static List<Map<String, Object>> selectByFullTextIndex(String[] fields, String query) throws IOException {
        List<Map<String, Object>> ret = Lists.newArrayList();
        try (Transaction tx = Neo4j.graphDb.beginTx()) {
            IndexManager index = Neo4j.graphDb.index();
            // Look up the index with the same analyzer configuration used at creation time
            Index<Node> addressNodeFullTextIndex = index.forNodes("NodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));
            // Build a multi-field query with IKQueryParser and run it
            Query q = IKQueryParser.parseMultiField(fields, query);
            IndexHits<Node> foundNodes = addressNodeFullTextIndex.query(q);
            for (Node n : foundNodes) {
                Map<String, Object> m = n.getAllProperties();
                if (!Float.isNaN(foundNodes.currentScore())) {
                    m.put("score", foundNodes.currentScore());
                }
                log.info("method[selectByIndex] score<" + foundNodes.currentScore() + ">");
                ret.add(m);
            }
            tx.success();
        } catch (IOException e) {
            log.error("method[selectByIndex] fields<" + Joiner.on(",").join(fields) + "> query<" + query + ">", e);
            throw e;
        }
        return ret;
    }

    Note that I used IKQueryParser here, which builds the Query automatically from the query terms and the target fields. This sidesteps another pitfall: querying with a raw Lucene query string is problematic. For example, the query string "address:南昌市" matches every address containing the character 市, which is clearly wrong. Switching to IKQueryParser fixes this. IKQueryParser is a utility bundled with the original IKAnalyzer, but it was stripped out of IKAnalyzer2012FF_u1.jar, so I pulled the original IKAnalyzer jar back in; the project ends up with both jars on the classpath.
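    To illustrate the difference, a sketch against the index handle from the method above; the field name "address" is an assumption:

    // Raw Lucene query syntax: the query text gets tokenized so that it
    // matches every address containing 市, which is far too broad.
    IndexHits<Node> tooBroad = addressNodeFullTextIndex.query("address:南昌市");

    // IKQueryParser (from the original IKAnalyzer jar) segments the query
    // text with IK first, then builds a multi-field Query, so 南昌市 is
    // matched as a whole term.
    Query q = IKQueryParser.parseMultiField(new String[]{"address"}, "南昌市");
    IndexHits<Node> precise = addressNodeFullTextIndex.query(q);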

    That covers most of the pitfalls.


    Original article: https://blog.csdn.net/hereiskxm/article/details/54345261