  • Stanford CoreNLP: a custom word segmentation class

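    Stanford CoreNLP's Chinese word segmentation is sometimes unsatisfactory, so we need to implement a custom segmenter class that can be tailored entirely to our own needs (for example, by intervening with various dictionaries). The previous article, "IKAnalyzer", described how flexible IKAnalyzer is; this article explains how to use IKAnalyzer as CoreNLP's segmentation tool.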

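    The article "Stanford CoreNLP's TokensRegex" mentioned CoreNLP's configuration file CoreNLP-chinese.properties, in which customAnnotatorClass.segment specifies the segmenter class. All we need to do is write our own Annotator modeled on ChineseSegmenterAnnotator and set it in the configuration file. The stock entry looks like this: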

    customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
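
    To switch to a custom class such as the IKSegmenterAnnotator in this article, the same key would point at that class instead. The following is a minimal sketch that uses only java.util.Properties to show the edited entry; the package name com.example.nlp is hypothetical, and the annotators line is an assumption about a typical pipeline rather than something taken from this article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class SegmenterConfigSketch {
    // Hypothetical fully qualified name; use the package your class really lives in.
    static final String CUSTOM_SEGMENTER = "com.example.nlp.IKSegmenterAnnotator";

    /** Parses properties text in which the "segment" annotator is the custom class. */
    static Properties buildProps() {
        String config =
              "annotators = segment, ssplit\n"  // assumed annotator list, adjust as needed
            + "customAnnotatorClass.segment = " + CUSTOM_SEGMENTER + "\n";
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println(props.getProperty("customAnnotatorClass.segment"));
    }
}
```

    In a real setup the other entries of the properties file would stay as they are; only the class name after customAnnotatorClass.segment changes.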
    

    Here is my implementation:

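    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.ChineseCoreAnnotations;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator;
    import edu.stanford.nlp.util.CoreMap;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {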
        public IKSegmenterAnnotator() {
            super();
        }
    
        public IKSegmenterAnnotator(boolean verbose) {
            super(verbose);
        }
    
        public IKSegmenterAnnotator(String segLoc, boolean verbose) {
            super(segLoc, verbose);
        }
    
        public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
            super(segLoc, verbose, serDictionary, sighanCorporaDict);
        }
    
        public IKSegmenterAnnotator(String name, Properties props) {
            super(name, props);
        }
    
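        /**
         * Segments str with IKAnalyzer. If segmentation fails, the whole
         * string is returned as a single word.
         */
        private List<String> splitWords(String str) {
            try {
                List<String> words = new ArrayList<String>();
                // The second argument turns on IK's smart segmentation mode.
                IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
                Lexeme lex = null;
                while ((lex = ik.next()) != null) {
                    words.add(lex.getLexemeText());
                }
                return words;
            } catch (IOException e) {
                //LOGGER.error(e.getMessage(), e);
                System.out.println(e);
                // Fall back to treating the entire input as one word.
                List<String> words = new ArrayList<String>();
                words.add(str);
                return words;
            }
        }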
    
        @Override
        public void runSegmentation(CoreMap annotation) {
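            // Example: the text "ABCDE" segmented into the words A, BC, D, E.
            //   words:                 A   BC   D   E
            //   ChineseSegAnnotation:  1   10   1   1   ("1" marks the first character of a word)
            //   character index:       0   12   3   4
            // Each word starts at the previous start plus the previous word's length: 0, 0+1, ...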
    
            String text = annotation.get(CoreAnnotations.TextAnnotation.class);
            List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
            List<CoreLabel> tokens = new ArrayList<CoreLabel>();
            annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);
    
            //List<String> words = segmenter.segmentString(text);
            List<String> words = splitWords(text);
            System.err.println(text);
            System.err.println("--->");
            System.err.println(words);
    
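            // pos indexes sentChars. This assumes the lengths of the words add up to
            // the characters in sentChars, i.e. the segmenter dropped or added nothing.
            int pos = 0;
            for (String w : words) {
                if (w.length() == 0) {
                    continue; // skip empty words before touching sentChars
                }
                // Mark the first character of the word as a segment start.
                CoreLabel fl = sentChars.get(pos);
                fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
                CoreLabel token = new CoreLabel();
                token.setWord(w);
                // The token begins where its first character begins ...
                token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
                pos += w.length();
                // ... and ends where its last character ends.
                fl = sentChars.get(pos - 1);
                token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
                tokens.add(token);
            }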
        }
    }
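
    The loop in runSegmentation advances pos by each word's length, which only lines up when the segmenter returns every character that sentChars contains. If the segmenter skips some characters (punctuation, for instance) or changes letter case, the offsets can drift. One possible guard, sketched here in plain Java with no CoreNLP or IKAnalyzer types, is to search for each word in the original text starting from the end of the previous match. The names OffsetAlignmentSketch, Span, align, and foldCase are hypothetical, and whether a given segmenter really drops characters is an assumption to verify against your own data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetAlignmentSketch {
    /** A token with its [begin, end) character offsets in the original text. */
    static final class Span {
        final String word;
        final int begin;
        final int end;

        Span(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Lowercases character by character, so the string length never changes. */
    static String foldCase(String s) {
        char[] cs = s.toCharArray();
        for (int i = 0; i < cs.length; i++) {
            cs[i] = Character.toLowerCase(cs[i]);
        }
        return new String(cs);
    }

    /**
     * Locates each word in the text, scanning forward from the end of the
     * previous match, so characters the segmenter skipped do not shift the
     * offsets. Matching ignores case; words that cannot be found are left out.
     */
    static List<Span> align(String text, List<String> words) {
        String haystack = foldCase(text);
        List<Span> spans = new ArrayList<>();
        int cursor = 0;
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            int begin = haystack.indexOf(foldCase(w), cursor);
            if (begin < 0) {
                continue; // word not present after the cursor, skip it
            }
            int end = begin + w.length();
            // Keep the surface form from the original text, not the folded one.
            spans.add(new Span(text.substring(begin, end), begin, end));
            cursor = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("你好", "world", "世界");
        for (Span s : align("你好,World 世界", words)) {
            System.out.println(s.word + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

    The begin and end values are offsets into the original text, so they could be used for CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation in place of the pos arithmetic. The forward search is greedy, so it can pick the wrong occurrence if a word is missing from the text and an identical string appears later.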
    

    Outside the annotator, initialize IKAnalyzer's dictionary and specify the extension dictionary and the removal dictionary:

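            // Initialize IK's dictionary and remove interfering words.
            Dictionary.initial(DefaultConfig.getInstance());
            // READ_IK_DEL_DIC names the system property that holds the path of the removal dictionary.
            String delDic = System.getProperty(READ_IK_DEL_DIC, null);
            if (delDic != null) {
                List<String> delWords = new ArrayList<>();
                // try-with-resources closes the reader even if reading fails.
                try (BufferedReader reader = new BufferedReader(new FileReader(delDic))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        delWords.add(line);
                    }
                }
                Dictionary.getSingleton().disableWords(delWords);
            }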
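
    Before Java 18, FileReader decodes with the platform default charset, so a word list saved as UTF-8 may be read incorrectly on a machine with a different default. A sketch of a loader that decodes as UTF-8 explicitly and skips blank lines follows. It assumes the file holds one word per line in UTF-8; the names WordListLoaderSketch, parseWords, and loadWords are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class WordListLoaderSketch {
    /** Reads one word per line, trimming whitespace and skipping blank lines. */
    static List<String> parseWords(Reader in) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    /** Opens the file as UTF-8, regardless of the platform default charset. */
    static List<String> loadWords(Path file) {
        try {
            return parseWords(Files.newBufferedReader(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two words separated by a blank line, the second padded with spaces.
        System.out.println(parseWords(new StringReader("的\n\n  了  \n")));
    }
}
```

    The returned list could then be handed to Dictionary.getSingleton().disableWords(...) in the same way as delWords.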
    

      

      

      

  • Original article: https://www.cnblogs.com/whuqin/p/6149742.html