  • OpenNLP Chunker

    6 Chunker

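    Overview: Text chunking divides a text into syntactically correlated groups of words, such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence.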

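    API: The chunker offers an API both for chunking with an existing model and for training a new chunker model. The test code that follows covers the first case: it tags a text with a pretrained POS model, then chunks the result with a pretrained chunker model: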
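For the training side, the chunker expects labelled data. As a rough sketch, assuming the CoNLL-2000 style layout (one token per line as `word POS chunk-tag`, with a blank line between sentences), the plain-Java helper below groups such lines into sentences. The class `ChunkTrainingFormatDemo`, its `Sentence` type, and the sample lines are hypothetical and made up for illustration; none of them is part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: reads "word POS chunk-tag" lines.
class ChunkTrainingFormatDemo {

    /** One training sentence: parallel lists of words, POS tags and chunk tags. */
    static final class Sentence {
        final List<String> words = new ArrayList<>();
        final List<String> posTags = new ArrayList<>();
        final List<String> chunkTags = new ArrayList<>();
    }

    /** Groups the lines into sentences; a blank line ends the current sentence. */
    static List<Sentence> parse(List<String> lines) {
        List<Sentence> sentences = new ArrayList<>();
        Sentence current = new Sentence();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Sentence boundary: store the sentence if it has any tokens
                if (!current.words.isEmpty()) {
                    sentences.add(current);
                    current = new Sentence();
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException("Expected 'word POS chunk', got: " + line);
            }
            current.words.add(cols[0]);
            current.posTags.add(cols[1]);
            current.chunkTags.add(cols[2]);
        }
        // The last sentence may not be followed by a blank line
        if (!current.words.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from a real corpus
        List<String> lines = Arrays.asList(
                "The DT B-NP",
                "cat NN I-NP",
                "sleeps VBZ B-VP",
                "",
                "Dogs NNS B-NP",
                "bark VBP B-VP");
        for (Sentence s : parse(lines)) {
            System.out.println(s.words + " " + s.posTags + " " + s.chunkTags);
        }
    }
}
```

Keeping the three columns as parallel lists mirrors how the test code passes tokens and tags to the chunker as parallel arrays.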

    Test code

    package package01;

    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;
    import opennlp.tools.cmdline.postag.POSModelLoader;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.*;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.Charset;

    public class Test06 {

        public static void main(String[] args) throws IOException {
            Test06.chunk();
        }

        /**
         * 5. Sequence labeling: Chunker
         * The chunker partitions a sentence into a set of chunks, using the tokens
         * generated by the tokenizer together with their POS tags.
         *
         * Input:
         * Hi. How are you? This is Mike.
         */
        public static void chunk() throws IOException {
            // Backslashes in Windows paths must be escaped inside Java string literals
            POSModel model = new POSModelLoader().load(new File("E:\\NLP_Practics\\models\\en-pos-maxent.bin"));
            //PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            // ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(str));

            Charset charset = Charset.forName("UTF-8");
            InputStreamFactory isf = new MarkableFileInputStreamFactory(new File("E:\\myText.txt"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(isf, charset);

            //perfMon.start();
            String line;
            String[] whitespaceTokenizerLine = null;
            String[] tags = null;
            // POS-tag every input line; the tokens and tags of the last line are kept for the chunker
            while ((line = lineStream.read()) != null) {
                whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
                tags = tagger.tag(whitespaceTokenizerLine);
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                System.out.println(sample.toString());
                //perfMon.incrementCounter();
            }
            lineStream.close();
            //perfMon.stopAndPrintFinalResult();

            // chunker: needs the tokens and their POS tags as parallel arrays
            InputStream is = new FileInputStream("E:\\NLP_Practics\\models\\en-chunker.bin");
            ChunkerModel cModel = new ChunkerModel(is);
            ChunkerME chunkerME = new ChunkerME(cModel);
            // One chunk tag per token (B-XX begins a chunk, I-XX continues it, O is outside any chunk)
            String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
            for (String s : result)
                System.out.println(s);
            // The same chunks as token ranges, end index exclusive
            Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
            for (Span s : span)
                System.out.println(s.toString());
            System.out.println("--------------5-------------");
            is.close();
        }
    }

      

    Output

    Loading POS Tagger model ... done (0.554s)
    Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
    B-NP
    B-ADVP
    O
    B-NP
    I-NP
    B-VP
    O
    [0..1) NP
    [1..2) ADVP
    [3..5) NP
    [5..6) VP
    --------------5-------------
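The result lists the chunks twice: first as one tag per token, then as token ranges. The sketch below is a minimal plain-Java illustration of how the two forms relate, assuming the usual reading of the tags (`B-XX` begins a chunk, `I-XX` continues it, `O` is outside any chunk). `BioSpanDemo` and `toSpans` are hypothetical names and not OpenNLP API, and the lenient handling of a stray `I-` tag is an assumption, not necessarily what OpenNLP does with malformed sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: turns B-/I-/O tags into span strings.
class BioSpanDemo {

    /** Converts chunk tags into "[start..end) TYPE" strings; end is exclusive. */
    static List<String> toSpans(String[] tags) {
        List<String> spans = new ArrayList<>();
        int start = -1;      // index of the first token of the open chunk, -1 if none
        String type = null;  // type of the open chunk, e.g. "NP"
        for (int i = 0; i <= tags.length; i++) {
            // A virtual "O" after the last token closes a chunk that is still open
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = start >= 0 && tag.equals("I-" + type);
            if (!continues) {
                if (start >= 0) {
                    spans.add("[" + start + ".." + i + ") " + type);
                    start = -1;
                    type = null;
                }
                // Assumption: a stray I-XX without a preceding B-XX also opens a chunk
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    start = i;
                    type = tag.substring(2);
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tag sequence copied from the result listing
        String[] tags = {"B-NP", "B-ADVP", "O", "B-NP", "I-NP", "B-VP", "O"};
        for (String span : toSpans(tags)) {
            System.out.println(span);
        }
    }
}
```

Run on the seven tags from the listing, this produces the same four ranges that `chunkAsSpans` printed; tokens tagged `O` (here "are" and "Mike.") belong to no span.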
    

      

    https://github.com/godmaybelieve
  • Original article: https://www.cnblogs.com/yuyu666/p/15029795.html