  • Questions about Lucene (2): stemming and lemmatization

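    Question:

    I tried out the stemming and lemmatization described in the article:

    • Reducing a word to its root form, e.g. "cars" to "car". This operation is called stemming.
    • Converting a word to its root form, e.g. "drove" to "drive". This operation is called lemmatization.

    The experiment did not work.

    The code is as follows: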

    public class TestNorms {   
        public void createIndex() throws IOException {   
            Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));   
            IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);   
            Document doc = new Document();   
            field.setValue("Hello students was drive");   
            doc.add(field);   
            writer.addDocument(doc);   
            writer.optimize();   
            writer.close();   
        }   
        public void search() throws IOException {   
            Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));   
            IndexReader reader = IndexReader.open(d);   
            IndexSearcher searcher = new IndexSearcher(reader);   
            TopDocs docs = searcher.search(new TermQuery(new Term("desc","drove")), 10);   
            System.out.println(docs.totalHits);   
        }   
        public static void main(String[] args) throws IOException {   
            TestNorms test= new TestNorms();   
            test.createIndex();   
            test.search();   
        }
    }

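    Neither singular/plural forms nor inflected word forms are recognized.

    Could the analyzer be the cause?

    Answer:

    It is indeed the analyzer. StandardAnalyzer performs neither stemming nor lemmatization, so it cannot relate singular and plural forms, or different inflections of the same word, to each other.

    The article describes the basic principles of full-text search. Understanding them helps in understanding Lucene, but it does not mean that Lucene follows that basic pipeline exactly.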
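    To make the failure concrete, here is a minimal sketch of exact-term matching. It is a toy inverted index, not Lucene: the class ToyExactIndex and its methods are hypothetical names, and the only normalization it applies is lowercasing. Because no stemming or lemmatization happens at index time, a lookup succeeds only when the query term is identical to an indexed term.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index (hypothetical, not Lucene): terms are lowercased only,
// so a lookup succeeds only when the query term equals an indexed term.
class ToyExactIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split on non-letters, lowercase each token, and record the document id.
    void add(int docId, String text) {
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = token.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Number of documents that contain exactly this term (no stemming applied).
    int hits(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        ToyExactIndex index = new ToyExactIndex();
        index.add(0, "Hello students was drive");
        System.out.println(index.hits("drive"));   // the exact term was indexed
        System.out.println(index.hits("drove"));   // a different term, so no match
        System.out.println(index.hits("student")); // "students" was indexed, not "student"
    }
}
```

    The same reasoning applies to the TermQuery in the question: the query term is compared with the indexed terms as-is, so whatever normalization is wanted has to be applied when the text is indexed.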

    (1) On stemming

    A well-known stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/ and the paper is available at http://tartarus.org/~martin/PorterStemmer/def.txt

    It can be tried out on this page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

    cars -> car

    driving -> drive

    tokenization -> token

    However

    drove -> drove

    This shows that stemming reduces a word to its root by applying rules, and cannot recognize irregular changes of word form.
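    As an illustration of what "applying rules" means, the sketch below implements only step 1a of the Porter algorithm (the plural-suffix rules SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is not the full algorithm and not Lucene's PorterStemFilter; the class name PluralStemSketch is made up for this example, and the input is assumed to be already lowercased.

```java
// Sketch of step 1a of the Porter algorithm only (plural suffixes).
// The longest matching suffix wins, so the checks run from longest to shortest.
class PluralStemSketch {
    static String stem(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                 // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // cars -> car
        }
        return word;                                     // no rule applies
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cars", "caresses", "ponies", "drove"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

    No suffix rule matches "drove", so it comes back unchanged. Mapping it to "drive" requires knowledge about the word itself, which suffix rules do not carry.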

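    Lucene 3.0, the latest release, already includes a PorterStemFilter class that implements this algorithm. Unfortunately there is no matching Analyzer, but that does not matter, because one is easy to write: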

    public class PorterStemAnalyzer extends Analyzer
    {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
          return new PorterStemFilter(new LowerCaseTokenizer(reader));
        }
    }

    Use this analyzer in your program and it will recognize singular/plural forms and regular changes of word form.

    public void createIndex() throws IOException {
      Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
      IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

      Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
      Document doc = new Document();
      field.setValue("Hello students was driving cars professionally");
      doc.add(field);

      writer.addDocument(doc);
      writer.optimize();
      writer.close();
    }

    public void search() throws IOException {
      Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
      IndexReader reader = IndexReader.open(d);
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
      System.out.println(docs.totalHits);
      docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
      System.out.println(docs.totalHits);
      docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
      System.out.println(docs.totalHits);
    }

    (2) On lemmatization

    Lemmatization generally relies on a dictionary, which is what makes it possible to map "drove" to "drive".
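    The dictionary idea can be sketched as a plain lookup table. The class DictionaryLemmatizerSketch and its four hand-written entries are assumptions made for illustration. A real lemmatizer loads a full dictionary and may return several candidate lemmas for one word, whereas this sketch returns a single lemma and falls back to the lowercased word when there is no entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Minimal dictionary-based lemmatizer sketch (hypothetical, illustration only).
class DictionaryLemmatizerSketch {
    private final Map<String, String> lemmas = new HashMap<>();

    DictionaryLemmatizerSketch() {
        // Hand-written sample entries; a real lemmatizer loads a full dictionary.
        lemmas.put("drove", "drive");
        lemmas.put("driven", "drive");
        lemmas.put("was", "be");
        lemmas.put("were", "be");
    }

    // Look the lowercased word up; unknown words are returned unchanged.
    String lemmatize(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return lemmas.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        DictionaryLemmatizerSketch lemmatizer = new DictionaryLemmatizerSketch();
        for (String w : new String[] {"drove", "was", "cars"}) {
            System.out.println(w + " -> " + lemmatizer.lemmatize(w));
        }
    }
}
```

    Unlike a rule-based stemmer, the lookup handles irregular forms, but only those that appear in the dictionary: "cars" is returned unchanged here because the sample table has no entry for it.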

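    A web search turned up European languages lemmatizer [http://lemmatizer.org/]. It is written in C++ for Linux; try it out if you are interested.

    First download, compile, and install it by following the instructions on the site: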

    libMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

    # tar xzf libMAFSA-0.2.tar.gz
    # cd libMAFSA-0.2/
    # cmake .
    # make
    # sudo make install
    After this you should install libturglem. You can download it at the same place.
    # tar xzf libturglem-0.2.tar.gz
    # cd libturglem-0.2
    # cmake .
    # make
    # sudo make install
    Next you should install english dictionaries with some additional features to work with.
    # tar xzf turglem-english-0.2.tar.gz
    # cd turglem-english-0.2
    # cmake .
    # make
    # sudo make install

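    After installation:

    • /usr/local/include/turglem contains the header files, used to compile your own code
    • /usr/local/share/turglem/english contains the dictionary files. In lemmas.xml you can see the correspondence between "drove" and "drive", and between "was" and "be".
    • libMAFSA.a, libturglem.a, libturglem-english.a, and libtxml.a in /usr/local/lib are the static libraries used to build applications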

    <l id="DRIVE" p="6" />

    <l id="DROVE" p="6" />

    <l id="DRIVING" p="6" />

    The turglem-english-0.2 directory contains a sample test program, test_utf8.cpp:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <turglem/lemmatizer.h>
    #include <turglem/lemmatizer.hpp>
    #include <turglem/english/charset_adapters.hpp>

    int main(int argc, char **argv)
    {
            char in_s_buf[1024];
            char *nl_ptr;

            tl::lemmatizer lem;

            if(argc != 4)
            {
                    printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
                    return -1;
            }

            lem.load_lemmatizer(argv[1], argv[3], argv[2]);

            // Loop on fgets itself so the last line is not processed twice at EOF.
            while (fgets(in_s_buf, sizeof(in_s_buf), stdin))
            {
                    nl_ptr = strchr(in_s_buf, '\n');
                    if (nl_ptr) *nl_ptr = 0;
                    nl_ptr = strchr(in_s_buf, '\r');
                    if (nl_ptr) *nl_ptr = 0;

                    if (in_s_buf[0])
                    {
                            printf("processing %s\n", in_s_buf);
                            tl::lem_result pars;
                            size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
                            printf("%u\n", (unsigned int)pcnt);
                            for (size_t i = 0; i < pcnt; i++)
                            {
                                    std::string s;
                                    u_int32_t src_form = lem.get_src_form(pars, i);
                                    s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                                    printf("PARADIGM %d: normal form '%s'\n", (unsigned int)i, s.c_str());
                                    printf("\tpart of speech:%d\n", lem.get_part_of_speech(pars, (unsigned int)i, src_form));
                            }
                    }
            }

            return 0;
    }

    Compile this file and link the static libraries. Mind the link order, otherwise linking may fail.

    g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

    Run the compiled program:

    ./output /usr/local/share/turglem/english/dict_english.auto \
        /usr/local/share/turglem/english/prediction_english.auto \
        /usr/local/share/turglem/english/paradigms_english.bin

    Testing it: although I do not yet fully understand its mechanism, the effect of lemmatization is visible:

    drove
    processing drove
    3
    PARADIGM 0: normal form 'DROVE'
            part of speech:0
    PARADIGM 1: normal form 'DROVE'
            part of speech:2
    PARADIGM 2: normal form 'DRIVE'
            part of speech:2

    was
    processing was
    3
    PARADIGM 0: normal form 'BE'
            part of speech:3
    PARADIGM 1: normal form 'BE'
            part of speech:3
    PARADIGM 2: normal form 'BE'
            part of speech:3

  • Original article: https://www.cnblogs.com/forfuture1978/p/1664915.html