  • Carrot2 in Action: First Steps, Plugging In Your Own Chinese Tokenizer

    http://jiajiam.spaces.live.com/blog/cns!E9F2928B37455D08!281.entry

     

    First steps: plugging in your own Chinese tokenizer

     

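    I am now getting ready to write a real clustering search. At first I worried that carrot2, being a Western project, would "discriminate" against Chinese, but it turns out carrot2 takes Chinese fairly seriously: there is a class, org.carrot2.filter.lingo.local.ChineseLingoLocalFilterComponent, dedicated to word segmentation for Chinese. Looking further down, the low-level segmentation is implemented in org.carrot2.util.tokenizer.parser.jflex.JeZHWordSplit, which uses the Lucene-based MMAnalyzer. I have never used this tokenizer and do not know how well it disambiguates or how fast it segments, so I wanted to compare it with the tokenizers I normally use. That means building my own Chinese filter component. The ones I usually use are the improved Java port of the Chinese Academy of Sciences tokenizer (ICTCLAS; still slow) and the C++ version of mmseg. Since my machine at home runs Windows, I had to go with the Java port.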

    1. First, add a new parser to org.carrot2.util.tokenizer.parser; call it KellyWordSplit

    package org.carrot2.util.tokenizer.parser;

    import org.apache.lucene.analysis.ictcals.FMNM;
    import org.carrot2.util.tokenizer.parser.jflex.PreprocessedJFlexWordBasedParserBase;

    public class KellyWordSplit extends PreprocessedJFlexWordBasedParserBase {
        //public Segment seg = null;

        public KellyWordSplit() {
    //     try {
    //         seg = new Segment(1, new File(".").getCanonicalPath()
    //                + File.separator+"dic"+File.separator);
    //     } catch (IOException e) {
    //         // TODO Auto-generated catch block
    //         e.printStackTrace();
    //     }
        }

        @Override
        public String preprocess(String input) {
           System.out.println("cut:"+input);
           return FMNM.ICTCLASCut(input);
        }

    }
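
    The source of FMNM.ICTCLASCut is not shown in this post, so the following is only a rough illustration of what a preprocessing segmenter does: it takes unsegmented text and returns it with spaces between words. The sketch uses plain forward maximum matching, which is a different and much simpler technique than ICTCLAS. The class name, the dictionary, and the ASCII sample are all made up for illustration; the same loop works on Chinese strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy forward-maximum-matching segmenter; NOT the ICTCLAS algorithm. */
final class MaxMatchSketch {

    /** Returns the input with one space between the segmented words. */
    static String segment(String input, Set<String> dict) {
        // The longest dictionary entry bounds the matching window.
        int maxLen = 1;
        for (String word : dict) {
            maxLen = Math.max(maxLen, word.length());
        }
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int end = Math.min(input.length(), pos + maxLen);
            // Shrink the window until it matches a dictionary word
            // or only a single character is left.
            while (end > pos + 1 && !dict.contains(input.substring(pos, end))) {
                end--;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(input, pos, end);
            pos = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical dictionary, for illustration only.
        Set<String> dict = new HashSet<>(
                Arrays.asList("carrot", "cluster", "clusters", "search", "results"));
        System.out.println(segment("carrotclusterssearchresults", dict));
    }
}
```

    Because the longest match wins, "clusters" is preferred over "cluster" in the sample. A real tokenizer adds disambiguation on top of this, which is exactly the part worth comparing between MMAnalyzer and ICTCLAS.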

     

    Then create a parser factory in the same package: ICTCALWordBasedParserFactory

    package org.carrot2.util.tokenizer.parser;

    import org.apache.commons.pool.BasePoolableObjectFactory;
    import org.apache.commons.pool.ObjectPool;
    import org.apache.commons.pool.impl.SoftReferenceObjectPool;

    public class ICTCALWordBasedParserFactory {
        /** Chinese tokenizer factory */
        public static final ICTCALWordBasedParserFactory ChineseSimplified = new KellyICTCALWordBasedParserFactory();

        /** Parser pool */
        protected ObjectPool parserPool;

        /** No public constructor */
        private ICTCALWordBasedParserFactory() {
           // No public constructor
        }

        public WordBasedParserBase borrowParser() {
           try {
               return (WordBasedParserBase) parserPool.borrowObject();
           } catch (Exception e) {
               throw new RuntimeException("Cannot borrow a parser", e);
           }
        }

        /**
         * @param parser
         */
        public void returnParser(WordBasedParserBase parser) {
           try {
               parserPool.returnObject(parser);
           } catch (Exception e) {
               throw new RuntimeException("Cannot return a parser", e);
           }
        }

        /**
         * @author Stanislaw Osinski
         * @version $Revision: 2122 $
         */
        private static class KellyICTCALWordBasedParserFactory extends
               ICTCALWordBasedParserFactory {
           public KellyICTCALWordBasedParserFactory() {
               parserPool = new SoftReferenceObjectPool(
                      new BasePoolableObjectFactory() {
                         public Object makeObject() throws Exception {
                             return new KellyWordSplit();
                         }
                      });
           }
        }

    }
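
    The factory above exists so that parsers, which are costly to build, are borrowed and returned rather than created per request. As a minimal sketch of that borrow/return contract, here is a stand-alone pool that does not use commons-pool. SimplePoolSketch and its method names are hypothetical, and unlike SoftReferenceObjectPool it keeps strong references, so idle objects are never reclaimed by the garbage collector.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Minimal borrow/return pool; an illustrative stand-in for commons-pool. */
final class SimplePoolSketch<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int created;

    SimplePoolSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Hands out an idle instance, creating one only if none is available. */
    synchronized T borrow() {
        T obj = idle.pollFirst();
        if (obj == null) {
            created++;
            obj = factory.get();
        }
        return obj;
    }

    /** Puts an instance back so the next borrower can reuse it. */
    synchronized void giveBack(T obj) {
        idle.addFirst(obj);
    }

    synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        SimplePoolSketch<StringBuilder> pool = new SimplePoolSketch<>(StringBuilder::new);
        for (int i = 0; i < 3; i++) {
            StringBuilder sb = pool.borrow();
            try {
                sb.setLength(0); // reset state left by the previous borrower
                sb.append("request ").append(i);
                System.out.println(sb);
            } finally {
                pool.giveBack(sb); // always return, even if processing fails
            }
        }
        System.out.println("instances created: " + pool.createdCount());
    }
}
```

    The try/finally shape is the part that carries over to the real factory: every borrowParser() call should be paired with a returnParser() call, otherwise the pool keeps creating new parsers.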

    November 23

    Carrot2 in Action: First Steps, Plugging In Your Own Chinese Tokenizer (2)

    2. Step two: create your own language class, ICTCALChineseSimplified, in org.carrot2.util.tokenizer.languages.chinese
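    package org.carrot2.util.tokenizer.languages.chinese;

    import java.io.IOException;
    import java.util.Set;

    import org.carrot2.util.tokenizer.languages.StemmedLanguageBase;
    import org.carrot2.util.tokenizer.parser.ICTCALWordBasedParserFactory;
    // Imports for WordLoadingUtils, LanguageTokenizer, Stemmer and EmptyStemmer
    // are not given in the original post; add them to match your Carrot2 version.

    public class ICTCALChineseSimplified extends StemmedLanguageBase{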
        /**
         * A set of stopwords for this language.
         */
        private final static Set stopwords;
     
        /*
         * Load stopwords from an associated resource.
         */
        static
        {
            try
            {
                stopwords = WordLoadingUtils.loadWordSet("stopwords.zh-cn");
            }
            catch (IOException e)
            {
                throw new RuntimeException("Could not initialize class: " + e.getMessage());
            }
        }
     
        /**
         * Public constructor.
         */
        public ICTCALChineseSimplified()
        {
            super.setStopwords(stopwords);
        }
     
        /**
         * Creates a new instance of a {@link LanguageTokenizer} for this language.
         *
         * @see org.carrot2.util.tokenizer.languages.StemmedLanguageBase#createTokenizerInstanceInternal()
         */
        protected LanguageTokenizer createTokenizerInstanceInternal()
        {
            return ICTCALWordBasedParserFactory.ChineseSimplified.borrowParser();
        }
     
        /**
         * @return Language code: <code>zh-cn</code>
         * @see org.carrot2.core.linguistic.Language#getIsoCode()
         */
        public String getIsoCode()
        {
            return "zh-cn";
        }
     
        protected Stemmer createStemmerInstance()
        {
            return EmptyStemmer.INSTANCE;
        }
    }
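
    The static initializer above loads the stopword set once per class through WordLoadingUtils.loadWordSet, whose implementation is not shown here. As a rough sketch of what such a loader might look like, assuming a text resource with one word per line and "#" for comment lines (both assumptions, as is the class name StopwordLoaderSketch):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

/** Illustrative word-set loader; the real WordLoadingUtils may differ. */
final class StopwordLoaderSketch {

    /** Reads one word per line, skipping blank lines and '#' comments. */
    static Set<String> loadWordSet(Reader source) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // In a real setup the Reader would wrap the stopword resource stream.
        Set<String> stopwords = loadWordSet(new StringReader("# sample\nthe\n\n of \nand\n"));
        System.out.println(stopwords.contains("of"));
        System.out.println(stopwords.size());
    }
}
```

    Loading in a static block, as the language class does, means a missing resource fails fast at class initialization instead of at the first clustering request.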
     
    3. Step three: now you can build your own Chinese filter component:
    package org.carrot2.filter.lingo.local;
     
    import java.util.HashMap;
    import java.util.Map;
     
    import org.carrot2.core.linguistic.Language;
    import org.carrot2.util.tokenizer.languages.chinese.ICTCALChineseSimplified;
    import org.carrot2.util.tokenizer.languages.english.English;
     
    public class ICTCALLingoLocalFilterComponent extends LingoLocalFilterComponent {
        public ICTCALLingoLocalFilterComponent() {
           super(new Language[] { new English(), new ICTCALChineseSimplified() },
                  new ICTCALChineseSimplified(), new HashMap());
        }
     
        public ICTCALLingoLocalFilterComponent(Map parameters) {
           super(new Language[] { new English(), new ICTCALChineseSimplified() },
                  new ICTCALChineseSimplified(), parameters);
        }
     
    }
    Ha, easy, isn't it? So how do you use it?
    As follows:
    //
           final LocalComponentFactory lingo = new LocalComponentFactory() {
               public LocalComponent getInstance() {
                    HashMap defaults = new HashMap();
                 
                  // These are adjustments settings for the clustering algorithm...
                  // You can play with them, but the values below are our 'best guess'
                  // settings that we acquired experimentally.
                  defaults.put("lsi.threshold.clusterAssignment", "0.150");
                  defaults.put("lsi.threshold.candidateCluster",  "0.775");
     
                  // we will use the defaults here, see {@link Example}
                  // for more verbose configuration.
                  //return new ChineseLingoLocalFilterComponent();
                  return new ICTCALLingoLocalFilterComponent(defaults);
               }
           };
     
           // add the clustering component as "lingo-classic"
           controller.addLocalComponentFactory("lingo-classic", lingo);
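    Next time I will talk about how to integrate carrot2 into my own search framework, along with some of my views on the architecture of search clustering/classification.
     
    November 24

    Carrot2 in Action (3): Integrating into the system

    Continuing from the above: judging by the results and efficiency of the two clustering setups, the MMAnalyzer that ships with carrot2 actually performs quite well, so unless you have special requirements there is no need to add your own tokenizer component.

    Integrating into the system

    For search sources backed by lucene, Carrot2 provides a dedicated input component, LuceneLocalInputComponent. After looking at its internals, I felt it did not fit the search architecture of my system. In other words, LuceneLocalInputComponent is too "foolproof" and is not suitable for applications that need high performance. So I decided to use carrot2's direct input and output components, ArrayInputComponent and ArrayOutputComponent. As the saying goes, "the most basic is also the most flexible", and it really is true! For the filter I chose the lingo algorithm component. OK, everything is ready, so let's start assembling. Here is the main code fragment: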

     

        /**
         * @param documentList the raw documents to cluster
         * @return ArrayOutputComponent.Result, or null if clustering failed
         */
        public ArrayOutputComponent.Result cluster(
               List<RawDocumentSnippet> documentList) {
           final HashMap params = new HashMap();
           params.put(ArrayInputComponent.PARAM_SOURCE_RAW_DOCUMENTS,
                  documentList);
           ProcessingResult pResult;
           try {
               // "query" is not declared in this fragment; it is assumed to be
               // a field holding the current search query.
               pResult = controller.query("direct-feed-lingo", query, params);
               return (ArrayOutputComponent.Result) pResult.getQueryResult();
           } catch (MissingProcessException e) {
               e.printStackTrace();
           } catch (Exception e) {
               e.printStackTrace();
           }
           return null;
        }

     
    November 24

    Carrot2 in Action (4): Integrating into the system

    private LocalController initLocalController() throws DuplicatedKeyException {
           final LocalController controller = new LocalControllerBase();
           //
           // Create direct document feed input component factory. The documents
           // that this component will feed will be provided at clustering
           // request time.
           //
           final LocalComponentFactory input = new LocalComponentFactory() {
               public LocalComponent getInstance() {
                  return new ArrayInputComponent();
               }
           };
     
           // add direct document feed input as 'input-direct'
           controller.addLocalComponentFactory("input-direct", input);
     
           //
           // Now it's time to create filters. We will use Lingo clustering
           // component.
           //
           final LocalComponentFactory lingo = new LocalComponentFactory() {
               public LocalComponent getInstance() {
                    HashMap defaults = new HashMap();
                 
                  // These are adjustments settings for the clustering algorithm...
                  // You can play with them, but the values below are our 'best guess'
                  // settings that we acquired experimentally.
                  defaults.put("lsi.threshold.clusterAssignment", "0.150");
                  defaults.put("lsi.threshold.candidateCluster",  "0.775");
     
                  // we will use the defaults here, see {@link Example}
                  // for more verbose configuration.
                  //return new ChineseLingoLocalFilterComponent();
                  return new ICTCALLingoLocalFilterComponent(defaults);
               }
           };
     
           // add the clustering component as "lingo-classic"
           controller.addLocalComponentFactory("lingo-classic", lingo);
     
           //
           // Finally, create a result-catcher component
           //
           final LocalComponentFactory output = new LocalComponentFactory() {
               public LocalComponent getInstance() {
                  return new ArrayOutputComponent();
               }
           };
     
           // add the output component as "buffer"
           controller.addLocalComponentFactory("buffer", output);
     
           //
           // In the final step, assemble a process from the above.
           //
           try {
               controller
                      .addProcess("direct-feed-lingo", new LocalProcessBase(
                             "input-direct", "buffer",
                             new String[] { "lingo-classic" }));
     
           } catch (InitializationException e) {
               // This exception is thrown during verification of the added
               // component chain,
               // when a component cannot properly initialize for some reason. We
               // don't
               // expect it here, so rethrow it as runtime exception.
               throw new RuntimeException(e);
           } catch (MissingComponentException e) {
               // If you give an identifier of a component for which factory has
               // not been
               // added to the controller, you'll get this exception. Impossible in
               // our
               // example.
               throw new RuntimeException(e);
           }
     
           return controller;
        }
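
    The controller setup above follows a simple pattern: register a factory per component id, then declare a process as a chain of ids, with the chain validated when it is added. The sketch below models only that pattern with plain Java types. It is not how LocalControllerBase is implemented, and PipelineSketch, addFactory, addProcess and run are hypothetical names. Components are reduced to functions over a list of strings, and the "lingo-classic" stage is a placeholder that merely tags each document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

/** Toy model of "component factories + named process"; not Carrot2 code. */
final class PipelineSketch {
    private final Map<String, Supplier<UnaryOperator<List<String>>>> factories = new HashMap<>();
    private final Map<String, List<String>> processes = new HashMap<>();

    void addFactory(String id, Supplier<UnaryOperator<List<String>>> factory) {
        if (factories.putIfAbsent(id, factory) != null) {
            throw new IllegalArgumentException("Duplicated key: " + id);
        }
    }

    /** Component ids are given in execution order: input, filters, output. */
    void addProcess(String processId, String... componentIds) {
        for (String id : componentIds) {
            if (!factories.containsKey(id)) {
                throw new IllegalStateException("Missing component: " + id);
            }
        }
        processes.put(processId, Arrays.asList(componentIds));
    }

    List<String> run(String processId, List<String> input) {
        List<String> ids = processes.get(processId);
        if (ids == null) {
            throw new IllegalStateException("Missing process: " + processId);
        }
        List<String> data = input;
        for (String id : ids) {
            // A fresh component instance is obtained for every request.
            data = factories.get(id).get().apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        PipelineSketch controller = new PipelineSketch();
        controller.addFactory("input-direct", () -> docs -> new ArrayList<>(docs));
        controller.addFactory("lingo-classic", () -> docs -> {
            List<String> out = new ArrayList<>();
            for (String d : docs) {
                out.add("clustered:" + d); // placeholder for real clustering
            }
            return out;
        });
        controller.addFactory("buffer", () -> docs -> docs);
        controller.addProcess("direct-feed-lingo", "input-direct", "lingo-classic", "buffer");
        System.out.println(controller.run("direct-feed-lingo", Arrays.asList("doc1", "doc2")));
    }
}
```

    Validating ids in addProcess mirrors the role of MissingComponentException in the real code: a misspelled component id is caught at startup rather than on the first query.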
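    November 24

    Carrot2 in Action (5): Integrating into the system

    At this point the main steps are almost done. What remains is assembling the clustering result and returning it. I chose to return it as xml. Here is the main fragment: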

        /**
         * Assembles the clustering result into xml and returns it.
         *
         * @param result clustering result
         * @param co search context
         * @return the xml string, or null if there is no result
         */
        public String wrapperResult(ArrayOutputComponent.Result result,
               ClusterObject co) {
           if (result == null) {
               return null;
           }
           StringBuilder sb = new StringBuilder();
           final List clusters = result.clusters;
           int size = clusters.size();
           if (size > 0) {
               sb.append("<CLUSTERS_SIZE>");
               sb.append(size);
               sb.append("</CLUSTERS_SIZE>");

               int num = 1;
               for (Iterator i = clusters.iterator(); i.hasNext(); num++) {
                  wrapperCluster(sb, 0, (RawCluster) i.next(), co);
               }
           }
           return sb.toString();
        }

    Carrot2 in Action (6): Integrating into the system

     /**
         * Wrap the content of a single cluster, descending recursively to
         * subclusters.
         *
         * @param sb
         *            buffer the xml is appended to.
         * @param level
         *            current nesting level.
         * @param cluster
         *            cluster to display.
         * @param co
         *            search context (keyword and highlight markers).
         */
        private void wrapperCluster(StringBuilder sb, final int level,
               RawCluster cluster, ClusterObject co) {
           // Detect and skip "junk" clusters -- clusters that have no meaning.
           // Also note that clusters have properties. Algorithms may pass
           // additional
           // information about clusters this way.
           if (cluster.getProperty(RawCluster.PROPERTY_JUNK_CLUSTER) != null) {
               return;
           }
           sb.append("<CLUSTER>");
     
           // Get the label of the current cluster. The description of a cluster
           // is a list of strings, ordered according to the accuracy of their
           // relationship with the cluster's content. Typically you'll just
           // show the first few phrases. We'll limit ourselves to just one.
           final List phrases = cluster.getClusterDescription();
           final String label = (String) phrases.get(0);
           sb.append("<LABEL><![CDATA[");
           sb.append(label);
           sb.append("]]></LABEL>");
           sb.append("<SIZE>");
           int size = cluster.getDocuments().size();
           sb.append(size);
           sb.append("</SIZE>");
           if (size > 0)
           // if this cluster has documents, list all of them.
           {
               int count = 1;
               sb.append("<DOCUMENTS>");
               for (Iterator d = cluster.getDocuments().iterator(); d.hasNext(); count++) {
                  final RawDocument document = (RawDocument) d.next();
                  sb.append("<DOC>");
     
                  // <NUM>
                  sb.append(count);
                  sb.append(System.getProperty("line.separator"));
                  // <Score>
                  sb.append(document.getScore());
                  sb.append(System.getProperty("line.separator"));
                  // <ID>
                  sb.append(document.getTitle());
                  sb.append(System.getProperty("line.separator"));
                  // <Value>
                  sb.append(document.getProperty("Value"));
                  sb.append(System.getProperty("line.separator"));
                  // <Key>
                  String Key = document.getSnippet();
                  Key = highlightUtil.highlight(StringUtil.filterKeyWords(Key),
                         StringUtil.filterKeyWords(co.keyWord), false, analyzer,
                         co.hiliPrefix, co.hiliPostfix);
                  sb.append(Key);
                  sb.append(System.getProperty("line.separator"));
                  sb.append("</DOC>");
     
               }
               sb.append("</DOCUMENTS>");
           }
           // finally, if this cluster has subclusters, descend into recursion.
           if (cluster.getSubclusters() != null
                  && !cluster.getSubclusters().isEmpty()) {
               int num = 1;
               sb.append("<SUBCLUSTER>");
               for (Iterator c = cluster.getSubclusters().iterator(); c.hasNext(); num++) {
                  wrapperCluster(sb, level + 1, (RawCluster) c.next(), co);
               }
               sb.append("</SUBCLUSTER>");
           }
           sb.append("</CLUSTER>");
     
        }
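
    One thing to watch when building xml by string concatenation, as the two methods above do: a label containing "]]>" would terminate the CDATA section early, and titles or snippets containing "<" or "&" would produce malformed xml. The helper below is a small sketch of the two standard remedies. XmlTextSketch and its method names are hypothetical and not part of the original code.

```java
/** Illustrative helpers for putting arbitrary text into hand-built xml. */
final class XmlTextSketch {

    /** Wraps text in CDATA, splitting any "]]>" across two sections. */
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Escapes the characters that are not allowed in xml character data. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("<LABEL>" + cdata("a]]>b") + "</LABEL>");
        System.out.println("<TITLE>" + escape("R&D <news>") + "</TITLE>");
    }
}
```

    Whether escaping is needed depends on how the consumer parses the output; if the highlighted snippet intentionally carries markup from co.hiliPrefix and co.hiliPostfix, CDATA is the safer of the two.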
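        That basically completes the main steps. Of course there are still many performance-related details, such as cache configuration and concurrent handling of clustering and search, which need to be dealt with according to each system's requirements. I will not go into them here.
    November 24

    Carrot2 in Action (7): A few extra words

    A few extra words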

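           carrot2 is in fact an open-source project for real-time clustering. Its clustering input is an array, that is, all the data to be clustered is fed in at once, which clearly makes it unsuitable for clustering large volumes of data. So carrot2 suits real-time clustering projects such as news publishing systems. I skimmed the source and found that the main clustering work happens in MultilingualClusteringContext and MultilingualFeatureExtractionStrategy. Features are represented with VSM (vector space model), and extraction is done mainly by the following methods:

    private Feature[] extractSingleTerms()

    private Feature[] extractPhraseTerms(int[] indexMapping)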
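
    To make the vector space model concrete, here is a minimal sketch that turns tokenized documents into term-frequency vectors and compares them by cosine similarity. It is plain textbook VSM, not Carrot2's feature extraction code; VsmSketch and its methods are hypothetical names, and the tokens are assumed to come from whichever tokenizer is plugged in.

```java
import java.util.HashMap;
import java.util.Map;

/** Textbook term-frequency vectors with cosine similarity. */
final class VsmSketch {

    /** Counts how often each token occurs in one document. */
    static Map<String, Integer> termFrequencies(String... tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two sparse term vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("search", "cluster", "search");
        Map<String, Integer> d2 = termFrequencies("search", "cluster");
        Map<String, Integer> d3 = termFrequencies("weather", "report");
        System.out.println(cosine(d1, d2)); // shared terms, high similarity
        System.out.println(cosine(d1, d3)); // no shared terms
    }
}
```

    This also shows why tokenizer quality matters for clustering: the vector dimensions are the tokens themselves, so a bad segmentation changes which documents look similar.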

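    Real-time clustering means clustering while searching, which inevitably hurts search performance. To improve the efficiency of both clustering and search, I plan to adjust the architecture as follows: run a dedicated clustering/classification process. Step one: the search process hands the clustering input to the clustering/classification process and then gets on with its own work, such as assembling the xml (the result carries a key that identifies the clustering result for this search). Step two: on receiving the input, the clustering/classification process performs the clustering/classification and assembles the result. Step three can be implemented in two ways. In the first, once the front-end display layer receives the search results, it uses the clustering key in the result xml to fetch the clustering result from the clustering/classification process. The advantage is that the front end can display search results as early as possible, and if the clustering/classification process and the search process run on different servers, it also eases the concurrency load on the search process, since each request returns quickly and spends less time on the search server. The downside is more communication for the front end and a worse display experience: one search triggers two requests, and the search results and the cluster panel on the left appear one after the other, that is, asynchronously. In the second, the back-end search process fetches the result from the clustering/classification process before returning, and returns everything together. The advantage is fewer front-end round trips, and both results (clusters and search hits) appear at once, which feels more natural. The drawback is that each search takes longer, so the user waits longer for results.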
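
    The key-based hand-off described above can be sketched in a few lines. In this sketch a thread pool inside the same JVM stands in for the separate clustering/classification process, and the clusterer is just a function passed in; AsyncClusteringSketch, submit and fetch are hypothetical names, and a real deployment would put a network call where the map lookup is.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Search hands off clustering and gets a key back; results are fetched later. */
final class AsyncClusteringSketch {
    private final ExecutorService clusterPool = Executors.newFixedThreadPool(2);
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final Function<List<String>, String> clusterer;

    AsyncClusteringSketch(Function<List<String>, String> clusterer) {
        this.clusterer = clusterer;
    }

    /** Step one: start clustering in the background and return its key. */
    String submit(List<String> documents) {
        String key = UUID.randomUUID().toString();
        pending.put(key, CompletableFuture.supplyAsync(
                () -> clusterer.apply(documents), clusterPool));
        return key;
    }

    /** Step three: fetch the result by key, waiting up to the timeout. */
    String fetch(String key, long timeoutMillis) throws Exception {
        CompletableFuture<String> future = pending.remove(key);
        if (future == null) {
            throw new IllegalArgumentException("Unknown clustering key: " + key);
        }
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        clusterPool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder clusterer: reports only how many documents it received.
        AsyncClusteringSketch service =
                new AsyncClusteringSketch(docs -> "<CLUSTERS_SIZE>" + docs.size() + "</CLUSTERS_SIZE>");
        String key = service.submit(Arrays.asList("doc1", "doc2", "doc3"));
        // The search side would now build its own xml, embedding the key.
        System.out.println(service.fetch(key, 1000));
        service.shutdown();
    }
}
```

    Both variants of step three map onto the same two calls: in the first the front end calls fetch with the key from the search xml, in the second the search process calls fetch itself just before responding.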

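        Ah, the world of java really is boundless, and open source really does drive technology forward. It turns out there is an open-source project called weka that has long been well known in the data mining community. It contains implementations of many data mining algorithms and is also quite practical for large data sets, so next I am going to launch an "attack" on weka o(_)o…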

  • Original article: https://www.cnblogs.com/cy163/p/1730172.html