  • Twenty Newsgroups Classification, part 2: seq2sparse

    seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. In the job monitoring UI for yesterday's run, this step shows up as 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the usage help of SparseVectorsFromSequenceFiles gives the following:

    Usage:                                                                          
     [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize           
    <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma      
    <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>      
    --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>        
    --overwrite --help --sequentialAccessVector --namedVector --logNormalize]       
    Options                                                                         
      --minSupport (-s) minSupport        (Optional) Minimum Support. Default       
                                          Value: 2                                  
      --analyzerName (-a) analyzerName    The class name of the analyzer            
      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB  
      --output (-o) output                The directory pathname for output.        
      --input (-i) input                  Path to job input directory.              
      --minDF (-md) minDF                 The minimum document frequency.  Default  
                                          is 1                                      
      --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors   
                                          to be used, expressed in times the        
                                          standard deviation (sigma) of the         
                                          document frequencies of these vectors.    
                                          Can be used to remove really high         
                                          frequency terms. Expressed as a double    
                                          value. Good value to be specified is 3.0. 
                                          In case the value is less then 0 no       
                                          vectors will be filtered out. Default is  
                                          -1.0.  Overrides maxDFPercent             
      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.    
                                          Can be used to remove really high         
                                          frequency terms. Expressed as an integer  
                                          between 0 and 100. Default is 99.  If     
                                          maxDFSigma is also set, it will override  
                                          this value.                               
      --weight (-wt) weight               The kind of weight to use. Currently TF   
                                          or TFIDF                                  
      --norm (-n) norm                    The norm to use, expressed as either a    
                                          float or "INF" if you want to use the     
                                          Infinite norm.  Must be greater or equal  
                                          to 0.  The default is not to normalize    
      --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood      
                                          Ratio(Float)  Default is 1.0              
      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.        
                                          Default Value: 1                          
      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
                                          create (2 = bigrams, 3 = trigrams, etc)   
                                          Default Value:1                           
      --overwrite (-ow)                   If set, overwrite the output directory    
      --help (-h)                         Print out help                            
      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should  
                                          be SequentialAccessVectors. If set true   
                                          else false                                
      --namedVector (-nv)                 (Optional) Whether output vectors should  
                                          be NamedVectors. If set true else false   
      --logNormalize (-lnorm)             (Optional) Whether output vectors should  
                                          be logNormalize. If set true else false 

    In the terminal output of yesterday's run, this step was invoked with the following command:

    ./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
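
    For reference, the same step can also be launched from Java code instead of the bin/mahout script. Below is a minimal sketch, assuming SparseVectorsFromSequenceFiles implements Hadoop's Tool interface (it extends Mahout's AbstractJob in the versions I have looked at); the class name RunSeq2Sparse and the hard-coded paths are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

    public class RunSeq2Sparse {
        public static void main(String[] args) throws Exception {
            // Same arguments as the command line above: tokenized input, vector output,
            // log normalization, named vectors, tf-idf weighting.
            String[] seq2sparseArgs = {
                "-i", "/home/mahout/mahout-work-mahout/20news-seq",
                "-o", "/home/mahout/mahout-work-mahout/20news-vectors",
                "-lnorm", "-nv", "-wt", "tfidf"
            };
            ToolRunner.run(new Configuration(), new SparseVectorsFromSequenceFiles(), seq2sparseArgs);
        }
    }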

    Looking only at the parameters actually used in that command: -lnorm means the output vectors are normalized with a log function (setting the flag turns it on); -nv means the output vectors are written as named vectors (what exactly "named" means here I am not sure yet); and -wt tfidf selects the weighting scheme, see http://zh.wikipedia.org/wiki/TF-IDF for details.
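
    To make -wt tfidf and -lnorm a bit more concrete, here is a small sketch of the arithmetic behind them. This is my own simplified illustration, not Mahout's actual TFIDF or vector-normalization code, and the exact formulas Mahout uses may differ:

    public class WeightSketch {

        // Simplified tf-idf: term frequency scaled by inverse document frequency.
        static double tfIdf(int tf, int df, int numDocs) {
            return tf * Math.log((double) numDocs / df);
        }

        // Log normalization in the spirit of -lnorm: log(1 + w) dampens very large weights.
        static double logNormalize(double weight) {
            return Math.log1p(weight);
        }

        public static void main(String[] args) {
            // A term occurring 3 times in this document and in 100 of 20000 documents.
            double w = tfIdf(3, 100, 20000);
            System.out.println("tf-idf: " + w + ", log-normalized: " + logNormalize(w));
        }
    }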

    Step (1) is kicked off at line 253 of SparseVectorsFromSequenceFiles:

    DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

    Stepping into that call, you can see that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is as follows:

    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // Tokenize the document body with the configured Lucene Analyzer.
        TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        stream.reset();
        // Collect every non-empty token into the StringTuple.
        while (stream.incrementToken()) {
          if (termAtt.length() > 0) {
            document.add(new String(termAtt.buffer(), 0, termAtt.length()));
          }
        }
        // Emit <document id, token list>; this job has no Reducer.
        context.write(key, document);
      }

    The Mapper's setup function mainly instantiates the Analyzer; for the Analyzer API see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html . The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
    I wrote the following test program:

    package mahout.fansy.test.bayes;

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.hadoop.io.Text;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.mahout.common.ClassUtils;
    import org.apache.mahout.common.StringTuple;

    public class TestSequenceFileTokenizerMapper {

        // The same analyzer that SequenceFileTokenizerMapper.setup() would instantiate.
        private static Analyzer analyzer = ClassUtils.instantiateAs(
            "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

        public static void main(String[] args) throws IOException {
            testMap();
        }

        // Reproduces the body of SequenceFileTokenizerMapper.map() on a single record.
        public static void testMap() throws IOException {
            Text key = new Text("4096");
            Text value = new Text("today is also late.what about tomorrow?");
            TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            StringTuple document = new StringTuple();
            stream.reset();
            while (stream.incrementToken()) {
                if (termAtt.length() > 0) {
                    document.add(new String(termAtt.buffer(), 0, termAtt.length()));
                }
            }
            System.out.println("key:" + key.toString() + ",document" + document);
        }

    }


    The output is as follows:

    key:4096,document[today, also, late.what, about, tomorrow]
    

    Note that "late.what" comes through as a single token because there is no space after the period, so the tokenizer does not split it. Also, the TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so whenever one of these words is encountered it is simply skipped.
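
    A quick way to see where this list comes from: DefaultAnalyzer delegates to Lucene's StandardAnalyzer, whose default English stop-word set should match the list above. A small check, assuming the Lucene 3.x API with the STOP_WORDS_SET constant:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class PrintStopWords {
        public static void main(String[] args) {
            // The default English stop-word set used by StandardAnalyzer.
            System.out.println(StandardAnalyzer.STOP_WORDS_SET);
        }
    }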

    Ah, it's gotten late again. I'm already sleepy; off to brush my teeth and floss...



    Share, be happy, grow


    Please credit the original source when reposting: http://blog.csdn.net/fansy1990


