  • Learning Mahout (Part 1)

    Official Mahout download: http://apache.fayea.com/apache-mirror/mahout/

    Environment: Ubuntu 12.04, Hadoop 1.2.1, Mahout 0.9, 2 GB of memory

    1 First, extract the tarball

    tar -zxvf /mnt/hgfs/mnt/mahout-distribution-0.9.tar.gz -C /opt/hadoop/

    2 Add the environment variables

    export HADOOP_HOME=/opt/hadoop/hadoop-1.2.1
    export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
    export MAHOUT_HOME=/opt/hadoop/mahout-distribution-0.9
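
    A quick sanity check can catch a mistyped path before Mahout is run. The sketch below is only an illustration: check_dir_var is a hypothetical helper (not part of Hadoop or Mahout) that reports whether a variable is set and points at an existing directory.

```shell
# Hypothetical helper (not shipped with Hadoop or Mahout):
# report whether a variable is set and names an existing directory.
check_dir_var() {
    name=$1     # variable name, used only in the message
    value=$2    # the variable's current value
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name=$value is not a directory"
        return 1
    fi
    echo "$name=$value ok"
}
```

    After exporting the three variables, it could be called as check_dir_var HADOOP_HOME "$HADOOP_HOME", and likewise for HADOOP_CONF_DIR and MAHOUT_HOME.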


    3 Start your Hadoop services. I won't go into the details here; see: http://www.cnblogs.com/chenfool/p/3574789.html

    4 Try running mahout

    cd /opt/hadoop/mahout-distribution-0.9
    bin/mahout --help


    Error occurred during initialization of VM
    Could not reserve enough space for object heap
    Could not create the Java virtual machine.

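    Open bin/mahout in vi and search for JAVA_HEAP_MAX=-X. This line sets the JVM's maximum heap size; the error above means the JVM could not reserve that much memory on this 2 GB machine, so lower the value.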



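    Next, search for mapred.map.child.java.opts and mapred.reduce.child.java.opts. Both are set to 4096m, which is far more than a low-end machine like this one can handle, so lower these as well.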
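
    The same edits can be scripted instead of made by hand in vi. This is a minimal sketch under stated assumptions: lower_mahout_heap is a hypothetical helper, it assumes each setting appears in the script in the form <name>=-Xmx<size>, and the 512m in the usage note is just an example value, not a recommendation from the Mahout project. A .bak copy of the script is kept.

```shell
# Hypothetical helper: rewrite the heap sizes in a Mahout launcher script.
# Assumption: the settings appear as
#   JAVA_HEAP_MAX=-Xmx<size>
#   mapred.map.child.java.opts=-Xmx<size>
#   mapred.reduce.child.java.opts=-Xmx<size>
lower_mahout_heap() {
    script=$1   # path to the script, e.g. bin/mahout
    heap=$2     # new heap size, e.g. 512m
    # -i.bak edits in place and keeps the original as <script>.bak
    sed -i.bak \
        -e "s/JAVA_HEAP_MAX=-Xmx[0-9][0-9]*[kKmMgG]*/JAVA_HEAP_MAX=-Xmx${heap}/" \
        -e "s/mapred\.map\.child\.java\.opts=-Xmx[0-9][0-9]*[kKmMgG]*/mapred.map.child.java.opts=-Xmx${heap}/g" \
        -e "s/mapred\.reduce\.child\.java\.opts=-Xmx[0-9][0-9]*[kKmMgG]*/mapred.reduce.child.java.opts=-Xmx${heap}/g" \
        "$script"
}
```

    For example, lower_mahout_heap bin/mahout 512m, followed by grep -n Xmx bin/mahout to confirm what changed.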



    With the memory settings lowered, run the command again:

    bin/mahout --help

    arff.vector: : Generate Vectors from an ARFF file or directory
    baumwelch: : Baum-Welch algorithm for unsupervised HMM training
    canopy: : Canopy clustering
    cat: : Print a file or resource as the logistic regression models would see it
    cleansvd: : Cleanup and verification of SVD output
    clusterdump: : Dump cluster output to text
    clusterpp: : Groups Clustering Output In Clusters
    cmdump: : Dump confusion matrix in HTML or text formats
    concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
    cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
    cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
    evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
    fkmeans: : Fuzzy K-means clustering
    hmmpredict: : Generate random sequence of observations by given HMM
    itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
    kmeans: : K-means clustering
    lucene.vector: : Generate Vectors from a Lucene index
    lucene2seq: : Generate Text SequenceFiles from a Lucene index
    matrixdump: : Dump matrix in CSV format
    matrixmult: : Take the product of two matrices
    parallelALS: : ALS-WR factorization of a rating matrix
    qualcluster: : Runs clustering experiments and summarizes results in a CSV
    recommendfactorized: : Compute recommendations using the factorization of a rating matrix
    recommenditembased: : Compute recommendations using item-based collaborative filtering
    regexconverter: : Convert text files on a per line basis based on regular expressions
    resplit: : Splits a set of SequenceFiles into a number of equal splits
    rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
    rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
    runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
    runlogistic: : Run a logistic regression model against CSV data
    seq2encoded: : Encoded Sparse Vector generation from Text sequence files
    seq2sparse: : Sparse Vector generation from Text sequence files
    seqdirectory: : Generate sequence files (of Text) from a directory
    seqdumper: : Generic Sequence File dumper
    seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
    seqwiki: : Wikipedia xml dump to sequence file
    spectralkmeans: : Spectral k-means clustering
    split: : Split Input data into test and train sets
    splitDataset: : split a rating dataset into training and probe parts
    ssvd: : Stochastic SVD
    streamingkmeans: : Streaming k-means clustering
    svd: : Lanczos Singular Value Decomposition
    testnb: : Test the Vector-based Bayes classifier
    trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
    trainlogistic: : Train a logistic regression using stochastic gradient descent
    trainnb: : Train the Vector-based Bayes classifier
    transpose: : Take the transpose of a matrix
    validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
    vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
    vectordump: : Dump vectors from a sequence file to text
    viterbi: : Viterbi decoding of hidden states from given output states sequence

    This shows that the Mahout environment has been deployed successfully.




