  • To execute Mr.LDA

    The key to getting readable results is converting the output, which is stored in HDFS (Mr.LDA runs on Hadoop), to a proper format. The detailed methods are described in the original Mr.LDA project; refer to its README.md. The main steps to train on a corpus are as follows:

    1. Prepare the corpus

    Two points need attention.

    • First, the corpus format is the same as that of lda-c, so we have to write code to convert our corpus to that format.
    • Second, to be processed on Hadoop, the corpus must be parsed again. The code for this is already available in the original Mr.LDA, so all we need to do is write a sh file like this:
    $ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
        -input ap-sample.txt -output ap-sample-parsed
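
For the first point, the conversion can be sketched in a few lines of Python. This is only a minimal illustration, assuming the sparse lda-c layout in which each line is `N id:count id:count ...` and N is the number of distinct terms in the document. The helper names (`build_vocabulary`, `to_ldac_line`) and the toy corpus are hypothetical and not part of Mr.LDA or lda-c.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct term an integer id in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_ldac_line(tokens, vocab):
    """Format one document as 'N id:count id:count ...' (N = distinct terms)."""
    counts = Counter(vocab[token] for token in tokens)
    pairs = " ".join(
        f"{term_id}:{count}" for term_id, count in sorted(counts.items())
    )
    # strip() avoids a trailing space when the document is empty
    return f"{len(counts)} {pairs}".strip()


# Hypothetical toy corpus, already tokenized.
docs = [
    ["topic", "model", "topic"],
    ["hadoop", "model"],
]
vocab = build_vocabulary(docs)
lines = [to_ldac_line(tokens, vocab) for tokens in docs]
print("\n".join(lines))
```

A real corpus would also need tokenization and stop-word filtering before this step, and the vocabulary should be saved so that term ids can be mapped back to words later.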
    

    The parsed corpus is split into several parts by property, like this:

    $ hadoop fs -ls ap-sample-parsed
    ap-sample-parsed/document
    ap-sample-parsed/term
    ap-sample-parsed/title
    

    The corpus we then feed to Mr.LDA comes from this folder.

    2. Run "vanilla" LDA

    This step takes a long time, about 1 to 2 hours, so we run it in the background with the nohup command.
    Set the parameters and run it like this:

    $ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
        cc.mrlda.VariationalInference \
        -input ap-sample-parsed/document -output ap-sample-lda \
        -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &
    

    3. Convert the outcomes to a proper format

    The outcomes are stored in HDFS and are not directly readable. To get readable data, we must convert them to a proper format.
    Note that the conversion method needs the SciPy module for Python, which is used to read MATLAB files and similar data. To install the module we only need to type:

    $ sudo apt-get install python-scipy
    

    Then we can view the alpha, id and beta files in the terminal by using the original Mr.LDA. One question remains open here: how to get beta, alpha and the other files as final outcomes.

    z. About evaluation of machine learning

    The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.
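
A two-way split of this kind could look like the sketch below. It is not taken from Mr.LDA; the function name `split_corpus`, the 20% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle a copy of the documents and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical corpus of ten document identifiers.
docs = [f"doc-{i}" for i in range(10)]
train, test = split_corpus(docs, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed means the same documents end up in the test set every time, so results from different runs stay comparable.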

  • Original article: https://www.cnblogs.com/cyno/p/4182026.html