To execute Mr.LDA

The key to getting readable results is to convert the outcomes, which are stored in HDFS (Mr.LDA runs on Hadoop), into a proper format. The methods are described in detail in the original Mr.LDA and can be used by referring to its README.md. The main steps for training on the corpus are as follows:

1. Prepare corpus

Two points need attention.
- First, the format of the corpus is the same as for lda-c. Therefore, we have to write code to convert the corpus into that format.
- Second, before it can be handled on Hadoop, the corpus has to be processed once more. The code for this is already available in the original Mr.LDA, so all we need to do is write an sh file like this:
```
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
    -input ap-sample.txt -output ap-sample-parsed
```
The complete corpus is then separated into several parts by property, like this:
```
$ hadoop fs -ls ap-sample-parsed
ap-sample-parsed/document
ap-sample-parsed/term
ap-sample-parsed/title
```
The corpus we then use to run Mr.LDA comes from this folder.

2. Run "vanilla" LDA

This step takes a long time, about 1 to 2 hours, so we run it with the nohup command.
Set some parameters and run it like this:
```
$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &
```
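
When the run has to be repeated with different settings, the command above can also be assembled from a script. The following is only a minimal sketch in Python and is not part of Mr.LDA: the helpers `build_mrlda_command` and `launch` are hypothetical names, while the jar path, class name and flags are copied from the command above.

```python
import subprocess


def build_mrlda_command(input_dir, output_dir, terms, topics,
                        iterations, mappers, reducers,
                        jar="target/mrlda-0.9.0-SNAPSHOT-fatjar.jar"):
    """Return the argument list for the VariationalInference run shown above.

    Hypothetical helper: it only mirrors the flags used in this post.
    """
    return [
        "hadoop", "jar", jar, "cc.mrlda.VariationalInference",
        "-input", input_dir, "-output", output_dir,
        "-term", str(terms), "-topic", str(topics),
        "-iteration", str(iterations),
        "-mapper", str(mappers), "-reducer", str(reducers),
    ]


def launch(command, log_path="lda.log"):
    """Start the job in its own session, sending stdout and stderr to a log file."""
    log = open(log_path, "w")  # left open on purpose: the child keeps writing to it
    return subprocess.Popen(command, stdout=log, stderr=subprocess.STDOUT,
                            start_new_session=True)


command = build_mrlda_command("ap-sample-parsed/document", "ap-sample-lda",
                              terms=10000, topics=20, iterations=50,
                              mappers=50, reducers=20)
print(" ".join(command))
# launch(command)  # uncomment on a machine where Hadoop is installed
```

`start_new_session=True` detaches the job from the terminal, which is roughly what nohup is used for above.
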
3. Convert outcomes to proper format

The outcomes are stored in HDFS and are not directly readable. If we want readable data, we must convert them to a proper format.
Note that the conversion method needs the SciPy module for Python, which is used to read MATLAB data and similar formats. To add the module we only need to type:
```
$ sudo apt-get install python-scipy
```
Then we can view the alpha, id and beta files in the terminal by using the original Mr.LDA. One question remains open here: how to get beta, alpha and the other files as final outcomes.
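
Once beta is available in a readable form, a common way to inspect it is to list the highest-weighted terms of each topic. The sketch below rests on assumptions that do not come from Mr.LDA itself: beta is taken to be already loaded in Python as one row of per-term weights per topic, and the vocabulary as a list where position i holds the term for column i. The function `top_terms`, the toy matrix and the toy vocabulary are all invented for illustration.

```python
def top_terms(beta, vocabulary, n=3):
    """For each topic (row of beta), return the n terms with the largest weight."""
    result = []
    for row in beta:
        # Sort term indices by weight, largest first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Toy data, invented for this example: 2 topics over a 5-term vocabulary.
vocabulary = ["market", "stock", "police", "court", "price"]
beta = [
    [0.40, 0.30, 0.02, 0.03, 0.25],
    [0.04, 0.06, 0.45, 0.40, 0.05],
]

for topic_id, terms in enumerate(top_terms(beta, vocabulary)):
    print(topic_id, " ".join(terms))
```

Because only the ordering of the weights matters, the same function should work whether the rows hold probabilities or log-probabilities.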

z. About evaluation of machine learning

The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.
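
The split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for this task: the function name `split_corpus`, the 80/20 ratio and the toy document ids are assumptions made for the example.

```python
import random


def split_corpus(documents, train_fraction=0.8, dev_fraction=0.0, seed=0):
    """Shuffle documents and split them into (train, dev, test) lists.

    With dev_fraction=0.0 the development set is empty, which matches the
    train/test-only setup described above.
    """
    if not 0.0 <= train_fraction + dev_fraction <= 1.0:
        raise ValueError("fractions must sum to at most 1")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = int(len(shuffled) * train_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Toy corpus of 10 document ids, invented for this example.
docs = ["doc%d" % i for i in range(10)]
train, dev, test = split_corpus(docs, train_fraction=0.8)
print(len(train), len(dev), len(test))
```

Shuffling before splitting avoids a test set that only contains the last documents of the file, and the fixed seed keeps the split identical between runs.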
Original article: https://www.cnblogs.com/cyno/p/4182026.html