  • Random forest algorithm demo (Python / Spark)

    Key parameters

    The two most important parameters, which often need tuning to improve model performance, are numTrees and maxDepth.

    • numTrees (the number of decision trees): increasing the number of trees lowers the variance of the predictions, giving higher accuracy at test time. Training time grows roughly linearly with numTrees.
    • maxDepth: the maximum possible depth of each decision tree in the forest, the same parameter discussed for single decision trees. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees treat this parameter differently: because a random forest votes over (or averages) the predictions of many trees, its prediction variance is reduced, so it is less likely to overfit than a single decision tree. A random forest can therefore use a larger maxDepth than a single decision tree model. 
      Some literature even suggests letting every tree in the forest grow as deep as possible without pruning. Either way, it is still worth experimenting with maxDepth to see whether it improves predictive performance. 
      Two further parameters, subsamplingRate and featureSubsetStrategy, generally do not need tuning. They can be adjusted to speed up training, but note that this may affect the model's predictive performance (if you do need to tune them, read the English guidelines below carefully).

    We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide. 
    The first two parameters we mention are the most important, and tuning them can often improve performance: 
    (1)numTrees: Number of trees in the forest. 
    Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. 
    Training time increases roughly linearly in the number of trees. 
    (2)maxDepth: Maximum depth of each tree in the forest. 
    Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. 
    In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest). 
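
    The variance-reduction argument above can be illustrated with a toy simulation in plain Python (no Spark involved; the "predictors", noise level, and trial counts here are made up purely for illustration): averaging more independent noisy predictors shrinks the variance of the combined prediction roughly in proportion to 1/numTrees.

```python
import random
import statistics

random.seed(42)

def noisy_predictor(true_value=1.0, noise=0.5):
    # One "tree": the true value plus independent Gaussian noise.
    return true_value + random.gauss(0, noise)

def ensemble_prediction(num_trees):
    # Average the predictions of num_trees independent "trees".
    return sum(noisy_predictor() for _ in range(num_trees)) / num_trees

def prediction_variance(num_trees, trials=2000):
    # Empirical variance of the ensemble's prediction across many trials.
    return statistics.variance(ensemble_prediction(num_trees) for _ in range(trials))

for n in (1, 10, 100):
    print("numTrees=%3d  variance=%.4f" % (n, prediction_variance(n)))
```

    The variance drops by roughly a factor of ten each time numTrees grows tenfold, which is why more trees yield more stable (and typically more accurate) test-time predictions.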
    The next two parameters generally do not require tuning. However, they can be tuned to speed up training. 
    (3)subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training. 
    (4)featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. 
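
    To make these last two parameters concrete, here is a minimal plain-Python sketch (not the MLlib implementation; the helper names are invented for illustration) of how subsamplingRate and featureSubsetStrategy shape the data each tree sees: each tree trains on a bootstrap sample sized by subsamplingRate, and each split only considers a random subset of features.

```python
import math
import random

random.seed(0)

def sample_for_tree(data, subsampling_rate=1.0):
    # Draw the training set for one tree: a bootstrap sample (sampling
    # with replacement) of size subsampling_rate * len(data).
    n = int(len(data) * subsampling_rate)
    return [random.choice(data) for _ in range(n)]

def candidate_features(num_features, strategy="sqrt"):
    # Pick the features considered at one split. The strategy names mirror
    # featureSubsetStrategy: "all" uses every feature, "sqrt" uses
    # sqrt(numFeatures), "onethird" uses numFeatures / 3.
    if strategy == "all":
        k = num_features
    elif strategy == "sqrt":
        k = max(1, int(math.sqrt(num_features)))
    elif strategy == "onethird":
        k = max(1, num_features // 3)
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return random.sample(range(num_features), k)

data = list(range(100))
print(len(sample_for_tree(data, 0.5)))        # → 50 rows drawn per tree
print(len(candidate_features(9, "sqrt")))     # → 3 of the 9 features
```

    Lowering subsampling_rate or the feature-subset size means less work per tree (hence faster training), at the cost of each tree seeing less of the data.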

    """
    Random Forest Classification Example.
    """
    from __future__ import print_function
    
    from pyspark import SparkContext
    # $example on$
    from pyspark.mllib.tree import RandomForest, RandomForestModel
    from pyspark.mllib.util import MLUtils
    # $example off$
    
    if __name__ == "__main__":
        sc = SparkContext(appName="PythonRandomForestClassificationExample")
        # $example on$
        # Load and parse the data file into an RDD of LabeledPoint.
        data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
        # Split the data into training and test sets (30% held out for testing)
        (trainingData, testData) = data.randomSplit([0.7, 0.3])
    
        # Train a RandomForest model.
        #  Empty categoricalFeaturesInfo indicates all features are continuous.
        #  Note: Use larger numTrees in practice.
        #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
        model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                             numTrees=3, featureSubsetStrategy="auto",
                                             impurity='gini', maxDepth=4, maxBins=32)
    
        # Evaluate model on test instances and compute test error
        predictions = model.predict(testData.map(lambda x: x.features))
        labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
        testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
        print('Test Error = ' + str(testErr))
        print('Learned classification forest model:')
        print(model.toDebugString())
    
        # Save and load model
        model.save(sc, "target/tmp/myRandomForestClassificationModel")
        sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
        # $example off$

     The model output looks like this:

    TreeEnsembleModel classifier with 3 trees
    
      Tree 0:
        If (feature 511 <= 0.0)
         If (feature 434 <= 0.0)
          Predict: 0.0
         Else (feature 434 > 0.0)
          Predict: 1.0
        Else (feature 511 > 0.0)
         Predict: 0.0
      Tree 1:
        If (feature 490 <= 31.0)
         Predict: 0.0
        Else (feature 490 > 31.0)
         Predict: 1.0
      Tree 2:
        If (feature 302 <= 0.0)
         If (feature 461 <= 0.0)
          If (feature 208 <= 107.0)
           Predict: 1.0
          Else (feature 208 > 107.0)
           Predict: 0.0
         Else (feature 461 > 0.0)
          Predict: 1.0
        Else (feature 302 > 0.0)
         Predict: 0.0
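
    To make the printed rules concrete, Tree 0 above can be transcribed as an ordinary Python function (a hand translation for readability, not something MLlib generates; `features` is assumed here to be a dict mapping feature index to value, with absent features treated as 0.0):

```python
def tree0_predict(features):
    # Hand transcription of "Tree 0" from the toDebugString output above.
    if features.get(511, 0.0) <= 0.0:
        if features.get(434, 0.0) <= 0.0:
            return 0.0   # feature 511 <= 0 and feature 434 <= 0
        return 1.0       # feature 511 <= 0 and feature 434 > 0
    return 0.0           # feature 511 > 0

print(tree0_predict({434: 1.0}))  # → 1.0
```

    The forest's final classification is the majority vote of the predictions of Tree 0, Tree 1, and Tree 2.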
  • Original article: https://www.cnblogs.com/bonelee/p/7204096.html