Spark decision tree classification demo

Classification

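The example below shows how to load a LIBSVM data file, parse it into an RDD[LabeledPoint], and then perform classification with a decision tree. Gini impurity is used as the impurity measure and the maximum tree depth is set to 5. Finally, the test error rate is computed to evaluate the algorithm's accuracy.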
```
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
# `sc` is an existing SparkContext, e.g. the one the pyspark shell provides.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "myModelPath")
sameModel = DecisionTreeModel.load(sc, "myModelPath")
```
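To make the `impurity='gini'` setting more concrete, here is a small plain-Python sketch of how Gini impurity can be computed for a set of labels and for a candidate binary split. It does not use Spark. The function names `gini` and `split_gini` and the toy labels are hypothetical, chosen only to illustrate the measure; this is not MLlib's internal implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / float(n)) ** 2 for c in counts.values())


def split_gini(left, right):
    """Gini impurity of a binary split, weighted by the size of each side."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / float(n)


# Toy labels, for illustration only.
parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                      # evenly mixed node: 0.5
print(split_gini([0, 0, 0], [1, 1, 1]))  # perfectly separating split: 0.0
print(split_gini([0, 0, 1], [0, 1, 1]))  # partially mixed split, roughly 0.444
```

A tree learner that uses this criterion prefers the split with the lowest weighted impurity, so in the toy data the second split would be chosen over the third.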
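The following code shows how to load a LIBSVM data file, parse it into an RDD of LabeledPoint, and then train a decision tree using Gini impurity as the impurity measure and a maximum tree depth of 5. The test error is computed to assess the algorithm's accuracy. This version is a standalone script that creates its own SparkContext and runs in yarn-client mode.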
```
# -*- coding:utf-8 -*-
"""
Decision tree test.
"""
import os
import sys
import logging

# Path to the Spark installation folder
os.environ['SPARK_HOME'] = r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6"
# YARN_CONF_DIR is an environment variable, not a Spark configuration key
os.environ['YARN_CONF_DIR'] = r"D:\javaPackages\hadoop_conf_dir\yarn-conf"

# Append pyspark and py4j to the Python path before importing pyspark
sys.path.append(r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6\python")
sys.path.append(r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip")

from pyspark import SparkConf, SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

conf = SparkConf()
conf.set("spark.driver.memory", "2g")
# conf.set("spark.executor.memory", "1g")
# conf.set("spark.python.worker.memory", "1g")
conf.setMaster("yarn-client")
conf.setAppName("TestDecisionTree")

logger = logging.getLogger('pyspark')
sc = SparkContext(conf=conf)
mylog = []

# Load and parse the data file into an RDD of LabeledPoint
data = MLUtils.loadLibSVMFile(sc, "/home/xiatao/machine_learing/")

# Split the data into training and test sets
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train the decision tree model.
# An empty categoricalFeaturesInfo means all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate the model on the test instances and compute the test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPoint = testData.map(lambda lp: lp.label).zip(predictions)
# Index the (label, prediction) tuple; tuple-unpacking lambdas are Python 2 only
testMSE = labelsAndPoint.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
mylog.append("Test error:")
mylog.append(testMSE)

# Save the model
model.save(sc, "/home/xiatao/machine_learing/")
sc.parallelize(mylog).saveAsTextFile("/home/xiatao/machine_learing/log")
sameModel = DecisionTreeModel.load(sc, "/home/xiatao/machine_learing/")
```
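The script above stores a mean squared error in `testMSE`, while the first example reports a misclassification rate in `testErr`. When labels and predictions are restricted to 0 and 1, the two quantities coincide, because `(v - p) ** 2` equals 1 exactly when `v != p` and 0 otherwise. A minimal plain-Python sketch of that equivalence, using made-up (label, prediction) pairs and hypothetical helper names:

```python
def error_rate(pairs):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(pairs)
    return sum(1 for v, p in pairs if v != p) / float(len(pairs))


def mean_squared_error(pairs):
    """Mean of squared differences between label and prediction."""
    pairs = list(pairs)
    return sum((v - p) ** 2 for v, p in pairs) / float(len(pairs))


# Made-up (label, prediction) pairs for a binary problem; two of five disagree.
pairs = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(error_rate(pairs))          # 0.4
print(mean_squared_error(pairs))  # 0.4
```

The equivalence holds only for 0/1 labels. With more than two classes the squared difference depends on how the classes are numbered, so the misclassification rate is the safer metric for classification.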
Original post: https://www.cnblogs.com/bonelee/p/7149804.html
