  • Python Spark random forest: a getting-started demo

    class pyspark.mllib.tree.RandomForest

    Learning algorithm for a random forest model for classification or regression.

    New in version 1.2.0.

    supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')
    classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)

    Train a random forest model for binary or multiclass classification.

    Parameters:
    • data – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
    • numClasses – Number of classes for classification.
    • categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
    • numTrees – Number of trees in the random forest.
    • featureSubsetStrategy – Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees: if numTrees == 1, set to “all”; if numTrees > 1 (forest) set to “sqrt”. (default: “auto”)
    • impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
    • maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
    • maxBins – Maximum number of bins used for splitting features. (default: 32)
    • seed – Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)
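The "auto" rule for featureSubsetStrategy described above can be sketched in plain Python. This is only an illustration, not Spark's code: the helper name features_per_split is hypothetical, and rounding fractional results up is an assumption about how the strategies are resolved.

```python
import math

def features_per_split(strategy, num_trees, num_features):
    """Hypothetical helper: how many features a classifier considers per node."""
    if strategy == "auto":
        # Documented rule: a single tree uses all features, a forest uses "sqrt".
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))   # rounding up is an assumption
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unsupported featureSubsetStrategy: %s" % strategy)

# With 16 features, one tree looks at all 16, a 3-tree forest at 4 per node.
print(features_per_split("auto", 1, 16))
print(features_per_split("auto", 3, 16))
```

Considering fewer features per node makes the individual trees less correlated, which is the usual reason a forest does not default to "all".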
    Returns:

    RandomForestModel that can be used for prediction.
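To make the impurity parameter more concrete, here is a minimal sketch of the two criteria and of the information gain of a candidate split. The function names are hypothetical, and the base-2 logarithm for entropy is an assumption; Spark computes these quantities internally from binned statistics rather than from raw label lists.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)); log base 2 is an assumption."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Labels of a four-point dataset, split into two pure halves.
labels = [0, 0, 1, 1]
print(gini(labels), entropy(labels))
print(information_gain(gini, labels, [0, 0], [1, 1]))
```

A split that separates the classes perfectly leaves both children with zero impurity, so the gain equals the parent's impurity.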

    Example usage:

    >>> from pyspark.mllib.regression import LabeledPoint
    >>> from pyspark.mllib.tree import RandomForest
    >>>
    >>> data = [
    ...     LabeledPoint(0.0, [0.0]),
    ...     LabeledPoint(0.0, [1.0]),
    ...     LabeledPoint(1.0, [2.0]),
    ...     LabeledPoint(1.0, [3.0])
    ... ]
    >>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
    >>> model.numTrees()
    3
    >>> model.totalNumNodes()
    7
    >>> print(model)
    TreeEnsembleModel classifier with 3 trees
    
    >>> print(model.toDebugString())
    TreeEnsembleModel classifier with 3 trees
    
      Tree 0:
        Predict: 1.0
      Tree 1:
        If (feature 0 <= 1.0)
         Predict: 0.0
        Else (feature 0 > 1.0)
         Predict: 1.0
      Tree 2:
        If (feature 0 <= 1.0)
         Predict: 0.0
        Else (feature 0 > 1.0)
         Predict: 1.0
    
    >>> model.predict([2.0])
    1.0
    >>> model.predict([0.0])
    0.0
    >>> rdd = sc.parallelize([[3.0], [1.0]])
    >>> model.predict(rdd).collect()
    [1.0, 0.0]
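The predictions above can be reproduced by hand from the debug string. The sketch below encodes the three printed trees as plain Python data and combines them by an unweighted majority vote; the tuple encoding and helper names are hypothetical, and equal-weight voting is an assumption about how the classifier aggregates its trees.

```python
from collections import Counter

# A leaf is a float prediction; an internal node is
# (threshold on feature 0, left subtree, right subtree).
TREES = [
    1.0,               # Tree 0: Predict 1.0
    (1.0, 0.0, 1.0),   # Tree 1: feature 0 <= 1.0 -> 0.0, else 1.0
    (1.0, 0.0, 1.0),   # Tree 2: same split as Tree 1
]

def predict_tree(node, x):
    """Walk down one tree until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x[0] <= threshold else right
    return node

def count_nodes(node):
    """Count internal nodes and leaves of one tree."""
    if isinstance(node, tuple):
        _, left, right = node
        return 1 + count_nodes(left) + count_nodes(right)
    return 1

def forest_predict(x):
    """Unweighted majority vote over all trees (an assumption)."""
    votes = Counter(predict_tree(tree, x) for tree in TREES)
    return votes.most_common(1)[0][0]

print(sum(count_nodes(t) for t in TREES))          # matches totalNumNodes()
print(forest_predict([2.0]), forest_predict([0.0]))
print([forest_predict(x) for x in ([3.0], [1.0])])
```

For the input [0.0], Tree 0 votes 1.0 while Trees 1 and 2 vote 0.0, so the forest returns 0.0, in line with the doctest output.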
    

    New in version 1.2.0.

    Excerpted from: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree

  • Original post: https://www.cnblogs.com/bonelee/p/7150484.html