  • PySpark random forest feature importance

    # IMPORT
    >>> import numpy
    >>> from numpy import allclose
    >>> from pyspark.ml.linalg import Vectors
    >>> from pyspark.ml.feature import StringIndexer
    >>> from pyspark.ml.classification import RandomForestClassifier
    
    # PREPARE DATA
    >>> df = spark.createDataFrame([
    ...     (1.0, Vectors.dense(1.0)),
    ...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
    >>> si_model = stringIndexer.fit(df)
    >>> td = si_model.transform(df)
    
    # BUILD THE MODEL
    >>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
    >>> model = rf.fit(td)
    
    # FEATURE IMPORTANCES
    >>> model.featureImportances
    SparseVector(1, {0: 1.0}) 
    

      

    Feature importance:

    model.featureImportances
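
    featureImportances returns a vector indexed by feature position, so a common follow-up is pairing each entry with its column name and sorting. The helper below is a minimal sketch of that step; rank_importances, the feature names, and the values are hypothetical and only for illustration. With a fitted model, the second argument would presumably be model.featureImportances.toArray().

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, highest first."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical names and values, not produced by the model above.
# With a real model: rank_importances(names, model.featureImportances.toArray())
ranked = rank_importances(["age", "income", "clicks"], [0.2, 0.5, 0.3])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

    The names must be listed in the same order used when the features vector was assembled, otherwise the pairing is wrong.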

    A simple PySpark model example:

     https://blog.csdn.net/Katherine_hsr/article/details/80988994

    Probability:

    predictions.select("probability", "label").show(1000)

    The probability column holds the predicted probability for each class.

    Shuffling samples with pandas:

    import pandas as pd
    df = pd.read_excel("window regulator01 _0914新增样本.xlsx")
    df = df.sample(frac=1)  # shuffle the rows
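
    sample(frac=1) keeps the original index labels and gives a different order on every run. A minimal sketch of a reproducible variant follows, assuming a small in-memory frame in place of the Excel file; the column names and the seed value are hypothetical.

```python
import pandas as pd

# Hypothetical in-memory frame standing in for the Excel sheet.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# frac=1 samples every row without replacement, i.e. a full shuffle.
# random_state fixes the order across runs; reset_index(drop=True)
# replaces the shuffled index labels with a fresh 0..n-1 index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.head())
```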

    Random train/test split in PySpark:

     train, test = labeled_v.randomSplit([0.75, 0.25])


  • Original article: https://www.cnblogs.com/Allen-rg/p/10445893.html