  • pyspark

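    I came across this while tidying up my files. It feels like code from a long time ago, but having read through it again, it still holds up; at the very least the comments are fairly clear. Back then I really was a textbook "just call the package" guy....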

    # Logistic regression: the standard routine
    
    from pyspark.ml.feature import VectorAssembler
    import pandas as pd
    
    # 1. Prepare the data: a sample dataset
    sample_dataset = [
        (0, "male", 37, 10, "no", 3, 18, 7, 4),
        (0, "female", 27, 4, "no", 4, 14, 6, 4),
        (0, "female", 32, 15, "yes", 1, 12, 1, 4),
        (0, "male", 57, 15, "yes", 5, 18, 6, 5),
        (0, "male", 22, 0.75, "no", 2, 17, 6, 3),
        (0, "female", 32, 1.5, "no", 2, 17, 5, 5),
        (0, "female", 22, 0.75, "no", 2, 12, 1, 3),
        (0, "male", 57, 15, "yes", 2, 14, 4, 4),
        (0, "female", 32, 15, "yes", 4, 16, 1, 2),
        (0, "male", 22, 1.5, "no", 4, 14, 4, 5),
        (0, "male", 37, 15, "yes", 2, 20, 7, 2),
        (0, "male", 27, 4, "yes", 4, 18, 6, 4),
        (0, "male", 47, 15, "yes", 5, 17, 6, 4),
        (0, "female", 22, 1.5, "no", 2, 17, 5, 4),
        (0, "female", 27, 4, "no", 4, 14, 5, 4),
        (0, "female", 37, 15, "yes", 1, 17, 5, 5),
        (0, "female", 37, 15, "yes", 2, 18, 4, 3),
        (0, "female", 22, 0.75, "no", 3, 16, 5, 4),
        (0, "female", 22, 1.5, "no", 2, 16, 5, 5),
        (0, "female", 27, 10, "yes", 2, 14, 1, 5),
        (1, "female", 32, 15, "yes", 3, 14, 3, 2),
        (1, "female", 27, 7, "yes", 4, 16, 1, 2),
        (1, "male", 42, 15, "yes", 3, 18, 6, 2),
        (1, "female", 42, 15, "yes", 2, 14, 3, 2),
        (1, "male", 27, 7, "yes", 2, 17, 5, 4),
        (1, "male", 32, 10, "yes", 4, 14, 4, 3),
        (1, "male", 47, 15, "yes", 3, 16, 4, 2),
        (0, "male", 37, 4, "yes", 2, 20, 6, 4)
    ]
    
    # Note: the column named "label" is just another field; the training label used below is "affairs"
    columns = ["affairs", "gender", "age", "label", "children", "religiousness", "education", "occupation", "rating"]
    
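    # Build the DataFrame with pandas first, for convenience
    pdf = pd.DataFrame(sample_dataset, columns=columns)
    
    # Convert the pandas DataFrame into a Spark DataFrame (`df` is used in the steps below)
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("lr-demo").getOrCreate()
    df = spark.createDataFrame(pdf)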
    
    # 2. Feature selection: affairs is the target, the rest are features. In real work this is the most tedious part: many tables, data cleaning
    df2 = df.select("affairs","age", "religiousness", "education", "occupation", "rating")
    
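    # 3. Assemble features: merge the feature columns into a single "features" column. Categorical data must be one-hot encoded before merging, which is rather tedious
    # 3.1 Columns used to build the feature vector
    colArray2 = ["age", "religiousness", "education", "occupation", "rating"]
    # 3.2 Compute the feature vector
    df3 = VectorAssembler().setInputCols(colArray2).setOutputCol("features").transform(df2)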
    
    # 4. Split randomly into a training set and a test set
    trainDF, testDF = df3.randomSplit([0.8,0.2])
    # print("Training set:")
    # trainDF.show(10)
    # print("Test set:")
    # testDF.show(10)
    
    # 5. Train the model
    from pyspark.ml.classification import LogisticRegression
    # 5.1 Create the logistic regression estimator
    lr = LogisticRegression()
    # 5.2 Fit the model on the training set
    model = lr.setLabelCol("affairs").setFeaturesCol("features").fit(trainDF)
    # 5.3 Predict on the test set
    model.transform(testDF).show()
    
    # TODO
    # 6. Evaluation, cross-validation, saving, packaging.....

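    This is mainly kept as a historical note, and also as a counterexample: if you call packages without understanding the underlying principles, you will find how boring ML really is, at least judging by the code routine.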

  • Original article: https://www.cnblogs.com/chenjieyouge/p/12535430.html