
Installation

If a direct install works, so much the better:

    pip install xgboost

If it doesn't, download the package and install it locally.

Installer downloads are available here (the page hosts just about any package you might want).
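A local install of a downloaded wheel might look like this (the filename below is hypothetical; pick the one matching your Python version and platform):

pip install xgboost-1.6.2-cp38-cp38-win_amd64.whl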

Dataset

The pima-indians-diabetes.csv file

It contains medical survey data on Pima Indian patients; the prediction target is whether the subject has diabetes.

    # 1. Number of times pregnant
# 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    # 3. Diastolic blood pressure (mm Hg)
    # 4. Triceps skin fold thickness (mm)
    # 5. 2-Hour serum insulin (mu U/ml)
    # 6. Body mass index (weight in kg/(height in m)^2)
    # 7. Diabetes pedigree function
    # 8. Age (years)
    # 9. Class variable (0 or 1)

In total there are 8 feature variables and 1 classification label.
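A quick sanity check after loading (a minimal sketch; the standard version of this dataset has 768 rows):

from numpy import loadtxt

# Load the CSV and confirm the shape: 768 samples, 9 columns (8 features + 1 label)
datasets = loadtxt('pima-indians-diabetes.csv', delimiter=',')
print(datasets.shape)  # expected: (768, 9)
print(datasets[0])     # first sample; the label is in the last column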

Using XGBoost

Basic usage framework

    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
# Load the dataset
    datasets = loadtxt('pima-indians-diabetes.csv', delimiter=',')
    
# Split into features and labels
    X = datasets[:,0:8]
    Y = datasets[:,8]
    
# Split into training and test sets
    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    
# Create and train the model
    model = XGBClassifier()
    model.fit(X_train, y_train)
    
# Make predictions
    y_pred = model.predict(X_test)
    predictions = [round(i) for i in y_pred]
    
# Compute accuracy
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" %(accuracy * 100) )
    Accuracy: 77.95%
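Since XGBClassifier.predict already returns class labels, the round step above is mostly a safeguard. If you want the raw class probabilities instead, predict_proba provides them (a small sketch, reusing the variables from the block above):

# Probability of each class per sample; column 1 is P(class == 1)
y_proba = model.predict_proba(X_test)
print(y_proba[:5])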

Watching the training process

XGBoost improves the model step by step by adding new trees on top of the existing ensemble.

If you want to watch this incremental improvement, you can do the following:

    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
# Load the dataset
    datasets = loadtxt('pima-indians-diabetes.csv', delimiter=',')
    
# Split into features and labels
    X = datasets[:,0:8]
    Y = datasets[:,8]
    
# Split into training and test sets
    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    
# Create and train the model
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train,  # training data
          early_stopping_rounds=10,  # stop once the eval loss has not improved for 10 consecutive rounds
          eval_metric='logloss',   # evaluation metric
          eval_set=eval_set,   # evaluation set, scored after every boosting round
          verbose=True  # print each round's evaluation result
         )
    
# Make predictions
    y_pred = model.predict(X_test)
    predictions = [round(i) for i in y_pred]
    
# Compute accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100))

The printed output shows how the loss evolves round by round:

    [0]    validation_0-logloss:0.660186
    Will train until validation_0-logloss hasn't improved in 10 rounds.
    [1]    validation_0-logloss:0.634854
    [2]    validation_0-logloss:0.61224
    [3]    validation_0-logloss:0.593118
    [4]    validation_0-logloss:0.578303
    [5]    validation_0-logloss:0.564942
    [6]    validation_0-logloss:0.555113
    [7]    validation_0-logloss:0.54499
    [8]    validation_0-logloss:0.539151
    [9]    validation_0-logloss:0.531819
    [10]    validation_0-logloss:0.526065
    [11]    validation_0-logloss:0.519769
    [12]    validation_0-logloss:0.514979
    [13]    validation_0-logloss:0.50927
    [14]    validation_0-logloss:0.506086
    [15]    validation_0-logloss:0.503565
    [16]    validation_0-logloss:0.503591
    [17]    validation_0-logloss:0.500805
    [18]    validation_0-logloss:0.497605
    [19]    validation_0-logloss:0.495328
    [20]    validation_0-logloss:0.494777
    [21]    validation_0-logloss:0.494274
    [22]    validation_0-logloss:0.493333
    [23]    validation_0-logloss:0.492211
    [24]    validation_0-logloss:0.491936
    [25]    validation_0-logloss:0.490578
    [26]    validation_0-logloss:0.490895
    [27]    validation_0-logloss:0.490646
    [28]    validation_0-logloss:0.491911
    [29]    validation_0-logloss:0.491407
    [30]    validation_0-logloss:0.488828
    [31]    validation_0-logloss:0.487867
    [32]    validation_0-logloss:0.487297
    [33]    validation_0-logloss:0.487562
    [34]    validation_0-logloss:0.487789
    [35]    validation_0-logloss:0.487962
    [36]    validation_0-logloss:0.488218
    [37]    validation_0-logloss:0.489582
    [38]    validation_0-logloss:0.489334
    [39]    validation_0-logloss:0.490968
    [40]    validation_0-logloss:0.48978
    [41]    validation_0-logloss:0.490704
    [42]    validation_0-logloss:0.492369
    Stopping. Best iteration:
    [32]    validation_0-logloss:0.487297
    
    Accuracy: 77.56%
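After early stopping, the fitted model records the best round, which you can read back (a minimal sketch; attribute names follow the xgboost scikit-learn wrapper and may vary slightly across versions):

# Best boosting round found by early stopping and its validation loss
print(model.best_iteration)  # e.g. 32 in the run above
print(model.best_score)      # e.g. 0.487297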

Plotting feature importance

    from numpy import loadtxt
    from xgboost import XGBClassifier
    from xgboost import plot_importance 
    from matplotlib import pyplot
    
    
# Load the dataset
    datasets = loadtxt('pima-indians-diabetes.csv', delimiter=',')
    
# Split into features and labels
    X = datasets[:,0:8]
    Y = datasets[:,8]
    
# Create and train the model
    model = XGBClassifier()
    model.fit(X, Y)
    
# Plot feature importance
    plot_importance(model)
    pyplot.show()
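By default plot_importance ranks features by the number of times they are used in a split ("weight"). The importance_type argument switches the criterion; for example, ranking by average split gain instead:

# Rank features by the average gain of the splits that use them
plot_importance(model, importance_type='gain')
pyplot.show()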

Parameter tuning

XGBoost exposes many parameters that can be tuned.

Common parameters

Learning rate

learning_rate is generally set to 0.1 or lower.

Tree-related parameters

max_depth: maximum tree depth

min_child_weight: minimum sum of instance weights required in a child node

subsample: fraction of the training samples randomly drawn for each tree

colsample_bytree: fraction of the features randomly sampled for each tree

gamma: minimum loss reduction required to make a further split

Regularization parameters

lambda: L2 regularization weight

alpha: L1 regularization weight
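In the scikit-learn wrapper these are exposed as reg_lambda (L2) and reg_alpha (L1); a short sketch of setting them (the values are illustrative only):

# L1/L2 regularization on leaf weights; larger values make the model more conservative
model = XGBClassifier(reg_lambda=1.0, reg_alpha=0.1)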

Other parameters, by example

A more detailed list of parameters is available in the official documentation.

xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,  # number of boosting rounds (trees)
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',  # learning objective: logistic loss for binary classification
    nthread=4,  # number of parallel threads
    scale_pos_weight=1,  # balance of positive and negative class weights
    seed=27  # random seed
)

Parameter search example

    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    
# Load the dataset
    datasets = loadtxt('pima-indians-diabetes.csv', delimiter=',')
    
# Split into features and labels
    X = datasets[:,0:8]
    Y = datasets[:,8]
    
# Create the model
    model = XGBClassifier()
    
# Candidate learning rates
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)  # GridSearchCV expects the grid as a dict
    
# Stratified 10-fold cross-validation
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    
# Grid search for the best learning rate
grid_search = GridSearchCV(model, 
                           param_grid, 
                           scoring='neg_log_loss', 
                           n_jobs=-1,  # use all available CPU cores
                           cv=kfold)
grid_search = grid_search.fit(X, Y)
    
# Print the results
print("Best: %f using %s" % (grid_search.best_score_, grid_search.best_params_))
means = grid_search.cv_results_['mean_test_score']
params = grid_search.cv_results_['params']
    
    for mean, param in zip(means, params):
        print("%f with: %r" % (mean, param))

Output:

    Best: -0.483304 using {'learning_rate': 0.1}
    -0.689811 with: {'learning_rate': 0.0001}
    -0.661827 with: {'learning_rate': 0.001}
    -0.531155 with: {'learning_rate': 0.01}
    -0.483304 with: {'learning_rate': 0.1}
    -0.515642 with: {'learning_rate': 0.2}
    -0.554158 with: {'learning_rate': 0.3}
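The same pattern extends to several parameters at once, though the grid grows multiplicatively. A sketch that tunes tree depth and leaf weight together, reusing the kfold object from above (the value ranges are illustrative, not recommendations):

# 3 x 3 = 9 parameter combinations, each evaluated with 10-fold cross-validation
param_grid = dict(max_depth=[3, 5, 7], min_child_weight=[1, 3, 5])
grid_search = GridSearchCV(XGBClassifier(learning_rate=0.1),
                           param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=kfold)
grid_search = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_search.best_score_, grid_search.best_params_))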