
    Examples of Machine Learning Toolkit Usage

    Scikit-learn

    KFold (k-fold cross-validation)

    >>> import numpy as np
    >>> from sklearn.model_selection import KFold
    
    >>> X = ["a", "b", "c", "d"]
    >>> kf = KFold(n_splits=2)
    >>> for train, test in kf.split(X):
    ...     print("%s %s" % (train, test))
    [2 3] [0 1]
    [0 1] [2 3]
    

    Reference : http://scikit-learn.org/stable/modules/cross_validation.html#k-fold
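By default, KFold splits the data in order, as shown above. A small sketch (values are illustrative) of the shuffled variant, which randomizes fold membership while still covering every sample exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8)
# shuffle=True randomizes sample order before folding;
# random_state makes the shuffled folds reproducible
kf = KFold(n_splits=4, shuffle=True, random_state=0)
test_folds = [test for _, test in kf.split(X)]
```

Concatenating the test folds recovers every index exactly once, which is the defining property of k-fold splitting.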

    Decision Tree Classification

    >>> from sklearn import tree
    >>> X = [[0, 0], [1, 1]]
    >>> Y = [0, 1]
    >>> clf = tree.DecisionTreeClassifier()
    >>> clf = clf.fit(X, Y)
    >>> clf.predict([[2., 2.]])
    array([1])
    

    Reference : http://scikit-learn.org/stable/modules/tree.html#classification
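Besides hard labels, a fitted tree can also report class-membership probabilities via `predict_proba`; a minimal sketch reusing the toy data above:

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier().fit(X, Y)
# class-membership probabilities for the queried point:
# the point falls in a pure class-1 leaf, so the row is [0., 1.]
proba = clf.predict_proba([[2., 2.]])
```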

    KNN (k-Nearest Neighbors)

    A Chinese proverb helps build intuition for this algorithm: "he who stays near vermilion gets stained red; he who stays near ink gets stained black" (近朱者赤,近墨者黑) — a sample is assigned the majority class among its nearest neighbors.

    from sklearn.neighbors import KNeighborsClassifier
    
    knc = KNeighborsClassifier()
    knc.fit(X_train, y_train)
    y_pred = knc.predict(X_test)
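
The snippet above assumes `X_train`, `y_train`, and `X_test` already exist. A self-contained sketch on the iris data (split parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

knc = KNeighborsClassifier()  # n_neighbors defaults to 5
knc.fit(X_train, y_train)
score = knc.score(X_test, y_test)
```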
    

    Logistic Regression

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.model_selection import train_test_split
    >>> iris = load_iris()
    >>> x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    >>> model = LogisticRegression(penalty='l2', random_state=0, solver='newton-cg', multi_class='multinomial')
    >>> model.fit(x_train, y_train)
    >>> y_pred = model.predict(x_test)
    

    Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
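`predict_proba` likewise exposes per-class probabilities for logistic regression; a minimal self-contained sketch (solver choice and split parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

model = LogisticRegression(solver='newton-cg')
model.fit(x_train, y_train)
# each row of proba is a probability distribution over the three classes
proba = model.predict_proba(x_test)
```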

    Leave One Out (LOO)

    >>> from sklearn.model_selection import LeaveOneOut
    
    >>> X = [1, 2, 3, 4]
    >>> loo = LeaveOneOut()
    >>> for train, test in loo.split(X):
    ...     print("%s %s" % (train, test))
    [1 2 3] [0]
    [0 2 3] [1]
    [0 1 3] [2]
    [0 1 2] [3]
    

    Reference : http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo
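Combined with `cross_val_score`, LeaveOneOut fits one model per sample; a sketch on the iris data (the classifier choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# one fit per sample: 150 models on iris, so LOO is costly on large data
scores = cross_val_score(KNeighborsClassifier(), iris.data, iris.target,
                         cv=LeaveOneOut())
```

Each fold holds out a single sample, so each score is 0 or 1 and the mean is the LOO accuracy.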

    train_test_split (random split)

    Randomly splits arrays or matrices into train and test subsets.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    

    Parameter test_size

    If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.

    If int, it represents the absolute number of test samples.

    If None, the value is set to the complement of the train size.

    Parameter random_state

    If int, random_state is the seed used by the random number generator;

    If a RandomState instance, random_state is the random number generator;

    If None, the random number generator is the RandomState instance used by np.random.
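A quick sketch of why fixing `random_state` matters: the same integer seed always reproduces the same split (toy data is illustrative).

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
a_train, a_test = train_test_split(data, test_size=0.3, random_state=33)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=33)
# identical seed -> identical split, which makes experiments repeatable
```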

    StandardScaler (feature standardization)

    Standardizes the features so that each dimension has zero mean and unit variance, preventing predictions from being dominated by features with much larger scales.

    from sklearn.preprocessing import StandardScaler
    
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    

    Reference: 《Python机器学习及实践》 https://book.douban.com/subject/26886337
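
A small sanity-check sketch (toy data is illustrative) confirming what `fit_transform` does to each feature column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., 10.], [2., 20.], [3., 30.]])
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)
# each column of the result now has mean 0 and standard deviation 1
```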

    In Practice

    StandardScaler does not perform well on the iris (Iris, 鸢尾花) dataset. Without applying StandardScaler to the features, we obtain:

    accuracy 0.947368

    avg precision 0.96

    avg recall 0.95

    f1-score 0.95

    The code:

    # -*- encoding=utf8 -*-
    
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report
    
    
    if __name__ == '__main__':
        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    
        knc = KNeighborsClassifier()
        knc.fit(X_train, y_train)
        y_pred = knc.predict(X_test)
    
        print("accuracy is %f" % (knc.score(X_test, y_test)))
        print(classification_report(y_test, y_pred, target_names=iris.target_names))
    

    After applying StandardScaler, all four metrics actually drop, as shown below:

    accuracy 0.894737

    avg precision 0.92

    avg recall 0.89

    f1-score 0.90

    The code using StandardScaler:

    # -*- encoding=utf8 -*-
    
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report
    from sklearn.preprocessing import StandardScaler
    
    
    if __name__ == '__main__':
        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    
        # Standardize the features so each dimension has unit variance and
        # zero mean, preventing large-scale features from dominating predictions
        ss = StandardScaler()
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)
    
        knc = KNeighborsClassifier()
        knc.fit(X_train, y_train)
        y_pred = knc.predict(X_test)
    
        print("accuracy is %f" % (knc.score(X_test, y_test)))
        print(classification_report(y_test, y_pred, target_names=iris.target_names))
    

    This is a curious result that deserves further investigation. One plausible factor: the four iris features are all lengths in centimeters with comparable ranges, so standardization gains little here and may slightly distort the neighbor distances that KNN relies on.

    shuffle (random shuffling)

    This function shuffles arrays in a consistent way, so that, for example, training data and labels stay aligned after shuffling.

    from sklearn.utils import shuffle
    
    x = [1,2,3,4]
    y = [1,2,3,4]
    
    x,y = shuffle(x,y)
    

    Out:

    x : [1,4,3,2]

    y : [1,4,3,2]

    Reference : http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
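Passing `random_state` makes the permutation reproducible; note that the pairing between the arrays is preserved (toy values are illustrative):

```python
from sklearn.utils import shuffle

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
# both lists are permuted with the same random permutation,
# so x2[i] and y2[i] still belong together
x2, y2 = shuffle(x, y, random_state=0)
```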

    Classification Report

    Precision, recall and F1-score.

    >>> from sklearn.metrics import classification_report
    >>> print(classification_report(y_test, y_pred, target_names=iris.target_names))
    
                  precision    recall  f1-score   support
    
          setosa       1.00      1.00      1.00         8
      versicolor       0.79      1.00      0.88        11
       virginica       1.00      0.84      0.91        19
    
        accuracy                           0.92        38
       macro avg       0.93      0.95      0.93        38
    weighted avg       0.94      0.92      0.92        38
    

    reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
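When the metrics need to be consumed programmatically rather than printed, `classification_report` can return a nested dictionary instead of a string (toy labels are illustrative):

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
# output_dict=True returns the table as a nested dict,
# keyed by class label plus 'accuracy', 'macro avg', 'weighted avg'
report = classification_report(y_true, y_pred, output_dict=True)
```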

    XGBoost

    from xgboost import XGBClassifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    
    
    if __name__ == '__main__':
        iris = load_iris()
        x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)
    
        xgb = XGBClassifier()
        xgb.fit(x_train, y_train)
        y_pred = xgb.predict(x_test)
    
        print(classification_report(y_test, y_pred))
    

    Experimental results:

                 precision    recall  f1-score   support
    
              0       1.00      1.00      1.00        14
              1       0.93      1.00      0.97        14
              2       1.00      0.90      0.95        10
    
    avg / total       0.98      0.97      0.97        38
    
    Original post: https://www.cnblogs.com/fengyubo/p/8024884.html