A Beginner's Guide to scikit-learn

    scikit-learn is a very powerful machine learning library that provides implementations of many common machine learning algorithms.

    scikit-learn can be installed via pip:

    pip install -U scikit-learn
    

    The package is fairly large; if installing via pip times out, you can download a .whl file built for your platform from PyPI and install that instead.

    After a successful installation, import it in Python:

    import sklearn
    

    sklearn's official documentation is detailed and clear; working through the User Guide is the recommended way to learn the library.

    Dataset Loading

    sklearn builds on numpy's arrays and vectorized operations, so data can be loaded the numpy way:

    import numpy
    
    dataSet = numpy.loadtxt('dataSet.txt')  # whitespace-separated numeric text
    

    dataSet is a numpy ndarray.
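
    As a minimal illustration (with made-up data), loadtxt also accepts any file-like object and returns a 2-D ndarray, one row per line of text:

```python
import numpy as np
from io import StringIO

# two samples with three columns each; a file path would work the same way
text = StringIO("1.0 2.0 0\n3.0 4.0 1\n")
dataSet = np.loadtxt(text)
print(dataSet.shape)  # (2, 3)
```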

    Alternatively, load data stored in libsvm format:

    from sklearn.datasets import load_svmlight_file
    
    X_train, y_train = load_svmlight_file("dataSet.txt")
    X_train.todense()  # convert the sparse matrix to a dense one
    
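
    For illustration, the svmlight/libsvm text format stores one sample per line as `label index:value index:value ...`, listing only the nonzero features. A toy parser sketch of that line layout (load_svmlight_file handles the real format, including comments and index conventions):

```python
def parse_svmlight_line(line):
    """Parse one 'label index:value index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

label, features = parse_svmlight_line("1 1:0.5 3:2.0")
print(label, features)  # 1.0 {1: 0.5, 3: 2.0}
```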

    The sklearn package also ships with some built-in example datasets:

    from sklearn import datasets
    
    iris = datasets.load_iris()
    print(iris.data)
    

    The code above loads the famous Anderson Iris dataset: iris.data holds the feature values and iris.target holds the class labels.

    For more on loading data, see User Guide - Dataset loading utilities.

    Supervised learning

    LinearRegression

    Linear regression is the most classic of algorithms:

    from sklearn import linear_model
    
    train_x = [[0, 0], [1, 1]]
    train_y = [0, 1]
    test_x = [[0, 0.2]]
    regr = linear_model.LinearRegression()
    regr.fit(train_x, train_y)
    print(regr.predict(test_x))
    
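
    Under the hood, LinearRegression solves an ordinary least-squares problem; a minimal numpy-only sketch of the same fit on the data above:

```python
import numpy as np

# training data from the example above, with a bias column prepended
train_x = np.array([[0.0, 0.0], [1.0, 1.0]])
train_y = np.array([0.0, 1.0])
A = np.hstack([np.ones((2, 1)), train_x])  # design matrix with intercept

# least-squares solution (minimum-norm, like sklearn's lstsq-based solver)
w, *_ = np.linalg.lstsq(A, train_y, rcond=None)

test_x = np.array([1.0, 0.0, 0.2])  # bias term plus the features [0, 0.2]
pred = test_x @ w
print(pred)  # ≈ 0.1, the value LinearRegression predicts for [0, 0.2]
```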

    And its common variant, logistic regression:

    from sklearn import linear_model
    
    train_x = [[0, 0], [1, 1]]
    train_y = [0, 1]
    test_x = [[0, 0.2]]
    regr = linear_model.LogisticRegression()
    regr.fit(train_x, train_y)
    print(regr.predict(test_x))
    
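
    Logistic regression turns a linear score into a probability with the logistic (sigmoid) function; a small sketch with hypothetical weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical learned weights w and bias b, for illustration only
w, b = np.array([1.0, 1.0]), -1.0
x = np.array([0.0, 0.2])
p = sigmoid(x @ w + b)  # probability of class 1
label = int(p >= 0.5)   # decision threshold at 0.5
print(p, label)         # p ≈ 0.31, so the predicted label is 0
```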

    For more linear models, see User Guide - Linear Models.

    Support Vector Machine

    SVMs are highly effective classifiers; sklearn provides three SVM-based classifiers: SVC, NuSVC, and LinearSVC.

    SVC and NuSVC are very similar. SVC controls the tightness of the fit through the penalty parameter C (cost), which takes values in (0, ∞); NuSVC instead uses nu, a bound on the fraction of misclassified training samples, which takes values in (0, 1].

    from sklearn import svm
    
    train_x = [[0, 0], [1, 1]]
    train_y = [0, 1]
    clf = svm.SVC()
    clf.fit(train_x, train_y)
    print(clf.predict([[0.9, 0.9]]))
    

    SVC and NuSVC use a one-against-one strategy for multiclass classification:

    from sklearn import svm
    
    train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
    train_y = [0, 1, 2, 3]
    clf = svm.SVC(decision_function_shape='ovo')
    clf.fit(train_x, train_y)
    print(clf.predict([[1.9, 1.9]]))
    
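
    With k classes, one-against-one trains k(k-1)/2 pairwise classifiers and predicts by majority vote. A sketch with hypothetical pairwise outcomes:

```python
from collections import Counter
from itertools import combinations

classes = [0, 1, 2, 3]
# hypothetical winners of the 6 pairwise classifiers for one test point
pairwise_winner = {(0, 1): 1, (0, 2): 2, (0, 3): 0, (1, 2): 2, (1, 3): 1, (2, 3): 2}

votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
prediction = votes.most_common(1)[0][0]
print(prediction)  # class 2 wins with 3 of the 6 votes
```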

    LinearSVC uses a one-against-rest strategy for multiclass classification:

    from sklearn import svm
    
    train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
    train_y = [0, 1, 2, 3]
    clf = svm.LinearSVC()
    clf.fit(train_x, train_y)
    print(clf.predict([[1.9, 1.9]]))
    
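
    One-against-rest instead trains one classifier per class and picks the class whose decision score is highest; a sketch with hypothetical scores:

```python
import numpy as np

# hypothetical decision scores from 4 one-vs-rest classifiers for one test point
scores = np.array([-1.2, 0.3, 1.1, -0.5])
prediction = int(np.argmax(scores))  # the highest-scoring class wins
print(prediction)  # 2
```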

    For more about SVMs, see the User Guide.

    K Nearest Neighbors

    K-nearest neighbors (KNN) is a very simple classification algorithm:

    from sklearn.neighbors import NearestNeighbors
    import numpy as np
    
    x = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
    y = [[0, 0], [-1, 2], [3,1]]
    nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(x)
    dist, index = nbrs.kneighbors(y)
    print(dist)
    print(index)
    

    dist gives, for each query point in y, the distances to its nearest neighbors in x:

    [[ 1.41421356  1.41421356  2.23606798]
     [ 2.23606798  3.          3.16227766]
     [ 1.          1.          2.        ]]
    

    index gives the indices of those nearest neighbors:

    [[0 3 1]
     [3 0 1]
     [4 5 3]]
    

    The number of neighbors is set by the n_neighbors parameter; the algorithm parameter chooses the search structure, 'kd_tree' or 'ball_tree' (as well as 'brute' and 'auto').
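
    The same neighbor search can be reproduced with a brute-force numpy distance computation (a sketch of what the 'brute' backend does):

```python
import numpy as np

x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([[0, 0], [-1, 2], [3, 1]], dtype=float)

# pairwise Euclidean distances: d[i, j] = ||y[i] - x[j]||
d = np.linalg.norm(y[:, None, :] - x[None, :, :], axis=2)
index = np.argsort(d, axis=1, kind='stable')[:, :3]  # 3 nearest neighbors per query
dist = np.take_along_axis(d, index, axis=1)          # their distances
print(index)  # [[0 3 1] [3 0 1] [4 5 3]], as above
print(dist)
```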

    For more on the KNN algorithm, see the User Guide.

    Naive Bayes

    Naive Bayes is a classic probabilistic classification algorithm:

    from sklearn import datasets
    from sklearn.naive_bayes import GaussianNB
    
    iris = datasets.load_iris()
    gnb = GaussianNB()
    gnb.fit(iris.data, iris.target)
    y_pred = gnb.predict(iris.data)
    y_proba = gnb.predict_proba(iris.data)
    
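
    GaussianNB models each feature within each class as a Gaussian and predicts the class with the largest prior-weighted likelihood. A one-feature sketch with hypothetical class statistics:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# two classes, one feature; hypothetical (mean, variance) estimated from training data
stats = {0: (1.0, 0.5), 1: (3.0, 0.5)}
prior = {0: 0.5, 1: 0.5}

x = 2.6
post = {c: prior[c] * gaussian_pdf(x, mu, var) for c, (mu, var) in stats.items()}
prediction = max(post, key=post.get)
print(prediction)  # class 1, since x lies much closer to its mean
```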

    For more, see the User Guide.

    Decision Tree

    sklearn provides decision tree implementations for both classification and regression:

    from sklearn import tree
    x = [[0, 0], [1, 1]]
    y = [0, 1]
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(x, y)
    clf.predict([[2, 2]])  # array([1]): the predicted class
    clf.predict_proba([[2., 2.]])  # array([[ 0.,  1.]]): the estimated per-class probabilities
    
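
    Decision trees grow by choosing splits that reduce node impurity; DecisionTreeClassifier uses the Gini criterion by default. A small sketch of the impurity computation:

```python
def gini(labels):
    """Gini impurity 1 - sum(p_k^2); 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 1, 1]))          # 0.5 before splitting
print(gini([0, 0]), gini([1, 1]))  # 0.0 0.0: a perfect split yields two pure children
```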

    Regression:

    from sklearn import tree
    
    x = [[0, 0], [2, 2]]
    y = [0.5, 2.5]
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(x, y)
    clf.predict([[1, 1]])  # array([ 0.5])
    

    For more about decision trees, see the User Guide.

    Random Forest

    A random forest is an ensemble method that combines multiple decision trees for classification.

    from sklearn.ensemble import RandomForestClassifier
    
    train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
    train_y = [0, 1, 2, 3]
    test_x = [[0.9, 0.9]]
    clf = RandomForestClassifier(n_estimators=10)
    clf = clf.fit(train_x, train_y)
    print(clf.predict(test_x))
    
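
    At prediction time the forest aggregates its trees' outputs (sklearn actually averages the trees' predicted probabilities, but a plain majority vote conveys the idea). A sketch with hypothetical per-tree predictions:

```python
from collections import Counter

# hypothetical predictions from 10 trees for one test point
tree_preds = [1, 1, 0, 1, 2, 1, 1, 0, 1, 1]
prediction = Counter(tree_preds).most_common(1)[0][0]
print(prediction)  # 1, the class predicted by 7 of the 10 trees
```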

    Cross validation

    Cross-validation is an important technique for assessing and improving predictive accuracy; sklearn provides tools to split a dataset into training and validation folds:

    from sklearn.model_selection import cross_val_score
    from sklearn import svm
    from sklearn import datasets
    
    iris = datasets.load_iris()
    clf = svm.SVC()
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    

    scores holds the classification accuracy measured on each of the five folds.
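
    A minimal numpy sketch of what a k-fold splitter produces (sklearn's KFold adds shuffling and stratified variants):

```python
import numpy as np

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs over k contiguous folds of n samples."""
    idx = np.arange(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

folds = list(kfold_indices(10, 5))
print(len(folds))            # 5 folds
print(folds[0][1].tolist())  # [0, 1] -- the first validation fold
```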

    Original article: https://www.cnblogs.com/Finley/p/5816097.html