  • Machine Learning Algorithms — Let the Machine Make Me Learn (1)

    Categories of machine learning:

      Supervised learning:

        Continuous outputs are called regression, e.g. stock-price prediction

        Discrete outputs are called classification, e.g. digit recognition

        (Common supervised tasks: regression and predictive analysis; classification; spam detection; pattern detection; natural language processing; sentiment analysis; automatic image classification; automatic sequence generation; etc.)

      Unsupervised learning:

        Object segmentation, similarity detection, automatic labeling

      Reinforcement learning: an agent learns by interacting with an environment and receiving rewards.

        (I don't really understand this one yet. Isn't there also semi-supervised learning? I should probably go back to Zhou Zhihua's machine learning book; the one I'm reading now is by an Italian author.)

    Feature selection and feature engineering:

      Creating training and test sets:

      The training and test sets must follow the distribution of the original data and be randomly shuffled, to avoid correlations between consecutive elements.

      One way to split the data into training and test sets at a 0.75 : 0.25 ratio is shown below; random_state=1000 fixes the random seed so the experiment can be reproduced.

    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
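
      When the classes are unbalanced, a purely random split may not reproduce the original label distribution in both subsets; train_test_split accepts a stratify argument for this. A minimal sketch, assuming X and Y are already defined as in the snippet above:

    from sklearn.model_selection import train_test_split

    # Stratified split: the class proportions of Y are preserved
    # in both Y_train and Y_test
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.25, random_state=1000, stratify=Y)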

      Managing categorical data: converting string labels into numeric values:

    # Create a dummy dataset: 10 samples, 2 features, string labels
    import numpy as np
    X = np.random.uniform(0.0, 1.0, size=(10, 2))
    Y = np.random.choice(('Male', 'Female'), size=(10))

    # Encode the labels as integer values
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    yt = le.fit_transform(Y)
    print(yt)

      Decoding back to the original labels:

    # Map the integer codes back to the original class names
    output = [1, 0, 1, 1, 0, 0]
    decoded_output = [le.classes_[i] for i in output]
    decoded_output

      Other encodings exist, such as one-hot encoding: each label is turned into an array in which exactly one element is 1 and all the others are 0:

    from sklearn.preprocessing import LabelBinarizer
    lb = LabelBinarizer()
    Yb = lb.fit_transform(Y)
    Yb  # the one-hot encoded labels

      When the amount of data is small, dictionary-based encodings can also be used:

    data = [
        {'feature_1': 10.0, 'feature_2': 15.0},
        {'feature_1': -5.0, 'feature_3': 33.0},
        {'feature_3': -2.0, 'feature_4': 10.0}
    ]

    from sklearn.feature_extraction import DictVectorizer, FeatureHasher
    dv = DictVectorizer()
    Y_dict = dv.fit_transform(data)
    Y_dict.todense()
    # Produces a sparse matrix with one column per distinct feature name
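
      The import above also brings in FeatureHasher, which the snippet never uses; a minimal sketch of it, reusing the same data list (n_features=8 is an arbitrary illustrative size):

    from sklearn.feature_extraction import FeatureHasher

    # FeatureHasher hashes feature names to column indices instead of
    # storing a vocabulary, so the output width stays fixed even when
    # many distinct feature names appear (at the cost of collisions)
    fh = FeatureHasher(n_features=8)
    Y_hashed = fh.fit_transform(data)
    Y_hashed.todense()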


      One-hot encoding via OneHotEncoder, where only the first column is treated as categorical:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    data1 = np.array([
        [0, 10],
        [1, 11],
        [1, 8],
        [0, 12],
        [0, 15]
    ])

    # In older scikit-learn versions this was written as
    # OneHotEncoder(categorical_features=[0]); that argument has been
    # removed, so a ColumnTransformer now selects the column to encode
    ct = ColumnTransformer([('oh', OneHotEncoder(), [0])], remainder='passthrough')
    Y_oh = ct.fit_transform(data1)
    print(Y_oh)


      Managing missing features: fill in missing values with the mean, the median, or the most frequent value:

    import numpy as np
    # Imputer was removed from sklearn.preprocessing;
    # SimpleImputer in sklearn.impute replaces it
    from sklearn.impute import SimpleImputer

    data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])
    print(data)

    # Impute with the column mean
    print('Mean strategy')
    imp = SimpleImputer(strategy='mean')
    print(imp.fit_transform(data))

    # Impute with the column median
    print('Median strategy')
    imp = SimpleImputer(strategy='median')
    print(imp.fit_transform(data))

    # Impute with the most frequent value in each column
    print('Most-frequent strategy')
    imp = SimpleImputer(strategy='most_frequent')
    print(imp.fit_transform(data))


      Data scaling and normalization:

    from __future__ import print_function
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler, RobustScaler

    # For reproducibility
    np.random.seed(1000)

    if __name__ == '__main__':
        # Create a dummy dataset
        data = np.ndarray(shape=(100, 2))
        for i in range(100):
            data[i, 0] = 2.0 + np.random.normal(1.5, 3.0)
            data[i, 1] = 0.5 + np.random.normal(1.5, 3.0)

        # Show the original and the scaled dataset
        fig, ax = plt.subplots(1, 2, figsize=(14, 5))

        ax[0].scatter(data[:, 0], data[:, 1])
        ax[0].set_xlim([-10, 10])
        ax[0].set_ylim([-10, 10])
        ax[0].grid()
        ax[0].set_xlabel('X')
        ax[0].set_ylabel('Y')
        ax[0].set_title('Raw data')

        # Scale data to zero mean and unit variance
        ss = StandardScaler()
        scaled_data = ss.fit_transform(data)

        ax[1].scatter(scaled_data[:, 0], scaled_data[:, 1])
        ax[1].set_xlim([-10, 10])
        ax[1].set_ylim([-10, 10])
        ax[1].grid()
        ax[1].set_xlabel('X')
        ax[1].set_ylabel('Y')
        ax[1].set_title('Scaled data')

        plt.show()

        # Scale data using a RobustScaler (quantile-based, robust to outliers)
        fig, ax = plt.subplots(2, 2, figsize=(8, 8))

        ax[0, 0].scatter(data[:, 0], data[:, 1])
        ax[0, 0].set_xlim([-10, 10])
        ax[0, 0].set_ylim([-10, 10])
        ax[0, 0].grid()
        ax[0, 0].set_xlabel('X')
        ax[0, 0].set_ylabel('Y')
        ax[0, 0].set_title('Raw data')

        rs = RobustScaler(quantile_range=(15, 85))
        scaled_data = rs.fit_transform(data)

        ax[0, 1].scatter(scaled_data[:, 0], scaled_data[:, 1])
        ax[0, 1].set_xlim([-10, 10])
        ax[0, 1].set_ylim([-10, 10])
        ax[0, 1].grid()
        ax[0, 1].set_xlabel('X')
        ax[0, 1].set_ylabel('Y')
        ax[0, 1].set_title('Scaled data (15% - 85%)')

        rs1 = RobustScaler(quantile_range=(25, 75))
        scaled_data1 = rs1.fit_transform(data)

        ax[1, 0].scatter(scaled_data1[:, 0], scaled_data1[:, 1])
        ax[1, 0].set_xlim([-10, 10])
        ax[1, 0].set_ylim([-10, 10])
        ax[1, 0].grid()
        ax[1, 0].set_xlabel('X')
        ax[1, 0].set_ylabel('Y')
        ax[1, 0].set_title('Scaled data (25% - 75%)')

        rs2 = RobustScaler(quantile_range=(30, 65))
        scaled_data2 = rs2.fit_transform(data)

        ax[1, 1].scatter(scaled_data2[:, 0], scaled_data2[:, 1])
        ax[1, 1].set_xlim([-10, 10])
        ax[1, 1].set_ylim([-10, 10])
        ax[1, 1].grid()
        ax[1, 1].set_xlabel('X')
        ax[1, 1].set_ylabel('Y')
        ax[1, 1].set_title('Scaled data (30% - 65%)')

        plt.show()

        [Figure: raw data vs. StandardScaler output, and RobustScaler outputs for the three quantile ranges]

        Samples can also be normalized with the max, L1, and L2 norms, defined as follows: each sample is divided by, respectively, its largest absolute component (max), the sum of the absolute values of its components (L1), or the square root of the sum of their squares (L2).
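
      A minimal sketch of these three options via sklearn.preprocessing.Normalizer (the sample values are arbitrary):

    import numpy as np
    from sklearn.preprocessing import Normalizer

    x = np.array([[1.0, 2.0]])

    # Each variant divides the sample by a different norm
    for norm in ('max', 'l1', 'l2'):
        nz = Normalizer(norm=norm)
        print(norm, nz.fit_transform(x))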

      Feature selection and filtering: filter out variables that carry little information:

    from __future__ import print_function

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.feature_selection import VarianceThreshold

    # For reproducibility
    np.random.seed(1000)

    if __name__ == '__main__':
        # Create a dummy dataset: two high-variance features, one low-variance
        X = np.ndarray(shape=(100, 3))

        X[:, 0] = np.random.normal(0.0, 5.0, size=100)
        X[:, 1] = np.random.normal(0.5, 5.0, size=100)
        X[:, 2] = np.random.normal(1.0, 0.5, size=100)

        # Show the dataset
        fig, ax = plt.subplots(1, 1, figsize=(12, 8))
        ax.grid()
        ax.set_xlabel('X')
        ax.set_ylabel('Y')

        ax.plot(X[:, 0], label='STD = 5.0')
        ax.plot(X[:, 1], label='STD = 5.0')
        ax.plot(X[:, 2], label='STD = 0.5')

        plt.legend()
        plt.show()

        # Impose a variance threshold
        print('Samples before variance thresholding')
        print(X[0:10, :])

        vt = VarianceThreshold(threshold=1.5)
        X_t = vt.fit_transform(X)

        # After the filter has removed the components below the threshold
        print('Samples after variance thresholding')
        print(X_t[0:10, :])

          [Figure: the three features plotted by sample index; the third one varies much less than the other two]

          The third column is the one with the smallest variance (below the 1.5 threshold), so it is the one that gets removed.
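
      To confirm programmatically which columns survived the filter, the fitted selector exposes get_support(); a quick check using the vt object from the snippet above:

    # Boolean mask over the original columns: True = kept
    print(vt.get_support())   # expected: [ True  True False]
    # Per-column variances computed during fit
    print(vt.variances_)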

     Principal component analysis on the digits dataset (the 8x8 MNIST-style digits bundled with scikit-learn):

    from __future__ import print_function

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # For reproducibility
    np.random.seed(1000)

    if __name__ == '__main__':
        # Load the digits dataset (1797 8x8 images)
        digits = load_digits()

        # Show some random digits
        selection = np.random.randint(0, 1797, size=100)

        fig, ax = plt.subplots(10, 10, figsize=(10, 10))

        samples = [digits.data[x].reshape((8, 8)) for x in selection]

        for i in range(10):
            for j in range(10):
                ax[i, j].set_axis_off()
                ax[i, j].imshow(samples[(i * 10) + j], cmap='gray')

        plt.show()

        # Perform a PCA on the digits dataset
        # (pixel values range from 0 to 16 in this dataset)
        pca = PCA(n_components=36, whiten=True)
        X_pca = pca.fit_transform(digits.data / 16.0)

        print('Explained variance ratio')
        print(pca.explained_variance_ratio_)

        # Plot the explained variance ratio and its cumulative sum
        fig, ax = plt.subplots(1, 2, figsize=(16, 6))

        ax[0].set_xlabel('Component')
        ax[0].set_ylabel('Variance ratio (%)')
        ax[0].bar(np.arange(36), pca.explained_variance_ratio_ * 100.0)

        ax[1].set_xlabel('Component')
        ax[1].set_ylabel('Cumulative variance (%)')
        ax[1].bar(np.arange(36), np.cumsum(pca.explained_variance_ratio_) * 100.0)

        plt.show()

        # Rebuild the digits from the PCA projection and show the result
        fig, ax = plt.subplots(10, 10, figsize=(10, 10))

        samples = [pca.inverse_transform(X_pca[x]).reshape((8, 8)) for x in selection]

        for i in range(10):
            for j in range(10):
                ax[i, j].set_axis_off()
                ax[i, j].imshow(samples[(i * 10) + j], cmap='gray')

        plt.show()

           Original images: 64 dimensions each, i.e. 8x8 pixel grids.

         [Figure: 100 randomly selected original digits]

          Explained variance ratio and cumulative variance:

        [Figure: per-component explained variance ratio and its cumulative sum]

        pca = PCA(n_components=36, whiten=True)

        X_pca = pca.fit_transform(digits.data / 16.0)

        samples = [pca.inverse_transform(X_pca[x]).reshape((8, 8)) for x in selection]

          The images after inverse-transforming back into the original space:

        [Figure: the same 100 digits reconstructed from 36 principal components]
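
      As a side note, instead of fixing n_components=36 up front, PCA can also pick the number of components from a target cumulative variance when n_components is a float in (0, 1); a small sketch on the same digits data:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    digits = load_digits()

    # Keep the smallest number of components whose cumulative
    # explained variance reaches 95% (requires svd_solver='full')
    pca = PCA(n_components=0.95, svd_solver='full')
    X_pca = pca.fit_transform(digits.data / 16.0)
    print(X_pca.shape[1], 'components retained')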

         Non-negative matrix factorization (NMF): applicable when the dataset consists of non-negative elements. An example on the iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import NMF

    iris = load_iris()
    iris.data.shape

    # Fit an NMF model; n_components=3, init='random' and l1_ratio=0.1
    # are illustrative choices (the original snippet omitted this step)
    nmf = NMF(n_components=3, init='random', l1_ratio=0.1)
    Xt = nmf.fit_transform(iris.data)

    iris.data[0:5]
    Xt[0:5]
    # Project the factorized samples back to the original space
    nmf.inverse_transform(Xt[0:5])
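
      The quality of the factorization can be checked on the fitted model through its reconstruction_err_ attribute:

    # Frobenius-norm error between the data matrix and the
    # product of the two learned factors
    print(nmf.reconstruction_err_)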


          Kernel PCA: can handle datasets that are not linearly separable by working in a kernel-induced feature space; e.g. with the RBF kernel K(x_i, x_j) = exp(-γ‖x_i − x_j‖²):

    from __future__ import print_function

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import make_blobs  # the samples_generator module is deprecated
    from sklearn.decomposition import KernelPCA

    # For reproducibility
    np.random.seed(1000)

    if __name__ == '__main__':
        # Create a dummy dataset
        Xb, Yb = make_blobs(n_samples=500, centers=3, n_features=3)

        # Show the dataset (first two features)
        fig, ax = plt.subplots(1, 1, figsize=(8, 8))
        ax.scatter(Xb[:, 0], Xb[:, 1])
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.grid()

        plt.show()

        # Perform a kernel PCA (with a radial basis function kernel)
        kpca = KernelPCA(n_components=2, kernel='rbf', fit_inverse_transform=True)
        X_kpca = kpca.fit_transform(Xb)

        # Plot the dataset after kernel PCA
        fig, ax = plt.subplots(1, 1, figsize=(8, 8))
        ax.scatter(X_kpca[:, 0], X_kpca[:, 1])
        ax.set_xlabel('First component')
        ax.set_ylabel('Second component')
        ax.grid()

        plt.show()

          [Figure: the blobs dataset before and after the kernel PCA projection]

      Atom extraction and dictionary learning: similar in spirit to PCA; samples are rebuilt as sparse combinations of atoms from a learned dictionary.

    from __future__ import print_function
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import DictionaryLearning

    # For reproducibility
    np.random.seed(1000)

    if __name__ == '__main__':
        # Load the digits dataset
        digits = load_digits()

        # Perform dictionary learning (atom extraction);
        # only the first 35 digits are used to keep the fit fast
        dl = DictionaryLearning(n_components=36, fit_algorithm='lars', transform_algorithm='lasso_lars')
        X_dict = dl.fit_transform(digits.data[0:35])

        # Show the 36 atoms that have been extracted
        fig, ax = plt.subplots(6, 6, figsize=(8, 8))

        samples = [dl.components_[x].reshape((8, 8)) for x in range(36)]

        for i in range(6):
            for j in range(6):
                ax[i, j].set_axis_off()
                ax[i, j].imshow(samples[(i * 6) + j], cmap='gray')

        plt.show()

        Result:

          [Figure: the 36 extracted atoms rendered as 8x8 images]

  • Original article: https://www.cnblogs.com/bai2018/p/10579754.html