  • Python Machine Learning (26): Loading Datasets with Sklearn

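    Machine learning is a branch of computer science that studies algorithms able to learn on their own, without human intervention.

    scikit-learn (sklearn) is positioned as a general-purpose machine learning library, whereas TensorFlow (tf) is positioned mainly as a deep learning library.

    The first step in data science is usually loading the data, so we start by learning how to load datasets with scikit-learn.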

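    Datasets usually come from one of two sources:

    • Data you prepare yourself
    • Data obtained from a third party

    Unless you are a researcher, you will usually get your data from a third party. Several websites offer datasets.

    This page lists many places where datasets are shared: https://www.kdnuggets.com/datasets/index.html.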

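    Note: scikit-learn is one of the SciKits (short for SciPy Toolkits), add-on packages built on top of the SciPy library, which is where the name comes from. Besides scikit-learn there are many other SciKits. scikit-learn is the one dedicated to machine learning and data mining.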

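    scikit-learn also ships with a few built-in datasets, which we can try loading.

    First import the datasets module from sklearn, then call its load_digits() function to load the data:

    Data loading code: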

    # Import the `datasets` module from `sklearn`
    from sklearn import datasets

    # Load the `digits` dataset
    digits = datasets.load_digits()

    # Print the `digits` data
    print(digits)

    Output:

    {'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ..., 10.,  0.,  0.],
           [ 0.,  0.,  0., ..., 16.,  9.,  0.],
           ...,
           [ 0.,  0.,  1., ...,  6.,  0.,  0.],
           [ 0.,  0.,  2., ..., 12.,  0.,  0.],
           [ 0.,  0., 10., ..., 12.,  1.,  0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'images': array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],
            [ 0.,  0., 13., ..., 15.,  5.,  0.],
            [ 0.,  3., 15., ..., 11.,  8.,  0.],
            ...,
            [ 0.,  4., 11., ..., 12.,  7.,  0.],
            [ 0.,  2., 14., ..., 12.,  0.,  0.],
            [ 0.,  0.,  6., ...,  0.,  0.,  0.]],
           [[ 0.,  0.,  0., ...,  5.,  0.,  0.],
            [ 0.,  0.,  0., ...,  9.,  0.,  0.],
            [ 0.,  0.,  3., ...,  6.,  0.,  0.],
            ...,
            [ 0.,  0.,  1., ...,  6.,  0.,  0.],
            [ 0.,  0.,  1., ...,  6.,  0.,  0.],
            [ 0.,  0.,  0., ..., 10.,  0.,  0.]],
           [[ 0.,  0.,  0., ..., 12.,  0.,  0.],
            [ 0.,  0.,  3., ..., 14.,  0.,  0.],
            [ 0.,  0.,  8., ..., 16.,  0.,  0.],
            ...,
            [ 0.,  9., 16., ...,  0.,  0.,  0.],
            [ 0.,  3., 13., ..., 11.,  5.,  0.],
            [ 0.,  0.,  0., ..., 16.,  9.,  0.]],
           ...,
           [[ 0.,  0.,  1., ...,  1.,  0.,  0.],
            [ 0.,  0., 13., ...,  2.,  1.,  0.],
            [ 0.,  0., 16., ..., 16.,  5.,  0.],
            ...,
            [ 0.,  0., 16., ..., 15.,  0.,  0.],
            [ 0.,  0., 15., ..., 16.,  0.,  0.],
            [ 0.,  0.,  2., ...,  6.,  0.,  0.]],
           [[ 0.,  0.,  2., ...,  0.,  0.,  0.],
            [ 0.,  0., 14., ..., 15.,  1.,  0.],
            [ 0.,  4., 16., ..., 16.,  7.,  0.],
            ...,
            [ 0.,  0.,  0., ..., 16.,  2.,  0.],
            [ 0.,  0.,  4., ..., 16.,  2.,  0.],
            [ 0.,  0.,  5., ..., 12.,  0.,  0.]],
           [[ 0.,  0., 10., ...,  1.,  0.,  0.],
            [ 0.,  2., 16., ...,  1.,  0.,  0.],
            [ 0.,  0., 15., ..., 15.,  0.,  0.],
            ...,
            [ 0.,  4., 16., ..., 16.,  6.,  0.],
            [ 0.,  8., 16., ..., 16.,  8.,  0.],
            [ 0.,  1.,  8., ..., 12.,  1.,  0.]]]), 'DESCR': ".. _digits_dataset:
    
    Optical recognition of handwritten digits dataset
    --------------------------------------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 5620
        :Number of Attributes: 64
        :Attribute Information: 8x8 image of integer pixels in the range 0..16.
        :Missing Attribute Values: None
        :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
        :Date: July; 1998
    
    This is a copy of the test set of the UCI ML hand-written digits datasets
    https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
    
    The data set contains images of hand-written digits: 10 classes where
    each class refers to a digit.
    
    Preprocessing programs made available by NIST were used to extract
    normalized bitmaps of handwritten digits from a preprinted form. From a
    total of 43 people, 30 contributed to the training set and different 13
    to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
    4x4 and the number of on pixels are counted in each block. This generates
    an input matrix of 8x8 where each element is an integer in the range
    0..16. This reduces dimensionality and gives invariance to small
    distortions.
    
    For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
    T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
    L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
    1994.
    
    .. topic:: References
    
      - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
        Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
        Graduate Studies in Science and Engineering, Bogazici University.
      - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
      - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
        Linear dimensionalityreduction using relevance weighted LDA. School of
        Electrical and Electronic Engineering Nanyang Technological University.
        2005.
      - Claudio Gentile. A New Approximate Maximal Margin Classification
        Algorithm. NIPS. 2000."}
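
    Printing the whole object at once is hard to read. As a minimal sketch (not part of the original post), the fields shown in the output above can be inspected one at a time. The bundled copy holds 1797 samples, the UCI test set, even though the DESCR text above quotes 5620 instances for the full dataset.

```python
from sklearn import datasets

digits = datasets.load_digits()

# The returned object behaves like a dictionary (a scikit-learn Bunch).
print(digits.keys())

# One row per image, one column per pixel of the flattened 8x8 image.
print(digits.data.shape)    # (1797, 64)

# One label (the digit 0-9) per image.
print(digits.target.shape)  # (1797,)

# The same pixel values, kept as 8x8 arrays.
print(digits.images.shape)  # (1797, 8, 8)
```

    The `data` array is what is usually passed to an estimator, while `images` is convenient for plotting.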

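    The datasets module also provides functions for fetching other popular datasets; for example, datasets.fetch_openml downloads datasets from the OpenML repository.

    The dataset used in the example above can also be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/

    Code: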
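
    A rough sketch of what a fetch_openml call might look like. The dataset name "iris" and version 1 are chosen here only for illustration and are not taken from the original post; the helper name `fetch_iris_from_openml` is hypothetical. The call needs network access.

```python
from sklearn.datasets import fetch_openml


def fetch_iris_from_openml():
    """Download the 'iris' dataset (version 1) from OpenML.

    Illustrative helper (hypothetical name); requires network access.
    """
    # return_X_y=True returns a (data, target) pair instead of a Bunch object.
    return fetch_openml(name="iris", version=1, return_X_y=True)
```

    Calling `X, y = fetch_iris_from_openml()` downloads the data on first use; by default scikit-learn caches the download locally, so later calls should not need to fetch it again.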

    # Import the `pandas` library
    import pandas as pd

    # Load the dataset with `read_csv()`; the file has no header row
    digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

    # Print the `digits` data
    print(digits)

    Output:

             0   1   2   3   4   5   6   7   8  ...  56  57  58  59  60  61  62  63  64
    0        0   1   6  15  12   1   0   0   0  ...   0   0   6  14   7   1   0   0   0
    1        0   0  10  16   6   0   0   0   0  ...   0   0  10  16  15   3   0   0   0
    2        0   0   8  15  16  13   0   0   0  ...   0   0   9  14   0   0   0   0   7
    3        0   0   0   3  11  16   0   0   0  ...   0   0   0   1  15   2   0   0   4
    4        0   0   5  14   4   0   0   0   0  ...   0   0   4  12  14   7   0   0   6
    ..      ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..
    3818     0   0   5  13  11   2   0   0   0  ...   0   0   8  13  15  10   1   0   9
    3819     0   0   0   1  12   1   0   0   0  ...   0   0   0   4   9   0   0   0   4
    3820     0   0   3  15   0   0   0   0   0  ...   0   0   4  14  16   9   0   0   6
    3821     0   0   6  16   2   0   0   0   0  ...   0   0   5  16  16  16   5   0   6
    3822     0   0   2  15  16  13   1   0   0  ...   0   0   4  14   1   0   0   0   7

    [3823 rows x 65 columns]

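    Note that the file in the download URL above has the extension .tra, which marks it as the training dataset. The same page also offers a .tes file, which is the test dataset, so this data comes already split into training and test sets. The example above loads only the training set.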
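
    The loaded frame has 65 columns: 64 pixel counts followed by the digit label in the last column. A minimal sketch of separating the two is shown here; the helper name `split_features_labels` is hypothetical, and the stand-in frame uses made-up values so the example runs without a download.

```python
import numpy as np
import pandas as pd


def split_features_labels(frame):
    """Split an optdigits-style frame into pixel features and digit labels."""
    features = frame.iloc[:, :-1]  # every column except the last: 64 pixel counts
    labels = frame.iloc[:, -1]     # last column: the digit class
    return features, labels


# Stand-in data with the same 65-column layout (made-up values, not real digits).
demo = pd.DataFrame(np.zeros((3, 65), dtype=int))
demo[64] = [0, 7, 4]

X, y = split_features_labels(demo)
print(X.shape)     # (3, 64)
print(y.tolist())  # [0, 7, 4]
```

    The same function can be applied to the `digits` frame returned by `read_csv()` above, and likewise to a frame loaded from the .tes file.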

  • Original post: https://www.cnblogs.com/huanghanyu/p/13158539.html