  • Hub: a repository for the fuel of machine learning (data)

    Hub

    https://www.activeloop.ai/

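    The tool's home page introduces it with a statement that is also its vision:

    Train models without being weighed down by data.

    The current problem in ML is that data preparation consumes too many resources, and handling heterogeneous data is especially painful.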

    Train ML models,
    don't mess with data

    Fast and simple framework for building and scaling data pipelines for machine learning

    Too many resources are spent on setting up the data

    pipelines, rapid iterations of machine learning experiments will result in models with superhuman accuracy.

    https://github.com/activeloopai/hub

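    Data scientists and ML researchers spend most of their time on data management and preprocessing.

    Hub aims to solve this problem.

    It stores your dataset as a single numpy-like array, up to petabyte scale, in the cloud, so you can seamlessly access and use the data from any machine.

    Hub makes any type of data stored in the cloud, including images, audio, and video, usable as fast as if it were stored on premise.

    Working from the same dataset view, team members can easily stay in sync and quickly understand and use the data.

    It integrates with PyTorch and TensorFlow.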
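
    The idea of treating a whole dataset as one array and reading only the slice you need can be illustrated locally with NumPy's memory-mapped arrays. This is only an analogy under stated assumptions, not how Hub is implemented: the file name, the dataset shape, and the fill values are all made up for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical dataset: 1,000 grayscale "images" of 28x28 pixels.
shape = (1000, 28, 28)
path = os.path.join(tempfile.mkdtemp(), "images.dat")

# Write the dataset once as a single on-disk array.
# Sample i is filled with the value i % 256 so slices are easy to check.
writer = np.memmap(path, dtype=np.uint8, mode="w+", shape=shape)
writer[:] = (np.arange(shape[0]) % 256).astype(np.uint8).reshape(-1, 1, 1)
writer.flush()
del writer

# Reopen read-only: the file is mapped, and data is paged in on demand.
images = np.memmap(path, dtype=np.uint8, mode="r", shape=shape)
batch = np.array(images[10:20])  # copy just these 10 samples into memory
print(batch.shape, batch[0, 0, 0])
```

    The script that consumes `images` uses ordinary NumPy slicing, which is roughly the experience the Hub description promises for data that lives in the cloud.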

    What is Hub for?

    Software 2.0 needs Data 2.0, and Hub delivers it. Most of the time Data Scientists/ML researchers work on data management and preprocessing instead of training models. With Hub, we are fixing this. We store your (even petabyte-scale) datasets as single numpy-like array on the cloud, so you can seamlessly access and work with it from any machine. Hub makes any data type (images, text files, audio, or video) stored in cloud usable as fast as if it were stored on premise. With same dataset view, your team can always be in sync.

    Hub is being used by Waymo, Red Cross, World Resources Institute, Omdena, and others.

     

    Features

    • Store and retrieve large datasets with version-control
    • Collaborate as in Google Docs: Multiple data scientists working on the same data in sync with no interruptions
    • Access from multiple machines simultaneously
    • Deploy anywhere - locally, on Google Cloud, S3, Azure as well as Activeloop (by default - and for free!)
    • Integrate with your ML tools like Numpy, Dask, Ray, PyTorch, or TensorFlow
    • Create arrays as big as you want. You can store images as big as 100k by 100k!
    • Keep shape of each sample dynamic. This way you can store small and big arrays as 1 array.
    • Visualize any slice of the data in a matter of seconds without redundant manipulations
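
    The "dynamic shape" feature, where small and big samples live in one array, can be pictured as a flat buffer plus per-sample offsets and shapes. The sketch below is one simple way to do that in NumPy; the `RaggedArray` class is hypothetical and is not taken from Hub.

```python
import numpy as np


class RaggedArray:
    """Stores samples of different shapes in one flat buffer (illustrative only)."""

    def __init__(self, samples, dtype=np.float32):
        self.shapes = [s.shape for s in samples]
        sizes = [s.size for s in samples]
        # offsets[i] is the position where sample i starts in the flat buffer.
        self.offsets = np.concatenate(([0], np.cumsum(sizes)))
        self.data = np.concatenate(
            [np.asarray(s, dtype=dtype).ravel() for s in samples]
        )

    def __len__(self):
        return len(self.shapes)

    def __getitem__(self, i):
        # Supports non-negative integer indices only, to keep the sketch short.
        start, stop = self.offsets[i], self.offsets[i + 1]
        return self.data[start:stop].reshape(self.shapes[i])


small = np.ones((2, 3))
big = np.arange(20).reshape(4, 5)
ragged = RaggedArray([small, big])
print(len(ragged), ragged[0].shape, ragged[1].shape)
```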

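    A weakness of the ML workflow - missing data management (WHY HUB?)

    For example, sklearn only provides a loading interface for its bundled sample images; to use your own data you have to manage it yourself.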

    load_sample_image

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_sample_image.html

    Load the numpy array of a single sample image

    Read more in the User Guide.

    >>> from sklearn.datasets import load_sample_image
    >>> china = load_sample_image('china.jpg')   
    >>> china.dtype                              
    dtype('uint8')
    >>> china.shape                              
    (427, 640, 3)
    >>> flower = load_sample_image('flower.jpg') 
    >>> flower.dtype                             
    dtype('uint8')
    >>> flower.shape                             
    (427, 640, 3)

    Loading any other external data requires other tools, with a different tool for each data format.

    scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.

    Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

    • pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulation and conversion into a numeric array suitable for scikit-learn.

    • scipy.io specializes in binary formats often used in scientific computing context such as .mat and .arff

    • numpy/routines.io for standard loading of columnar data into numpy arrays

    • scikit-learn’s datasets.load_svmlight_file for the svmlight or libSVM sparse format

    • scikit-learn’s datasets.load_files for directories of text files where the name of each directory is the name of each category and each file inside of each directory corresponds to one sample from that category

    For some miscellaneous data such as images, videos, and audio, you may wish to refer to:

    Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to numerical features using OneHotEncoder or OrdinalEncoder or similar. See Preprocessing data.
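
    As a small illustration of the last point, the sketch below one-hot encodes a string column with scikit-learn's OneHotEncoder. The color values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, shaped (n_samples, 1) as scikit-learn expects.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one column per category
```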

    Sample demo - Color Quantization using K-Means

    Load the sample data and cluster it.

    https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py
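
    A condensed sketch of the same idea follows. It is not the code from the linked example: the helper name `quantize_colors` is hypothetical, it fits K-Means on every pixel rather than on a random subsample, and it runs on a synthetic random image so that no image file is needed.

```python
import numpy as np
from sklearn.cluster import KMeans


def quantize_colors(image, n_colors=4, random_state=0):
    """Return a copy of `image` (H x W x 3, uint8) using at most n_colors colors."""
    height, width, channels = image.shape
    # Flatten to a list of pixels and scale to the 0-1 range.
    pixels = image.reshape(-1, channels).astype(np.float64) / 255.0
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=random_state)
    kmeans.fit(pixels)
    # Replace every pixel with the center of the cluster it was assigned to.
    quantized = kmeans.cluster_centers_[kmeans.labels_]
    quantized = quantized.reshape(height, width, channels) * 255
    return quantized.round().astype(np.uint8)


# Synthetic stand-in image so the sketch runs without any image files.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
result = quantize_colors(image, n_colors=4)
print(result.shape, len(np.unique(result.reshape(-1, 3), axis=0)))
```

    Passing the array returned by load_sample_image('china.jpg') instead of the synthetic image should quantize the sample photo, although fitting on all of its pixels may be slow.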

    Image data preprocessing

    https://www.cnblogs.com/ningskyer/articles/7606174.html

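    from sklearn.linear_model import LogisticRegression
    from sklearn import datasets
    # train_test_split now lives in sklearn.model_selection (sklearn.cross_validation was removed)
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import cv2


    def image2Digit(image):
        # Resize to 8x8 (scipy.misc.imresize was removed from SciPy, so OpenCV is used instead)
        im_resized = cv2.resize(image, (8, 8), interpolation=cv2.INTER_AREA)
        # Convert the 3-channel BGR image to a single-channel grayscale image
        im_gray = cv2.cvtColor(im_resized, cv2.COLOR_BGR2GRAY)
        # Rescale pixel values from 0-255 to 0-16, the feature range of the digits data
        im_scaled = im_gray * (16 / 255)
        # Invert the image (the digits data is a white digit on a black background)
        im_reverse = 16 - im_scaled
        # np.int was removed from NumPy; the built-in int does the same job
        return im_reverse.astype(int)

    # Load the digits data
    digits = datasets.load_digits()
    # Split into training and validation sets
    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2)
    # Create the model (max_iter raised to reduce solver convergence warnings)
    clf = LogisticRegression(penalty='l2', max_iter=1000)
    # Fit the model on the training data
    clf.fit(Xtrain, ytrain)
    # Predict on the validation set
    ypred = clf.predict(Xtest)
    # Compute the accuracy
    accuracy = accuracy_score(ytest, ypred)
    print("Recognition accuracy:", accuracy)

    # Read a single custom handwritten-digit image (cv2.imread returns BGR, or None on failure)
    image = cv2.imread("digit_image/2.png")
    if image is None:
        raise FileNotFoundError("digit_image/2.png could not be read")
    # Convert the image to the format of the digits training data: the representation must match
    im_reverse = image2Digit(image)
    # Show the pixel values after conversion
    print(im_reverse)
    # Reshape 8x8 to 1x64 (the shape the predict method expects)
    reshaped = im_reverse.reshape(1, 64)
    # Predict
    result = clf.predict(reshaped)
    print(result)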

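    Understanding (Personal Thoughts)

    Most of the data consumed by models comes in numpy format, because the numpy array is the basic data type of machine learning.

    Raw data always has to be transformed, processed, and cleaned before it can serve as input for model training. This data cleaning and preprocessing stage is the most time-consuming one.

    In many situations, however, this preprocessing has to be repeated again and again, because the result of the raw-to-numpy conversion is never stored.

    That wastes both human effort and compute.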
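
    The simplest local remedy for this repetition is to store the converted array once and reload it afterwards. The sketch below does that with np.save and np.load; the `preprocess` function is a hypothetical stand-in for an expensive conversion, and the cache path is a temporary file chosen for the example.

```python
import os
import tempfile

import numpy as np


def preprocess(raw):
    """Stand-in for an expensive raw-to-numpy conversion (hypothetical)."""
    return (np.asarray(raw, dtype=np.float32) - 127.5) / 127.5


def load_or_build(cache_path, raw):
    """Reuse the stored array if it exists; otherwise compute and store it."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = preprocess(raw)
    np.save(cache_path, features)
    return features


cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
raw = [[0, 255], [51, 204]]
first = load_or_build(cache_path, raw)   # computes the array and writes the cache
second = load_or_build(cache_path, raw)  # reads the cached file instead
print(np.array_equal(first, second))
```

    A local cache like this is tied to one machine, which is the gap a shared store such as Hub is meant to close.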

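    Enter Hub

    With this tool, the final result of data processing can be stored in the official cloud, in a private cloud, or even locally.

    My understanding:

    Python and Node.js both have official package repositories for code, and data processing has its typical databases; Hub is the database that has emerged for machine learning.

    Model training should likewise have a corresponding model repository.

    Below is the official Hub management interface.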

    https://app.activeloop.ai/datasets/explore

     

    Loading data from Hub

    from hub import Dataset
    
    mnist = Dataset("activeloop/mnist")  # loading the MNIST data lazily
    # saving time with *compute* to retrieve just the necessary data
    mnist["image"][0:1000].compute()

    Training a model

    from hub import Dataset
    import torch
    
    mnist = Dataset("activeloop/mnist")
    # converting MNIST to PyTorch format
    mnist = mnist.to_pytorch(lambda x: (x["image"], x["label"]))
    
    train_loader = torch.utils.data.DataLoader(mnist, batch_size=1, num_workers=0)
    
    for image, label in train_loader:
        # Training loop here
        pass  # placeholder so the loop body is valid Python

    Creating a local dataset

    from hub import Dataset, schema
    import numpy as np
    
    ds = Dataset(
        "./data/dataset_name",  # file path to the dataset
        shape = (4,),  # follows numpy shape convention
        mode = "w+",  # reading & writing mode
        schema = {  # named blobs of data that may specify types
        # Tensor is a generic structure that can contain any type of data
            "image": schema.Tensor((512, 512), dtype="float"),
            "label": schema.Tensor((512, 512), dtype="float"),
        }
    )
    
    # filling the data containers with data (here - zeroes to initialize)
    ds["image"][:] = np.zeros((4, 512, 512))
    ds["label"][:] = np.zeros((4, 512, 512))
    ds.flush()  # executing the creation of the dataset

    Upload your dataset and access it from anywhere in 3 simple steps

    https://github.com/activeloopai/hub#upload-your-dataset-and-access-it-from-anywhere-in-3-simple-steps

    Hub sphinx doc

    https://docs.activeloop.ai/en/latest/index.html

    Installing PyTorch

    https://pytorch.org/get-started/previous-versions/

    # CPU only
    pip install torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

    Installing Hub

    pip3 install hub
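    Source: http://www.cnblogs.com/lightsong/ Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page.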
  • Original article: https://www.cnblogs.com/lightsong/p/14464580.html