Hub --- a warehouse for the fuel (data) of machine learning

Hub

https://www.activeloop.ai/

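The tool's home page opens with a statement that is also its vision:

Train models without being weighed down by data.

The problem in ML today is that data preparation consumes too many resources, and handling heterogeneous data is especially troublesome.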

Train ML models,
don't mess with data

Fast and simple framework for building and scaling data pipelines for machine learning

Too many resources are spent on setting up the data pipelines; rapid iterations of machine learning experiments will result in models with superhuman accuracy.

https://github.com/activeloopai/hub

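Data scientists and ML researchers spend most of their time on data management and preprocessing.

Hub is meant to solve this problem.

It stores your dataset as a single numpy-like array in the cloud, at sizes up to the petabyte scale, so you can seamlessly access and use the data from any machine.

Hub makes any type of data stored in the cloud, including images, audio, and video, usable as quickly as if it were stored locally.

With a shared view of the same dataset, team members can easily stay in sync and quickly understand and use the data.

It integrates with PyTorch and TensorFlow.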
What is Hub for?

Software 2.0 needs Data 2.0, and Hub delivers it. Most of the time Data Scientists/ML researchers work on data management and preprocessing instead of training models. With Hub, we are fixing this. We store your (even petabyte-scale) datasets as single numpy-like array on the cloud, so you can seamlessly access and work with it from any machine. Hub makes any data type (images, text files, audio, or video) stored in cloud usable as fast as if it were stored on premise. With same dataset view, your team can always be in sync.

Hub is being used by Waymo, Red Cross, World Resources Institute, Omdena, and others.

Features

Store and retrieve large datasets with version-control

Collaborate as in Google Docs: Multiple data scientists working on the same data in sync with no interruptions

Access from multiple machines simultaneously

Deploy anywhere - locally, on Google Cloud, S3, Azure as well as Activeloop (by default - and for free!)

Integrate with your ML tools like Numpy, Dask, Ray, PyTorch, or TensorFlow

Create arrays as big as you want. You can store images as big as 100k by 100k!

Keep shape of each sample dynamic. This way you can store small and big arrays as 1 array.

Visualize any slice of the data in a matter of seconds without redundant manipulations
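Shortcoming of the ML workflow - missing data management (WHY HUB?)

For example, sklearn only provides a loading interface for its sample images; to use your own data, you have to manage it yourself.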

load_sample_image

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_sample_image.html

Load the numpy array of a single sample image

Read more in the User Guide.
    >>> from sklearn.datasets import load_sample_image
    >>> china = load_sample_image('china.jpg')
    >>> china.dtype
    dtype('uint8')
    >>> china.shape
    (427, 640, 3)
    >>> flower = load_sample_image('flower.jpg')
    >>> flower.dtype
    dtype('uint8')
    >>> flower.shape
    (427, 640, 3)
To load any other external data you have to use other tools, and each data format calls for a different one.
scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.

Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulation and conversion into a numeric array suitable for scikit-learn.

scipy.io specializes in binary formats often used in scientific computing context such as .mat and .arff

numpy/routines.io for standard loading of columnar data into numpy arrays

scikit-learn’s datasets.load_svmlight_file for the svmlight or libSVM sparse format

scikit-learn’s datasets.load_files for directories of text files where the name of each directory is the name of each category and each file inside of each directory corresponds to one sample from that category

For some miscellaneous data such as images, videos, and audio, you may wish to refer to:

skimage.io or Imageio for loading images and videos into numpy arrays

scipy.io.wavfile.read for reading WAV files into a numpy array

Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to numerical features using OneHotEncoder or OrdinalEncoder or similar. See Preprocessing data.
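The last step in the quoted passage, turning string categories into numeric features, can be sketched by hand. This is a minimal illustration in plain NumPy, not scikit-learn's OneHotEncoder; the CSV content and the column names are made up for the example.

```python
import io

import numpy as np

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = """height,weight,color
1.5,10.0,red
2.0,12.5,green
1.8,11.0,red
"""

# Read the two numeric columns as floats and the categorical column as strings.
numeric = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                        skip_header=1, usecols=(0, 1))
colors = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       skip_header=1, usecols=2, dtype=str)

# One-hot encode by hand: map each string to a category index,
# then pick the matching rows of an identity matrix.
categories, codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Final feature matrix: numeric columns followed by one column per category.
X = np.hstack([numeric, one_hot])
print(categories)
print(X)
```

In a real project the encoder classes named in the quote are the better choice, since they remember the category order between training and prediction.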
Sample demo - Color Quantization using K-Means

Load the sample data and run clustering on it.

https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py
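To give a feel for what color quantization with k-means does, here is a rough sketch of Lloyd's iteration in plain NumPy. It is not the code of the linked scikit-learn example: the function names, the synthetic random image, and the parameter values are all assumptions chosen to keep the example self-contained.

```python
import numpy as np


def nearest_color(pixels, palette):
    """Index of the closest palette color (squared Euclidean distance) per pixel."""
    distances = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


def quantize_colors(image, n_colors, n_iter=10, seed=0):
    """Reduce an (H, W, 3) float image to at most n_colors colors with k-means."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3)
    # Initial palette: n_colors distinct pixel positions chosen at random.
    palette = pixels[rng.choice(len(pixels), size=n_colors, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: attach every pixel to its nearest palette color.
        labels = nearest_color(pixels, palette)
        # Update step: move each palette color to the mean of its pixels.
        for k in range(n_colors):
            members = pixels[labels == k]
            if len(members) > 0:  # an empty cluster keeps its previous color
                palette[k] = members.mean(axis=0)
    # Final assignment against the updated palette.
    labels = nearest_color(pixels, palette)
    return palette[labels].reshape(image.shape), palette


# Synthetic 32x32 RGB image with values in [0, 1), a stand-in for a real photo.
image = np.random.default_rng(42).random((32, 32, 3))
quantized, palette = quantize_colors(image, n_colors=4)
n_unique = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(quantized.shape, n_unique)
```

The quantized image has the same shape as the input but contains no more than n_colors distinct colors, which is the whole point of the technique.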

Image data preprocessing

https://www.cnblogs.com/ningskyer/articles/7606174.html
    from sklearn.linear_model import LogisticRegression
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import cv2

    # Images are read and resized with OpenCV, since scipy.misc.imread and
    # scipy.misc.imresize are no longer available in current SciPy.

    def image2Digit(image):
        # Resize to 8*8
        im_resized = cv2.resize(image, (8, 8), interpolation=cv2.INTER_AREA)
        # Convert the color image (BGR, as returned by cv2.imread) to grayscale
        im_gray = cv2.cvtColor(im_resized, cv2.COLOR_BGR2GRAY)
        # Rescale pixel values from 0-255 to 0-16, the feature range of the digits data
        im_scaled = im_gray * (16 / 255)
        # Invert the image: the digits training data is white digits on a black background
        im_reverse = 16 - im_scaled
        return im_reverse.astype(int)

    # Load the digits data
    digits = datasets.load_digits()
    # Split into training and validation sets
    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2)
    # Create the model (L2 regularization is the default)
    clf = LogisticRegression(max_iter=1000)
    # Fit the training data
    clf.fit(Xtrain, ytrain)
    # Predict on the validation set
    ypred = clf.predict(Xtest)
    # Compute the accuracy
    accuracy = accuracy_score(ytest, ypred)
    print("Recognition accuracy:", accuracy)

    # Read a single custom image of a handwritten digit
    image = cv2.imread("digit_image/2.png")
    # Convert the image to the digits data format, so that both use the same representation
    im_reverse = image2Digit(image)
    # Show the pixel values after conversion
    print(im_reverse)
    # Reshape 8*8 to 1*64 (the input shape the predict method expects)
    reshaped = im_reverse.reshape(1, 64)
    # Predict
    result = clf.predict(reshaped)
    print(result)
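The scaling and inversion steps inside image2Digit are easy to check on a few pixel values. The small patch below is a hypothetical example, not data from the original post; it only shows the arithmetic, assuming 8-bit grayscale input.

```python
import numpy as np

# Hypothetical 2x2 grayscale patch: white paper (255), grays (128, 64), black ink (0).
gray = np.array([[255, 128], [0, 64]], dtype=np.uint8)

# Rescale 0..255 to the 0..16 range used by the digits dataset.
scaled = gray.astype(float) * 16 / 255
# Invert, so that dark ink on white paper becomes a bright digit on a dark background.
inverted = 16 - scaled
# astype(int) truncates the fractional part, as in image2Digit.
digits_like = inverted.astype(int)
print(digits_like)
# prints:
# [[ 0  7]
#  [16 11]]
```

White paper ends up as 0 and black ink as 16, which matches the white-on-black convention of the digits data.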
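Understanding (Personal Thoughts)

Most of the data that models consume comes in numpy format, because the numpy array is the basic data type of machine learning.

Raw data always has to be converted, processed, and cleaned before it can serve as input for model training. This data cleaning and preprocessing stage is the most time-consuming one.

In many cases, however, this preprocessing has to be repeated again and again, because the result of the raw-to-numpy conversion is not stored anywhere.

That wastes both manpower and compute.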
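The cost of repeated preprocessing is easy to see in a small local sketch: once the converted array is saved, later runs can load it instead of recomputing it. The function names, the file name, and the toy data here are assumptions for illustration, and this only covers one machine; sharing across machines is where a tool like Hub comes in.

```python
import os
import tempfile

import numpy as np


def expensive_preprocess(raw):
    """Stand-in for a slow raw-to-array conversion (hypothetical)."""
    return (np.asarray(raw, dtype=np.float32) / 255.0).reshape(-1, 2)


def load_or_build(cache_path, raw):
    """Return the preprocessed array and whether it came from the cache."""
    if os.path.exists(cache_path):
        return np.load(cache_path), True
    features = expensive_preprocess(raw)
    np.save(cache_path, features)  # store the result so the work is done only once
    return features, False


cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
raw = [0, 51, 102, 153, 204, 255]

first, first_from_cache = load_or_build(cache_path, raw)
second, second_from_cache = load_or_build(cache_path, raw)
print(first_from_cache, second_from_cache)  # prints: False True
```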

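Enter Hub

With this tool, the final result of data processing can be stored in the official cloud, in a private cloud, or even locally.

My understanding:

Python and Node.js both have official package repositories for code, and data processing has its typical databases; Hub arrives as the database for machine learning.

Model training should have a corresponding model repository as well.

Below is the official Hub management interface.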

https://app.activeloop.ai/datasets/explore

Loading data from Hub
    from hub import Dataset

    mnist = Dataset("activeloop/mnist")  # loading the MNIST data lazily
    # saving time with *compute* to retrieve just the necessary data
    mnist["image"][0:1000].compute()
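The idea of lazy loading, opening a dataset without reading it and then materializing only the slice you ask for, can be tried locally with NumPy's memmap. This is only an analogy under my own assumptions (a made-up file name and synthetic data); it says nothing about how Hub is implemented internally.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "images.dat")
shape = (1000, 28, 28)

# Write a hypothetical dataset to disk once; image i is filled with the value i % 256.
writer = np.memmap(path, dtype=np.uint8, mode="w+", shape=shape)
writer[:] = (np.arange(shape[0]) % 256).astype(np.uint8)[:, None, None]
writer.flush()
del writer

# Opening the file again does not load its contents into memory.
lazy = np.memmap(path, dtype=np.uint8, mode="r", shape=shape)
# Only the requested slice is copied into a regular in-memory array.
batch = np.array(lazy[0:10])
print(batch.shape, batch.dtype)
```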
Training a model
    from hub import Dataset
    import torch

    mnist = Dataset("activeloop/mnist")
    # converting MNIST to PyTorch format
    mnist = mnist.to_pytorch(lambda x: (x["image"], x["label"]))

    train_loader = torch.utils.data.DataLoader(mnist, batch_size=1, num_workers=0)

    for image, label in train_loader:
        # Training loop here
        pass
Creating a local dataset
    from hub import Dataset, schema
    import numpy as np

    ds = Dataset(
        "./data/dataset_name",  # file path to the dataset
        shape=(4,),  # follows numpy shape convention
        mode="w+",  # reading & writing mode
        schema={  # named blobs of data that may specify types
            # Tensor is a generic structure that can contain any type of data
            "image": schema.Tensor((512, 512), dtype="float"),
            "label": schema.Tensor((512, 512), dtype="float"),
        }
    )

    # filling the data containers with data (here - zeroes to initialize)
    ds["image"][:] = np.zeros((4, 512, 512))
    ds["label"][:] = np.zeros((4, 512, 512))
    ds.flush()  # executing the creation of the dataset
Upload your dataset and access it from anywhere in 3 simple steps

https://github.com/activeloopai/hub#upload-your-dataset-and-access-it-from-anywhere-in-3-simple-steps

Hub sphinx doc

https://docs.activeloop.ai/en/latest/index.html

Installing PyTorch

https://pytorch.org/get-started/previous-versions/
    # CPU only
    pip install torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
Installing Hub
    pip3 install hub
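Source: http://www.cnblogs.com/lightsong/ The copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be shown in a prominent place on the page.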

Original article: https://www.cnblogs.com/lightsong/p/14464580.html
