Gluon Datasets and DataLoader

    mxnet.recordio

    MXRecordIO
    Reads/writes RecordIO data format, supporting sequential read and write.

    import mxnet as mx

    record = mx.recordio.MXRecordIO('tmp.rec', 'w')
    for i in range(5):
        record.write(('record_%d' % i).encode('utf-8'))  # write() expects bytes in Python 3
    record.close()

    record = mx.recordio.MXRecordIO('tmp.rec', 'r')
    for i in range(5):
        item = record.read()  # read() returns bytes
        print(item)
    record.close()

    b'record_0'
    b'record_1'
    b'record_2'
    b'record_3'
    b'record_4'
    

    MXIndexedRecordIO
    Reads/writes RecordIO data format, supporting random access.

    record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')
    for i in range(5):
        record.write_idx(i, ('record_%d' % i).encode('utf-8'))  # int key, bytes payload
    record.close()

    record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'r')
    record.read_idx(3)
    b'record_3'
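
    Because access goes through the index, records can also be read back in any order, not just sequentially. Continuing with the reader opened above:

    for i in [4, 0, 2]:
        print(record.read_idx(i))
    record.close()

    b'record_4'
    b'record_0'
    b'record_2'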
    

    IRHeader
    An alias for HEADER. Used to store metadata (e.g. labels) accompanying a record.
    Parameters:

    • flag (int) – Available for convenience, can be set arbitrarily.
    • label (float or an array of float) – Typically used to store label(s) for a record.
    • id (int) – Usually a unique id representing the record.
    • id2 (int) – Higher order bits of the unique id, should be set to 0 (in most cases).

    pack(header, s)
    Pack a string into MXImageRecord.

    label = 4 # label can also be a 1-D array, for example: label = [1,2,3]
    id = 2574
    header = mx.recordio.IRHeader(0, label, id, 0)
    with open(path, 'rb') as file:  # 'rb': pack() needs a bytes payload; path points at the file to pack
        s = file.read()
    packed_s = mx.recordio.pack(header, s)
    

    unpack(s)
    Unpack an MXImageRecord to a string.

    record = mx.recordio.MXRecordIO('test.rec', 'r')
    item = record.read()
    header, s = mx.recordio.unpack(item)
    print(header)
    HEADER(flag=0, label=14.0, id=20129312, id2=0)
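
    pack and unpack are inverses, so a record packed and written through MXRecordIO comes back intact. A minimal round-trip sketch ('tmp_pack.rec' is just a scratch file name chosen here):

    header = mx.recordio.IRHeader(flag=0, label=7.0, id=1, id2=0)
    payload = b'arbitrary bytes'  # stands in for a raw encoded image

    writer = mx.recordio.MXRecordIO('tmp_pack.rec', 'w')
    writer.write(mx.recordio.pack(header, payload))
    writer.close()

    reader = mx.recordio.MXRecordIO('tmp_pack.rec', 'r')
    header_out, payload_out = mx.recordio.unpack(reader.read())
    reader.close()
    assert payload_out == payload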
    

    unpack_img(s, iscolor=-1)
    Unpack an MXImageRecord to an image; iscolor is forwarded to the OpenCV decoder (-1 loads the image unchanged).

    record = mx.recordio.MXRecordIO('test.rec', 'r')
    item = record.read()
    header, img = mx.recordio.unpack_img(item)
    print(header)
    HEADER(flag=0, label=14.0, id=20129312, id2=0)
    print(img)
    array([[[ 23,  27,  45],
            [ 28,  32,  50],
            ...,
            [168, 169, 167],
            [166, 167, 165]]], dtype=uint8)
    

    pack_img(header, img, quality=95, img_fmt='.jpg')
    Pack an image into MXImageRecord.

    import cv2  # pack_img encodes the image through OpenCV

    label = 4 # label can also be a 1-D array, for example: label = [1,2,3]
    id = 2574
    header = mx.recordio.IRHeader(0, label, id, 0)
    img = cv2.imread('test.jpg')
    packed_s = mx.recordio.pack_img(header, img)
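
    Since pack_img and unpack_img mirror each other, a synthetic image can stand in for test.jpg to check the round trip. A sketch, assuming OpenCV and NumPy are installed; '.png' is chosen here because it is lossless:

    import numpy as np

    img = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)  # fake 32x32 BGR image
    header = mx.recordio.IRHeader(0, 1.0, 42, 0)
    s = mx.recordio.pack_img(header, img, img_fmt='.png')  # lossless encoding, so pixels survive exactly
    header_out, img_out = mx.recordio.unpack_img(s)
    assert (img_out == img).all()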
    

    In the rest of this post, we use the Gluon API to define a Dataset and use a DataLoader to iterate through the dataset in mini-batches.

    Introduction to Datasets

    Dataset objects are used to represent collections of data, and include methods to load and parse the data.

    We’ll use the ArrayDataset to introduce the idea of a Dataset.

    import mxnet as mx
    
    mx.random.seed(42) # Fix the seed for reproducibility
    X = mx.random.uniform(shape=(10, 3))
    y = mx.random.uniform(shape=(10, 1))
    dataset = mx.gluon.data.dataset.ArrayDataset(X, y)
    

    A key feature of a Dataset is the ability to retrieve a single sample given an index.
    Our random data and labels were generated in memory, so this ArrayDataset doesn’t have to load anything from disk, but the interface is the same for all Datasets.

    sample_idx = 4
    sample = dataset[sample_idx]
    
    assert len(sample) == 2
    assert sample[0].shape == (3, )
    assert sample[1].shape == (1, )
    

    We don’t usually retrieve individual samples from Dataset objects though (unless we’re quality checking the output samples). Instead we use a DataLoader.

    Introduction to DataLoader

    A DataLoader is used to create mini-batches of samples from a Dataset, and provides a convenient iterator interface for looping over these batches.

    A required parameter of DataLoader is the size of the mini-batches you want to create, called batch_size.

    Another benefit of using DataLoader is the ability to easily load data in parallel using multiprocessing. You can set the num_workers parameter to the number of CPUs available on your machine for maximum performance.

    from multiprocessing import cpu_count
    CPU_COUNT = cpu_count()
    
    data_loader = mx.gluon.data.DataLoader(dataset, batch_size=5, num_workers=CPU_COUNT)
    
    for X_batch, y_batch in data_loader:
        print("X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape))
    

    Our data_loader loop will stop when every sample of dataset has been returned as part of a batch.

    Sometimes the dataset length isn’t divisible by the mini-batch size, leaving a final batch with a smaller number of samples. DataLoader’s default behavior is to return this smaller mini-batch, but this can be changed by setting the last_batch parameter to discard (which ignores the last batch) or rollover (which starts the next epoch with the remaining samples).
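
    For example, with 10 samples and a batch size of 4, the default yields batches of 4, 4 and 2, while discard drops the final short batch. A quick sketch:

    small_dataset = mx.gluon.data.dataset.ArrayDataset(mx.nd.arange(10), mx.nd.arange(10))
    loader = mx.gluon.data.DataLoader(small_dataset, batch_size=4, last_batch='discard')
    print([X.shape[0] for X, y in loader])  # [4, 4] -- the trailing 2 samples are skipped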

    Machine learning with Datasets and DataLoaders

    Common use cases for loading data are covered already (e.g. mxnet.gluon.data.vision.datasets.ImageFolderDataset), but it’s simple to create your own custom Dataset classes for other types of data.
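
    Only __getitem__ and __len__ need to be implemented. A minimal sketch of a custom Dataset (the EvenOddDataset class and its data are invented for illustration):

    class EvenOddDataset(mx.gluon.data.Dataset):
        """Toy dataset: sample i is the pair (i, i % 2)."""
        def __init__(self, n):
            self._n = n

        def __getitem__(self, idx):
            return mx.nd.array([idx]), mx.nd.array([idx % 2])

        def __len__(self):
            return self._n

    toy_dataset = EvenOddDataset(100)
    data, label = toy_dataset[3]  # (array of [3.], array of [1.])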

    You can even use included Dataset objects for common datasets if you want to experiment quickly.
    Many of the image Datasets accept a function (via the optional transform parameter) which is applied to each sample returned by the Dataset. It’s useful for performing data augmentation, but can also be used for simpler tasks such as data type conversion and pixel value scaling, as seen below.

    def transform(data, label):
        data = data.astype('float32')/255
        return data, label
    
    train_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=True, transform=transform)
    valid_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=False, transform=transform)
    
    sample_idx = 234
    sample = train_dataset[sample_idx]
    data = sample[0]   # a (28, 28, 1) float32 image, scaled to [0, 1] by the transform
    label = sample[1]  # a scalar class label
    

    When training machine learning models it is important to shuffle the training samples every time you pass through the dataset (i.e. each epoch). Sometimes the order of your samples will have a spurious relationship with the target variable, and shuffling the samples helps remove this. With DataLoader it’s as simple as adding shuffle=True. You don’t need to shuffle the validation and testing data though.

    If you have more complex shuffling requirements (e.g. when handling sequential data), take a look at mxnet.gluon.data.BatchSampler and pass this to your DataLoader instead, as sketched after the code below.

    batch_size = 32
    train_data_loader = mx.gluon.data.DataLoader(train_dataset, batch_size, shuffle=True, num_workers=CPU_COUNT)
    valid_data_loader = mx.gluon.data.DataLoader(valid_dataset, batch_size, num_workers=CPU_COUNT)
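
    A sketch of the BatchSampler route mentioned above (SequentialSampler is one of the samplers shipped with Gluon; any Sampler that yields indices works, and batch_size must be left unset when batch_sampler is given):

    sampler = mx.gluon.data.SequentialSampler(len(train_dataset))
    batch_sampler = mx.gluon.data.BatchSampler(sampler, batch_size=32, last_batch='keep')
    custom_data_loader = mx.gluon.data.DataLoader(train_dataset, batch_sampler=batch_sampler)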
    

