  • How to use Datasets and DataLoader in PyTorch for custom text data

    ref:

    https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00

    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

    https://sparrow.dev/pytorch-dataloader/

    Creating a PyTorch Dataset and managing it with a DataLoader keeps your data manageable and simplifies your machine learning pipeline. A Dataset stores all your data, and a DataLoader can be used to iterate through the data, manage batches, transform the data, and much more.

    Import libraries

    import numpy as np
    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader

    Create a custom Dataset class

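    Suppose the original data are stored in numbers.csv, with two columns named sentence_a and sentence_b. Each cell holds a bracketed, comma-separated list of integers such as [3,4,5], which is the format the parsing code below expects.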
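    The contents of the file are not reproduced in this post. As a stand-in, the sketch below writes a numbers.csv whose rows are reconstructed from the batches printed in the Output section further down. The cell format ("[3,4,5]" as a string) is an assumption inferred from the parsing code, and the helper name to_cell is hypothetical.

```python
import pandas as pd

# Rows reconstructed from the batches printed later in this post.
# Assumption: trailing zeros in that output are padding, not data.
pairs = [
    ([3, 4, 5], [6, 7, 8]),
    ([2, 3, 4], [5, 6, 7]),
    ([3], [4]),
    ([5, 6, 7, 8, 9], [10, 11, 12, 13, 14]),
    ([4, 5], [6, 7]),
    ([5], [6]),
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([4, 5, 6], [7, 8, 9]),
    ([3, 4, 5], [6, 7, 8]),
]


def to_cell(values):
    """Format a list of ints as a bracketed string, e.g. [3, 4, 5] -> "[3,4,5]"."""
    return "[" + ",".join(str(v) for v in values) + "]"


frame = pd.DataFrame({
    "sentence_a": [to_cell(a) for a, _ in pairs],
    "sentence_b": [to_cell(b) for _, b in pairs],
})

# pandas quotes cells that contain commas, so each list survives as one field.
frame.to_csv("numbers.csv", index=False)
```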

    torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:

    • __len__ so that len(dataset) returns the size of the dataset.
    • __getitem__ to support indexing, so that dataset[i] returns the i-th sample.

    We will read the CSV in __init__ but leave the parsing and padding of each row to __getitem__. The raw strings stay in a DataFrame, and a sample is converted to a padded array only when it is requested.

    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super().__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # parse the string: strip the brackets and split on commas
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # convert the list to an array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # pad with zeros up to max_length
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b
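
    The three steps inside __getitem__ (parse, convert, pad) can be tried on a single cell without any file. The following is a minimal sketch, not part of the original code: the function name parse_and_pad and the pad_value parameter are hypothetical, and the truncation step is an added assumption, since np.pad rejects a negative pad width when a sequence is longer than max_length.

```python
import numpy as np


def parse_and_pad(cell, max_length, pad_value=0):
    """Turn a string like "[3,4,5]" into a fixed-length integer array."""
    tokens = cell[1:-1].split(",")             # "[3,4,5]" -> ['3', '4', '5']
    values = np.array([int(t) for t in tokens])
    values = values[:max_length]               # assumption: truncate overlong sequences
    return np.pad(values, (0, max_length - values.shape[0]),
                  mode="constant", constant_values=pad_value)


print(parse_and_pad("[3,4,5]", 10))   # [3 4 5 0 0 0 0 0 0 0]
```

    Padding every sample to the same max_length is what lets the default batching stack samples into one tensor later on.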

    Iterating through the dataset

    We can iterate over the created dataset with a plain for loop, indexing it with for i in range(len(dataset)).
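
    As a rough illustration of such a loop, the sketch below uses a condensed version of the class and a hypothetical two-row CSV held in memory (pd.read_csv accepts file-like objects, so no file on disk is needed). The _encode helper and the use of .iloc are choices made for this sketch only.

```python
import io

import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class SeqDataset(Dataset):
    """Condensed version of the dataset class, for illustration only."""

    def __init__(self, file_root, max_length):
        super().__init__()
        self.sentences = pd.read_csv(file_root)
        self.max_length = max_length

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        row = self.sentences.iloc[index]
        return self._encode(row.sentence_a), self._encode(row.sentence_b)

    def _encode(self, cell):
        # "[3,4,5]" -> array([3, 4, 5]) -> zero-padded to max_length
        values = np.array([int(x) for x in cell[1:-1].split(",")])
        return np.pad(values, (0, self.max_length - values.shape[0]))


# Hypothetical two-row file, kept in memory instead of on disk.
csv_text = 'sentence_a,sentence_b\n"[3,4,5]","[6,7,8]"\n"[3]","[4]"\n'
dataset = SeqDataset(io.StringIO(csv_text), max_length=10)

for i in range(len(dataset)):
    src, trg = dataset[i]
    print(i, src, trg)
```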

    However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

    • Batching the data
    • Shuffling the data
    • Loading the data in parallel using multiprocessing workers

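    torch.utils.data.DataLoader is an iterable that provides all these features. The parameters used below are batch_size (the number of samples per batch), shuffle (whether to reshuffle the data every epoch), and num_workers (the number of loading subprocesses, where 0 means the data is loaded in the main process). One parameter of interest is collate_fn, which specifies exactly how a list of samples is merged into a batch. The default collate function works fine for most use cases, including this one: it stacks the NumPy arrays returned by __getitem__ into tensors.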
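
    Before the basic call, here is a sketch of what a custom collate_fn might look like. Instead of padding every sample to a fixed max_length inside the dataset, it pads each batch to the length of its longest sequence using torch.nn.utils.rnn.pad_sequence. This is an alternative to the approach in this post, not the code the post uses; the names RaggedPairs and pad_collate are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class RaggedPairs(Dataset):
    """Hypothetical dataset that returns unpadded (source, target) tensors."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        src, trg = self.pairs[index]
        return torch.tensor(src), torch.tensor(trg)


def pad_collate(batch):
    # batch is a list of (src, trg) tuples produced by __getitem__
    srcs, trgs = zip(*batch)
    src_batch = pad_sequence(list(srcs), batch_first=True, padding_value=0)
    trg_batch = pad_sequence(list(trgs), batch_first=True, padding_value=0)
    return src_batch, trg_batch


pairs = [([3, 4, 5], [6, 7, 8]),
         ([3], [4]),
         ([5, 6, 7, 8, 9], [10, 11, 12, 13, 14])]
loader = DataLoader(RaggedPairs(pairs), batch_size=3, collate_fn=pad_collate)

src, trg = next(iter(loader))
print(src.shape)   # torch.Size([3, 5])
```

    Padding per batch keeps tensors only as wide as the longest sequence in that batch, at the cost of batch shapes that vary from one iteration to the next.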

    dataset = SeqDataset("./numbers.csv", 10)
    dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0, collate_fn=None)

    for batch_idx, batch in enumerate(dataloader):
        src, trg = batch
        print(src.shape, trg.shape)
        print(src)
        print(trg)

    Output:

    (deeplearning) ➜  TransformerScratch python generate_data.py
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[3, 4, 5, 0, 0, 0, 0, 0, 0, 0],
            [2, 3, 4, 0, 0, 0, 0, 0, 0, 0],
            [3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 6, 7, 8, 9, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  8,  0,  0,  0,  0,  0,  0,  0],
            [ 5,  6,  7,  0,  0,  0,  0,  0,  0,  0],
            [ 4,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [10, 11, 12, 13, 14,  0,  0,  0,  0,  0]])
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[4, 5, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0]])
    torch.Size([2, 10]) torch.Size([2, 10])
    tensor([[4, 5, 6, 0, 0, 0, 0, 0, 0, 0],
            [3, 4, 5, 0, 0, 0, 0, 0, 0, 0]])
    tensor([[7, 8, 9, 0, 0, 0, 0, 0, 0, 0],
            [6, 7, 8, 0, 0, 0, 0, 0, 0, 0]])
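
    The last batch above holds only 2 samples because the 10 rows do not divide evenly by batch_size=4. The sketch below reproduces that arithmetic with a placeholder TensorDataset of the same size (an assumption made to keep the example self-contained) and shows the drop_last option, which discards the incomplete batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 10 samples of length 10, mirroring the sizes above.
data = TensorDataset(torch.zeros(10, 10, dtype=torch.long))

keep_last = DataLoader(data, batch_size=4)
drop_last = DataLoader(data, batch_size=4, drop_last=True)

# len(loader) is the number of batches per epoch.
print(len(keep_last), [batch[0].shape[0] for batch in keep_last])   # 3 [4, 4, 2]
print(len(drop_last), [batch[0].shape[0] for batch in drop_last])   # 2 [4, 4]
```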

    Full Code

    import torch
    from torch.utils.data import Dataset, DataLoader
    import pandas as pd
    import numpy as np
    
    
    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super().__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # parse the string: strip the brackets and split on commas
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # convert the list to an array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # pad with zeros up to max_length
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b
    
    
    
    if __name__ == "__main__":
        dataset = SeqDataset("./numbers.csv", 10)
        # print(dataset.__len__())
        # print(dataset.__getitem__(0))
        # print(dataset.__getitem__(6))
    
    
        dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0,  collate_fn=None)
    
        for batch_idx, batch in enumerate(dataloader):
            src, trg = batch
            print(src.shape, trg.shape)
            print(src)
            print(trg)
            # ipdb.set_trace()
  • Original article: https://www.cnblogs.com/lfri/p/15479166.html