  • How to use Datasets and DataLoader in PyTorch for custom text data

    ref:

    https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00

    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

    https://sparrow.dev/pytorch-dataloader/

    Creating a PyTorch Dataset and managing it with a DataLoader keeps your data manageable and helps to simplify your machine learning pipeline. A Dataset stores all your data, and a DataLoader can be used to iterate through the data, manage batches, transform the data, and much more.

    Import libraries

    import numpy as np
    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader

    Create a custom Dataset class

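    Suppose the original data are stored in numbers.csv, with two columns, sentence_a and sentence_b, where each cell holds a bracketed, comma-separated list of integers such as [3,4,5].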
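The file contents are not reproduced here. As a sketch, the snippet below writes a numbers.csv whose rows are reconstructed from the batches printed in the Output section; the exact cell formatting (no spaces inside the brackets) and the helper name to_cell are assumptions.

```python
import pandas as pd

# Rows reconstructed from the printed batches; the original file's exact
# formatting is an assumption.
pairs = [
    ([3, 4, 5], [6, 7, 8]),
    ([2, 3, 4], [5, 6, 7]),
    ([3], [4]),
    ([5, 6, 7, 8, 9], [10, 11, 12, 13, 14]),
    ([4, 5], [6, 7]),
    ([5], [6]),
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([4, 5, 6], [7, 8, 9]),
    ([3, 4, 5], [6, 7, 8]),
]


def to_cell(values):
    """Hypothetical helper: format a list of ints as a string like "[3,4,5]"."""
    return "[" + ",".join(str(v) for v in values) + "]"


df = pd.DataFrame({
    "sentence_a": [to_cell(a) for a, _ in pairs],
    "sentence_b": [to_cell(b) for _, b in pairs],
})
# Cells contain commas, so pandas quotes them when writing the CSV.
df.to_csv("numbers.csv", index=False)
```

Reading the file back with pd.read_csv("numbers.csv") should give each cell as a plain string, which is what the slicing and split(",") in __getitem__ expect.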

    torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:

    • __len__ so that len(dataset) returns the size of the dataset.
    • __getitem__ to support indexing, so that dataset[i] returns the i-th sample.
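As a minimal sketch of that contract, the toy class below implements only these two methods. SquaresDataset is a hypothetical name used for illustration and is not part of the original code.

```python
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Makes len(dataset) work.
        return self.n

    def __getitem__(self, index):
        # Makes dataset[index] work; out-of-range indices raise IndexError.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index, index * index


toy = SquaresDataset(5)
print(len(toy))
print(toy[3])
```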

    We will read the CSV once in __init__ but leave the per-sample work (parsing the strings, converting them to arrays, and padding to max_length) to __getitem__, so each sample is processed only when it is requested.

    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super().__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # Parse the string: strip the brackets and split on commas
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # Convert the list of strings to an integer array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # Pad with zeros on the right up to max_length
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b
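Two edge cases are worth keeping in mind with the code above: an empty cell such as "[]" splits into [''], which int() cannot parse, and a row longer than max_length gives np.pad a negative pad width, which raises a ValueError. The sketch below shows one way to handle both. The helper name parse_and_pad and the choice to truncate long rows are assumptions, not part of the original code.

```python
import numpy as np


def parse_and_pad(cell, max_length, pad_value=0):
    """Hypothetical helper: turn a string like "[3,4,5]" into a fixed-length array."""
    inner = cell.strip()[1:-1]  # drop the surrounding brackets
    # Skip empty tokens so that "[]" yields an empty list instead of failing.
    tokens = [int(tok) for tok in inner.split(",") if tok.strip()]
    tokens = tokens[:max_length]  # assumed policy: truncate rows that are too long
    padded = np.full(max_length, pad_value, dtype=np.int64)
    padded[:len(tokens)] = tokens
    return padded


print(parse_and_pad("[3,4,5]", 10))
print(parse_and_pad("[]", 4))
print(parse_and_pad("[1,2,3,4]", 2))
```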

    Iterating through the dataset

    We can iterate over the created dataset with a plain for loop over its indices, for example for i in range(len(dataset)).
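A sketch of such a loop is shown below. To keep it runnable without a CSV file, it uses TinyPairs, a hypothetical in-memory stand-in for SeqDataset that pads in the same way.

```python
import numpy as np
from torch.utils.data import Dataset


class TinyPairs(Dataset):
    """Hypothetical in-memory stand-in for SeqDataset."""

    def __init__(self, pairs, max_length):
        self.pairs = pairs
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def _pad(self, seq):
        # np.pad defaults to constant mode with a fill value of 0.
        return np.pad(np.array(seq), (0, self.max_length - len(seq)))

    def __getitem__(self, index):
        a, b = self.pairs[index]
        return self._pad(a), self._pad(b)


dataset = TinyPairs([([3, 4, 5], [6, 7, 8]), ([3], [4])], max_length=6)

# One sample at a time: no batching, no shuffling, no parallel loading.
for i in range(len(dataset)):
    src, trg = dataset[i]
    print(i, src, trg)
```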

    However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

    • Batching the data
    • Shuffling the data
    • Loading the data in parallel using multiprocessing workers

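    torch.utils.data.DataLoader is an iterable that provides all these features. The parameters used below are batch_size (the number of samples per batch), shuffle (whether to reshuffle the data every epoch), and num_workers (the number of worker subprocesses; 0 loads the data in the main process). One parameter of interest is collate_fn, which specifies exactly how a list of samples is merged into a batch. The default collate function, used when collate_fn=None, works fine for most use cases; here it stacks the equal-length NumPy arrays into tensors.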

    dataset = SeqDataset("./numbers.csv", 10)
    dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0, collate_fn=None)

    for batch_idx, batch in enumerate(dataloader):
        src, trg = batch
        print(src.shape, trg.shape)
        print(src)
        print(trg)

    Output:

    (deeplearning) ➜  TransformerScratch python generate_data.py
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[3, 4, 5, 0, 0, 0, 0, 0, 0, 0],
            [2, 3, 4, 0, 0, 0, 0, 0, 0, 0],
            [3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 6, 7, 8, 9, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  8,  0,  0,  0,  0,  0,  0,  0],
            [ 5,  6,  7,  0,  0,  0,  0,  0,  0,  0],
            [ 4,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [10, 11, 12, 13, 14,  0,  0,  0,  0,  0]])
    torch.Size([4, 10]) torch.Size([4, 10])
    tensor([[4, 5, 0, 0, 0, 0, 0, 0, 0, 0],
            [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0],
            [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]])
    tensor([[ 6,  7,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0],
            [ 6,  7,  8,  9, 10,  0,  0,  0,  0,  0]])
    torch.Size([2, 10]) torch.Size([2, 10])
    tensor([[4, 5, 6, 0, 0, 0, 0, 0, 0, 0],
            [3, 4, 5, 0, 0, 0, 0, 0, 0, 0]])
    tensor([[7, 8, 9, 0, 0, 0, 0, 0, 0, 0],
            [6, 7, 8, 0, 0, 0, 0, 0, 0, 0]])
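In the output above every row is padded to the fixed max_length of 10. A custom collate_fn offers an alternative: return unpadded sequences from the dataset and pad each batch only to its own longest sequence. The sketch below illustrates the idea with torch.nn.utils.rnn.pad_sequence; RaggedPairs and pad_collate are hypothetical names, and this is a swapped-in technique rather than what the original code does.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class RaggedPairs(Dataset):
    """Hypothetical dataset that returns unpadded 1-D tensors."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        a, b = self.pairs[index]
        return torch.tensor(a), torch.tensor(b)


def pad_collate(samples):
    """Pad each side of the batch to the longest sequence in that batch."""
    src, trg = zip(*samples)
    src = pad_sequence(list(src), batch_first=True, padding_value=0)
    trg = pad_sequence(list(trg), batch_first=True, padding_value=0)
    return src, trg


pairs = [
    ([3, 4, 5], [6, 7, 8]),
    ([2, 3, 4], [5, 6, 7]),
    ([3], [4]),
    ([5, 6, 7, 8, 9], [10, 11, 12, 13, 14]),
]
loader = DataLoader(RaggedPairs(pairs), batch_size=2, collate_fn=pad_collate)

for src, trg in loader:
    print(src.shape, trg.shape)
```

With this approach the batch width varies from batch to batch, which saves computation on short sequences but means downstream code cannot assume a fixed sequence length.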

    Full Code

    import torch
    from torch.utils.data import Dataset, DataLoader
    import pandas as pd
    import numpy as np
    
    
    class SeqDataset(Dataset):
        def __init__(self, file_root, max_length) -> None:
            super().__init__()
    
            self.sentences = pd.read_csv(file_root)
            self.max_length = max_length
    
        def __len__(self):
            return len(self.sentences)
        
        def __getitem__(self, index):
            # Parse the string: strip the brackets and split on commas
            sentence_a = self.sentences.sentence_a[index][1:-1].split(",")
            sentence_b = self.sentences.sentence_b[index][1:-1].split(",")
            # ['3', '4', '5']
            # ['6', '7', '8']
    
            # Convert the list of strings to an integer array
            sentence_a = np.array([int(x) for x in sentence_a])
            sentence_b = np.array([int(x) for x in sentence_b])
            # array([3, 4, 5])
            # array([6, 7, 8])
    
            # Pad with zeros on the right up to max_length
            sentence_a = np.pad(sentence_a, (0, self.max_length-sentence_a.shape[0]), 'constant', constant_values=(0,0))
            sentence_b = np.pad(sentence_b, (0, self.max_length-sentence_b.shape[0]), 'constant', constant_values=(0,0))
            # array([3, 4, 5, 0, 0, 0, 0, 0, 0, 0])
            # array([6, 7, 8, 0, 0, 0, 0, 0, 0, 0])
    
            return sentence_a, sentence_b
    
    
    
    if __name__ == "__main__":
        dataset = SeqDataset("./numbers.csv", 10)
        # print(dataset.__len__())
        # print(dataset.__getitem__(0))
        # print(dataset.__getitem__(6))
    
    
        dataloader = DataLoader(dataset, batch_size=4,
                            shuffle=False, num_workers=0,  collate_fn=None)
    
        for batch_idx, batch in enumerate(dataloader):
            src, trg = batch
            print(src.shape, trg.shape)
            print(src)
            print(trg)
            # ipdb.set_trace()
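The final batch in the output holds only 2 samples because 10 rows do not divide evenly by a batch size of 4. The sketch below shows how the drop_last argument changes the number of batches, and how passing a seeded torch.Generator makes shuffle=True reproducible. It uses a TensorDataset of ten dummy values as a stand-in for the CSV data, which is an assumption made to keep the example self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten dummy samples stand in for the ten rows of numbers.csv.
data = TensorDataset(torch.arange(10))

keep_last = DataLoader(data, batch_size=4, shuffle=False)
drop_last = DataLoader(data, batch_size=4, shuffle=False, drop_last=True)

# len() of a DataLoader is the number of batches per epoch.
print(len(keep_last), len(drop_last))
print([batch[0].shape[0] for batch in keep_last])

# Shuffling visits every sample exactly once per epoch, in a random order;
# the seeded generator makes that order repeatable across runs.
generator = torch.Generator().manual_seed(0)
shuffled = DataLoader(data, batch_size=4, shuffle=True, generator=generator)
seen = torch.cat([batch[0] for batch in shuffled])
print(seen)
```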
  • Original article: https://www.cnblogs.com/lfri/p/15479166.html