Torchtext Tutorial: Text Data Processing

    Torchtext

    A text data preprocessing tool

    Doc | Code

    Field

    Defines how the data is processed, converting raw data into a Tensor.

    Using Field

    from torchtext import data

    # A simple whitespace tokenizer; any callable from str to a list of tokens works
    tokenize = lambda x: x.split()

    TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
    LABEL = data.Field(sequential=False, use_vocab=False)
    
    

    Field parameters

    sequential: Whether the data represents sequential data. If False, no tokenization is applied. Default: True.
    use_vocab: Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
    init_token: A token that will be prepended to every example using this field, or None for no initial token. Default: None.
    eos_token: A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
    fix_length: Pad or truncate every sequence to this fixed length, e.g. 100. Default: None.
    dtype: The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
    preprocessing: The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many datasets replace this attribute with a custom preprocessor. Default: None.
    postprocessing: A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list and the field's Vocab. Default: None.
    lower: Whether to lowercase the text. Default: False.
    tokenize: The function used to tokenize raw strings, e.g. tokenize = lambda x: x.split(). Default: string.split.
    tokenizer_language: The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
    include_lengths: Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
    batch_first: Whether to produce tensors with the batch dimension first. Default: False.
    pad_token: The string token used as padding. Default: "<pad>".
    unk_token: The string token used to represent OOV words. Default: "<unk>".
    pad_first: Do the padding of the sequence at the beginning. Default: False.
    truncate_first: Do the truncating of the sequence at the beginning. Default: False.
    stop_words: Tokens to discard during the preprocessing step. Default: None.
    is_target: Whether this field is a target variable. Affects iteration over batches. Default: False.
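
    As a quick illustration of several of these parameters, here is a hedged sketch (the special tokens and tokenizer are illustrative choices, not from the original post):

    from torchtext import data

    # Illustrative Field showing several of the parameters listed above
    TEXT = data.Field(
        sequential=True,
        tokenize=lambda x: x.split(),  # simple whitespace tokenizer
        lower=True,
        init_token='<sos>',            # prepended to every example
        eos_token='<eos>',             # appended to every example
        include_lengths=True,          # batches become (padded tensor, lengths)
        batch_first=True,              # tensors shaped (batch, seq_len)
    )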

    Dataset

    Uses Fields to define how the data is composed and produces a dataset.

    Using Dataset

    A custom Dataset class:

    from torchtext import data
    import pandas as pd
    import numpy as np
    import random
    from tqdm import tqdm

    class MyDataset(data.Dataset):
        def __init__(self, csv_path, text_field, label_field, test=False, aug=False, **kwargs):

            csv_data = pd.read_csv(csv_path)

            # How each column of the raw data is processed; a None field is skipped
            fields = [("id", None), ("text", text_field), ("label", label_field)]

            examples = []
            if test:
                # For the test set, do not load labels
                for text in tqdm(csv_data['text']):
                    examples.append(data.Example.fromlist([None, text, None], fields))
            else:
                for text, label in tqdm(zip(csv_data['text'], csv_data['label'])):
                    # Data augmentation
                    if aug:
                        rate = random.random()
                        if rate > 0.5:
                            text = self.dropout(text)
                        else:
                            text = self.shuffle(text)
                    examples.append(data.Example.fromlist([None, text, label], fields))

            # The above is preprocessing; calling the parent constructor here
            # produces a standard Dataset
            # super(MyDataset, self).__init__(examples, fields, **kwargs)
            super(MyDataset, self).__init__(examples, fields)

        def shuffle(self, text):
            # Randomly permute the token order
            text = np.random.permutation(text.strip().split())
            return ' '.join(text)

        def dropout(self, text, p=0.5):
            # Randomly delete some tokens; replace=False avoids picking
            # the same position twice
            text = text.strip().split()
            len_ = len(text)
            indexes = np.random.choice(len_, int(len_ * p), replace=False)
            for i in indexes:
                text[i] = ''
            return ' '.join(text)
    
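    A minimal usage sketch (the file name train.csv and its text and label columns are assumptions for illustration, not from the original post):

    # Assumes a CSV file with 'text' and 'label' columns; the path is illustrative
    train = MyDataset('train.csv', text_field=TEXT, label_field=LABEL, aug=True)
    print(len(train))               # number of examples
    print(vars(train.examples[0]))  # {'text': [...tokens...], 'label': ...}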

    Iterator

    Iterators: Iterator / BucketIterator

    Iterator

    Builds batches while keeping the samples in their original order.

    BucketIterator

    Automatically groups samples of similar length into the same batch, minimizing the amount of padding needed.

    from torchtext import data

    def data_iter(train_path, valid_path, test_path, TEXT, LABEL):
        train = MyDataset(train_path, text_field=TEXT, label_field=LABEL, test=False, aug=True)
        # The validation and test sets should not be augmented
        valid = MyDataset(valid_path, text_field=TEXT, label_field=LABEL, test=False, aug=False)
        test = MyDataset(test_path, text_field=TEXT, label_field=None, test=True, aug=False)
        # Build the vocabulary from the training set
        # TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
        TEXT.build_vocab(train)
        # None unless vectors= was passed to build_vocab; see the Word Embedding section
        weight_matrix = TEXT.vocab.vectors
        # Build an iterator for the training set only
        # train_iter = data.BucketIterator(dataset=train, batch_size=8, shuffle=True, sort_within_batch=False, repeat=False)

        # Build iterators for the training and validation sets at the same time
        train_iter, val_iter = data.BucketIterator.splits(
                (train, valid),
                batch_sizes=(8, 8),
                # If using a GPU, replace -1 with the GPU's device number
                device=-1,
                # Key used to sort examples by length
                sort_key=lambda x: len(x.text),
                sort_within_batch=False,
                repeat=False
        )
        test_iter = data.Iterator(test, batch_size=8, device=-1, sort=False, sort_within_batch=False, repeat=False)
        return train_iter, val_iter, test_iter, weight_matrix
    
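    Consuming the iterators then looks like the sketch below; the attribute names match the fields defined above, while the file paths and loop body are only an illustrative outline:

    train_iter, val_iter, test_iter, weight_matrix = data_iter(
        'train.csv', 'valid.csv', 'test.csv', TEXT, LABEL)  # paths are illustrative

    for batch in train_iter:
        x = batch.text    # LongTensor of token indices, shape (seq_len, batch)
        y = batch.label   # label tensor, shape (batch,)
        # ... forward pass, loss computation, backward pass ...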

    Word Embedding

    When doing NLP tasks with a neural network framework such as PyTorch or TensorFlow, word vectors can be handled through the framework's Embedding layer. Using pretrained word vectors usually gives better performance. The following shows how to use pretrained word vectors in torchtext and then pass them to a neural network model for training.

    Pretrained word vectors supported by torchtext out of the box

    The corresponding pretrained vector files are downloaded automatically into the .vector_cache directory under the current folder; .vector_cache is the default directory for word vector files and cache files.

    from torchtext.vocab import GloVe
    from torchtext import data

    TEXT = data.Field(sequential=True)
    # The following two ways of specifying pretrained vectors are equivalent
    # TEXT.build_vocab(train, vectors="glove.6B.300d")
    TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
    # In this case glove.6B.zip is downloaded by default and unpacked into
    # glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt
    
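    A quick hedged check that the vectors were attached to the vocabulary (the shapes assume the 300-dimensional GloVe vectors above):

    # Rows of vocab.vectors are aligned with the indices in vocab.stoi
    print(TEXT.vocab.vectors.shape)     # torch.Size([vocab_size, 300])
    idx = TEXT.vocab.stoi['the']        # index of a word in the vocabulary
    print(TEXT.vocab.vectors[idx][:5])  # first few dimensions of its vector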

    External pretrained word vectors

    Specify the pretrained vector file with the name parameter and the directory containing it with the cache parameter.

    from torchtext.vocab import Vectors

    cache = '.vector_cache'
    vectors = Vectors(name='myvector/glove/glove.6B.200d.txt', cache=cache)
    TEXT.build_vocab(train, vectors=vectors)
    
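    One detail worth adding (not in the original post): words in the vocabulary that are missing from the vector file get zero vectors by default; Vectors accepts an unk_init callable to initialize them differently:

    import torch
    from torchtext.vocab import Vectors

    # unk_init is applied to the vector of every word not found in the file
    vectors = Vectors(name='myvector/glove/glove.6B.200d.txt',
                      cache='.vector_cache',
                      unk_init=torch.Tensor.normal_)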

    Setting the Embedding layer weights in the model

    import torch.nn as nn

    # The weights live in the vocabulary's vectors attribute
    weight_matrix = TEXT.vocab.vectors
    # Size the PyTorch Embedding layer from the pretrained matrix
    vocab_size, embedding_dim = weight_matrix.shape
    embedding = nn.Embedding(vocab_size, embedding_dim)
    # Initialize the embedding matrix with the pretrained weights
    embedding.weight.data.copy_(weight_matrix)
    
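    Equivalently, PyTorch's nn.Embedding.from_pretrained builds the layer and copies the weights in one step (this shortcut is an addition, not from the original post); freeze=True keeps the vectors fixed during training:

    import torch.nn as nn

    # from_pretrained sizes the layer from the matrix automatically;
    # pass freeze=False to fine-tune the vectors during training
    embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=True)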
Source: https://www.cnblogs.com/linzhenyu/p/13277552.html